MapReduce: Contributions to hot GitHub projects by timezone
Input Data
The input covers eight repositories: activemq, hadoop, hbase, hive, kafka, spark, storm, zookeeper.
The analysis is run on each repo individually and on the combination of them all.
Run the following in each repo to get its commit history, and put the resulting files into a folder named input:
git log > xx.log
From each commit I need the author email, the date and time, and the time zone. The whole analysis rests on the hypothesis that all contributors have set their timezone correctly. This is a typical commit record:
commit adf1cdf3d5abed0dae76a4967129ce64f2e16c2f
Author: Chris Nauroth <cnauroth@apache.org>
Date: Thu Mar 10 14:49:08 2016 -0800
HADOOP-12899. External distribution stitching scripts do not work correctly on Windows.
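As an illustration of pulling the needed fields out of such a record, here is a minimal plain-Java sketch. The class name `CommitRecordParser` and its regular expressions are assumptions made for this example, not the project's actual parsing code, and the sketch keeps only the whole-hour part of the UTC offset.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the author email and the UTC offset from one commit record.
final class CommitRecordParser {
    // Captures the email between angle brackets on the "Author:" line.
    private static final Pattern AUTHOR =
        Pattern.compile("(?m)^Author:\\s.*<([^>]+)>\\s*$");

    // Captures the sign and the hour digits of the offset at the end of the "Date:" line,
    // e.g. "-" and "08" from "Date: Thu Mar 10 14:49:08 2016 -0800".
    private static final Pattern DATE =
        Pattern.compile("(?m)^Date:\\s+.*\\s([+-])(\\d{2})\\d{2}\\s*$");

    static String email(String record) {
        Matcher m = AUTHOR.matcher(record);
        if (!m.find()) {
            throw new IllegalArgumentException("no Author line in record");
        }
        return m.group(1);
    }

    // Assumption: only whole hours matter, so the minutes of offsets like +0530 are dropped.
    static int timezoneHours(String record) {
        Matcher m = DATE.matcher(record);
        if (!m.find()) {
            throw new IllegalArgumentException("no Date line in record");
        }
        int hours = Integer.parseInt(m.group(2));
        return m.group(1).equals("-") ? -hours : hours;
    }

    public static void main(String[] args) {
        String record = "commit adf1cdf3d5abed0dae76a4967129ce64f2e16c2f\n"
            + "Author: Chris Nauroth <cnauroth@apache.org>\n"
            + "Date:   Thu Mar 10 14:49:08 2016 -0800\n"
            + "\n"
            + "    HADOOP-12899. External distribution stitching scripts do not work correctly on Windows.\n";
        // Prints: -8 cnauroth@apache.org
        System.out.println(timezoneHours(record) + " " + email(record));
    }
}
```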
First MapReduce
Custom Delimiter
// Treat each commit, rather than each line, as one input record.
Configuration conf = new Configuration();
conf.set("textinputformat.record.delimiter", "\ncommit ");
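To see what this delimiter does to a log file, the following plain-Java sketch imitates the record splitting with `String.split`. It does not use Hadoop, and `DelimiterDemo` is a hypothetical name. Note one consequence of the delimiter starting with a newline: the first record keeps its leading "commit " prefix, while later records start directly with the commit hash.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo: splits a git log dump the way the custom record delimiter would.
final class DelimiterDemo {
    static final String DELIMITER = "\ncommit ";

    static List<String> splitRecords(String log) {
        // Pattern.quote makes split() treat the delimiter as a literal string.
        return Arrays.asList(log.split(Pattern.quote(DELIMITER)));
    }

    public static void main(String[] args) {
        String log = "commit aaa\nAuthor: A <a@example.org>\n\n    first message\n"
            + "\ncommit bbb\nAuthor: B <b@example.org>\n\n    second message\n";
        for (String record : splitRecords(log)) {
            System.out.println("--- record ---");
            System.out.println(record);
        }
    }
}
```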
Mapper
-8:0:cnauroth@apache.org 1
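Judging from the sample line above, the mapper appears to emit one pair per commit: a key made of the timezone, the overtime flag and the author email, plus the value 1. The rule that decides the overtime flag is not stated here, so this sketch takes the flag as an input. `MapperOutputDemo` and its methods are hypothetical names, and the text form of the key is only an illustration of the layout.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical demo: builds the "timezone:overtime:email" key and the count of 1 for one commit.
final class MapperOutputDemo {
    static String key(int timezone, boolean overtime, String email) {
        // The overtime flag is written as 1 or 0, matching the sample output.
        return timezone + ":" + (overtime ? 1 : 0) + ":" + email;
    }

    static Map.Entry<String, Long> emit(int timezone, boolean overtime, String email) {
        return new SimpleEntry<>(key(timezone, overtime, email), 1L);
    }

    public static void main(String[] args) {
        Map.Entry<String, Long> pair = emit(-8, false, "cnauroth@apache.org");
        // Prints: -8:0:cnauroth@apache.org 1
        System.out.println(pair.getKey() + " " + pair.getValue());
    }
}
```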
Custom Partitioning
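public class TimezonePartitioner extends Partitioner<TripleKey, LongWritable> {
    @Override
    public int getPartition(TripleKey key, LongWritable value, int numReduceTasks) {
        // Two partitions per timezone: even index for overtime, odd index otherwise.
        // numReduceTasks must be large enough to cover every index returned here.
        if (key.getOvertime()) {
            return (key.getTimezone() + 12) * 2;
        } else {
            return (key.getTimezone() + 12) * 2 + 1;
        }
    }
}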
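The partition arithmetic can be checked by hand with a small plain-Java sketch that mirrors the formula above. `PartitionIndexDemo` is a hypothetical name. Since UTC offsets range from -12 to +14, the largest index this formula can produce is 53, so the job would need at least 54 reduce tasks for every index to be valid; Hadoop expects the returned index to be smaller than the number of reduce tasks.

```java
// Hypothetical demo: reproduces the partition formula outside Hadoop.
final class PartitionIndexDemo {
    static int partitionFor(int timezone, boolean overtime) {
        int base = (timezone + 12) * 2;   // shift so that UTC-12 maps to index 0
        return overtime ? base : base + 1;
    }

    // Smallest reducer count that keeps every index in range for offsets up to maxTimezone.
    static int requiredReducers(int maxTimezone) {
        return partitionFor(maxTimezone, false) + 1;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(-8, true));   // 8
        System.out.println(partitionFor(-8, false));  // 9
        System.out.println(requiredReducers(14));     // 54
    }
}
```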
Reducer
-8:0:cnauroth@apache.org 10
-8:0:txfe@gmail.com 45
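The reducer output above is consistent with summing the 1s emitted for each key, which gives the number of commits per timezone, overtime flag and author. A plain-Java sketch of that aggregation follows; it does not use Hadoop, and `CommitCountReducerDemo` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo: counts how many times each mapper key occurs.
final class CommitCountReducerDemo {
    static Map<String, Long> countPerKey(List<String> mapperKeys) {
        Map<String, Long> counts = new TreeMap<>();
        for (String key : mapperKeys) {
            // Each occurrence of a key stands for one commit with value 1.
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "-8:0:cnauroth@apache.org",
            "-8:0:txfe@gmail.com",
            "-8:0:cnauroth@apache.org");
        countPerKey(keys).forEach((key, count) -> System.out.println(key + " " + count));
    }
}
```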
Second MapReduce
The output of the first MapReduce job is used as the input.
Mapper
Custom Grouping
public static class FirstTwoOnlyComparator extends WritableComparator {
    public FirstTwoOnlyComparator() {
        super(TripleKey.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare the timezone, stored as the first 4 bytes of the serialized key.
        int i1 = readInt(b1, s1);
        int i2 = readInt(b2, s2);
        int compare1 = Integer.compare(i1, i2);
        if (compare1 != 0) {
            return compare1;
        }
        // Compare the overtime flag, stored in the following byte; the email is ignored.
        boolean first1 = b1[s1 + 4] == 1;
        boolean first2 = b2[s2 + 4] == 1;
        return Boolean.compare(first1, first2);
    }
}
job.setGroupingComparatorClass(TripleKey.FirstTwoOnlyComparator.class);
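The comparator above works on serialized bytes, so it only groups correctly if the key is written as a 4-byte int, then a 1-byte boolean, then the email. The plain-Java sketch below assumes that layout (the actual `TripleKey.write` method is not shown in this document) and uses the hypothetical name `GroupingBytesDemo` to show that two keys differing only in email compare as equal.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical demo: compares only the first two fields of an assumed key layout.
final class GroupingBytesDemo {
    // Assumed layout: int timezone, boolean overtime, then the email.
    static byte[] serialize(int timezone, boolean overtime, String email) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(timezone);      // 4 bytes, big-endian
            out.writeBoolean(overtime);  // 1 byte, 1 for true and 0 for false
            out.writeUTF(email);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static int compareFirstTwo(byte[] b1, int s1, byte[] b2, int s2) {
        int tz1 = ByteBuffer.wrap(b1, s1, 4).getInt();
        int tz2 = ByteBuffer.wrap(b2, s2, 4).getInt();
        int byTimezone = Integer.compare(tz1, tz2);
        if (byTimezone != 0) {
            return byTimezone;
        }
        return Boolean.compare(b1[s1 + 4] == 1, b2[s2 + 4] == 1);
    }

    public static void main(String[] args) {
        byte[] a = serialize(-8, false, "cnauroth@apache.org");
        byte[] b = serialize(-8, false, "txfe@gmail.com");
        System.out.println(compareFirstTwo(a, 0, b, 0)); // 0: same group
    }
}
```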
Reducer
public void reduce(TripleKey key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
    long sum = 0;   // total commits in this (timezone, overtime) group
    long size = 0;  // number of values, one per author
    for (LongWritable val : values) {
        sum += val.get();
        size += 1;
    }
    result.setCommitNumber(sum);
    result.setAuthorNumber(size);
    result_key.setTimezone(key.getTimezone());
    result_key.setOvertime(key.getOvertime());
    context.write(result_key, result);
}
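Because the first job already produces exactly one line per author within a timezone and overtime flag, counting the values in a group gives the number of authors, and summing them gives the number of commits. The plain-Java sketch below replays that logic on text lines in the first job's output format. It does not use Hadoop, and `SecondJobDemo` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo: groups first-job lines by timezone and overtime flag.
final class SecondJobDemo {
    static List<String> summarize(List<String> firstJobLines) {
        // group key -> {author count, commit sum}
        Map<String, long[]> groups = new LinkedHashMap<>();
        for (String line : firstJobLines) {
            String[] keyAndCount = line.trim().split("\\s+");
            String[] keyParts = keyAndCount[0].split(":", 3); // timezone, overtime, email
            String group = keyParts[0] + ":" + keyParts[1];
            long[] totals = groups.computeIfAbsent(group, g -> new long[2]);
            totals[0] += 1;                                   // one line per author
            totals[1] += Long.parseLong(keyAndCount[1]);      // that author's commits
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, long[]> e : groups.entrySet()) {
            out.add(e.getKey() + " " + e.getValue()[0] + ":" + e.getValue()[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "-8:0:cnauroth@apache.org 10",
            "-8:0:txfe@gmail.com 45");
        // Prints: -8:0 2:55
        summarize(lines).forEach(System.out::println);
    }
}
```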
The final output is as follows. Each line contains the timezone, the overtime flag, author_number and commit_number.
-10:0 4:6
-10:1 2:8
-8:0 654:5574
-8:1 357:2177
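As one example of consuming this format, the plain-Java sketch below parses such lines and adds up the commits per timezone across both overtime values. It only illustrates the line format. It is not the post-processing used for the charts, and `FinalOutputDemo` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo: totals commit_number per timezone from final output lines.
final class FinalOutputDemo {
    static Map<Integer, Long> commitsPerTimezone(List<String> lines) {
        Map<Integer, Long> totals = new TreeMap<>();
        for (String line : lines) {
            // Line format: "timezone:overtime author_number:commit_number"
            String[] keyAndValue = line.trim().split("\\s+");
            int timezone = Integer.parseInt(keyAndValue[0].split(":")[0]);
            long commits = Long.parseLong(keyAndValue[1].split(":")[1]);
            totals.merge(timezone, commits, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "-10:0 4:6", "-10:1 2:8", "-8:0 654:5574", "-8:1 357:2177");
        // Prints: {-10=14, -8=7751}
        System.out.println(commitsPerTimezone(lines));
    }
}
```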
Post Processing
I use highchart.js to collect the data and draw the pie charts and bar charts automatically.
Charts