MapReduce: Contributions to Popular GitHub Projects by Time Zone
Input Data
The input is the commit history of eight repositories: activemq, hadoop, hbase, hive, kafka, spark, storm and zookeeper.
The analysis is run on each repository separately and on all of them combined.
Run git log > xx.log in each repository to dump its commit history, and put the resulting files into a folder named input.
From each commit I need the author email, the date and time, and the time zone. The whole analysis rests on the assumption that every contributor has set their time zone correctly. This is a typical commit record.
commit adf1cdf3d5abed0dae76a4967129ce64f2e16c2f
Author: Chris Nauroth <cnauroth@apache.org>
Date: Thu Mar 10 14:49:08 2016 -0800
HADOOP-12899. External distribution stitching scripts do not work correctly on Windows.
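The three fields can be pulled out of such a record with two regular expressions and a date pattern. The sketch below is only an illustration, not the project's mapper: the class CommitRecordParser and its method names are hypothetical, and it assumes git's default date layout as in the record above.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the fields the analysis needs from one git log record. */
final class CommitRecordParser {
    // "Author: Name <email>" -> captures the email between the angle brackets.
    private static final Pattern AUTHOR = Pattern.compile("(?m)^Author:\\s*.*<([^>]+)>");
    // "Date: Thu Mar 10 14:49:08 2016 -0800" -> captures everything after "Date:".
    private static final Pattern DATE = Pattern.compile("(?m)^Date:\\s*(.+)$");
    // git's default date layout; "xx" parses offsets written like -0800.
    private static final DateTimeFormatter GIT_DATE =
            DateTimeFormatter.ofPattern("EEE MMM d HH:mm:ss yyyy xx", Locale.ENGLISH);

    /** Returns the author email, or null if the record has no Author line. */
    static String authorEmail(String record) {
        Matcher m = AUTHOR.matcher(record);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the commit time with its UTC offset, or null if there is no Date line. */
    static OffsetDateTime commitTime(String record) {
        Matcher m = DATE.matcher(record);
        return m.find() ? OffsetDateTime.parse(m.group(1).trim(), GIT_DATE) : null;
    }
}
```

For the record above, authorEmail returns cnauroth@apache.org, and commitTime returns a value whose local hour is 14 and whose offset is -08:00.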
First MapReduce
Custom Delimiter
Configuration conf = new Configuration();
// Split the input on commit boundaries so that each record is one whole commit, not one line.
conf.set("textinputformat.record.delimiter", "\ncommit ");
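To see what this delimiter does to the input, the following plain-Java sketch splits a log the same way, without Hadoop. RecordSplitDemo is a hypothetical name, and the sketch only mimics the splitting, not Hadoop's record reader.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: splits raw git log text on the "\ncommit " delimiter. */
final class RecordSplitDemo {
    static List<String> splitRecords(String log) {
        List<String> records = new ArrayList<>();
        // The delimiter contains no regex metacharacters, so split() treats it literally.
        for (String part : log.split("\ncommit ")) {
            if (!part.isEmpty()) {
                records.add(part);
            }
        }
        return records;
    }
}
```

Note that the first record keeps its leading "commit " prefix because no newline precedes it, while every later record starts directly with the commit hash. Any code that reads the hash has to allow for both forms.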
Mapper
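The mapper emits one line per commit. The key is timezone:overtime:author email and the value is 1.
-8:0:cnauroth@apache.org 1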
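A key of this shape could be built as in the sketch below. The write-up does not say how overtime is decided, so the rule used here (weekends, or any time outside 09:00 to 18:00 local time) is purely an assumption, as are the class name MapperKeySketch and the truncation of half-hour offsets to whole hours.

```java
import java.time.DayOfWeek;
import java.time.OffsetDateTime;

/** Hypothetical sketch of how the mapper key could be assembled. */
final class MapperKeySketch {
    /** ASSUMED rule: weekends and hours outside 09:00-18:00 local time count as overtime. */
    static boolean isOvertime(OffsetDateTime t) {
        DayOfWeek day = t.getDayOfWeek();
        boolean weekend = day == DayOfWeek.SATURDAY || day == DayOfWeek.SUNDAY;
        return weekend || t.getHour() < 9 || t.getHour() >= 18;
    }

    /** Builds "timezone:overtime:email"; the offset is truncated to whole hours (assumption). */
    static String key(OffsetDateTime t, String email) {
        int timezone = t.getOffset().getTotalSeconds() / 3600;
        return timezone + ":" + (isOvertime(t) ? 1 : 0) + ":" + email;
    }
}
```

Under this assumed rule, a commit made on Thursday at 14:49 with offset -0800 yields the key -8:0:cnauroth@apache.org.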
Custom Partitioning
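// Routes every (timezone, overtime) pair to its own partition.
public class TimezonePartitioner extends Partitioner<TripleKey, LongWritable> {
    @Override
    public int getPartition(TripleKey key, LongWritable value, int numReduceTasks) {
        // Shift the timezone by 12 so the index is never negative, then use
        // even slots for overtime commits and odd slots for the rest.
        // numReduceTasks is not consulted, so the job must run with enough reducers.
        if (key.getOvertime()) {
            return (key.getTimezone() + 12) * 2;
        } else {
            return (key.getTimezone() + 12) * 2 + 1;
        }
    }
}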
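A worked example of the partition arithmetic may help. The sketch repeats the formula on plain values; PartitionIndexDemo is a hypothetical name, and the range of whole-hour offsets from UTC-12 to UTC+14 is an assumption about the data.

```java
/** Hypothetical demo of the partition index formula used by TimezonePartitioner. */
final class PartitionIndexDemo {
    static int partition(int timezone, boolean overtime) {
        int base = (timezone + 12) * 2; // even slot for this timezone
        return overtime ? base : base + 1;
    }

    /** Smallest reducer count that covers every index for offsets in [minTz, maxTz]. */
    static int reducersNeeded(int minTz, int maxTz) {
        if (minTz < -12) {
            throw new IllegalArgumentException("offsets below -12 give a negative partition");
        }
        return partition(maxTz, false) + 1;
    }
}
```

With these values, partition(-8, false) is 9, partition(-12, true) is 0 and partition(14, false) is 53, so the job would presumably need at least 54 reduce tasks. Hadoop rejects a partition index that is negative or not smaller than the number of reduce tasks.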
Reducer
-8:0:cnauroth@apache.org 10
-8:0:txfe@gmail.com 45
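Judging by the sample output, the first reducer adds up the 1s emitted for each key. A minimal plain-Java sketch of that aggregation, with the hypothetical name CommitCountSketch and no Hadoop types, could look like this:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the first job: counts commits per mapper key. */
final class CommitCountSketch {
    static Map<String, Long> countPerKey(List<String> mapperKeys) {
        Map<String, Long> totals = new TreeMap<>();
        for (String key : mapperKeys) {
            // Each mapper record contributes 1 to its key's total.
            totals.merge(key, 1L, Long::sum);
        }
        return totals;
    }
}
```

The result has one entry per (timezone, overtime, email) combination, which is what the second job relies on when it counts authors.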
Second MapReduce
The output of the first MapReduce job is used as the input.
Mapper
Custom Grouping
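// Groups keys by their first two fields (timezone and overtime) and ignores the rest of the key.
public static class FirstTwoOnlyComparator extends WritableComparator {
    public FirstTwoOnlyComparator() {
        super(TripleKey.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // The first four bytes of each serialized key hold the timezone as an int.
        int i1 = readInt(b1, s1);
        int i2 = readInt(b2, s2);
        int compare1 = Integer.compare(i1, i2);
        if (compare1 != 0) {
            return compare1;
        }
        // The fifth byte holds the overtime flag (1 means true).
        boolean first1 = b1[s1 + 4] == 1;
        boolean first2 = b2[s2 + 4] == 1;
        return Boolean.compare(first1, first2);
    }
}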
job.setGroupingComparatorClass(TripleKey.FirstTwoOnlyComparator.class);
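The effect of comparing only the leading bytes can be tried outside Hadoop. The sketch below assumes the serialized layout implied by the comparator (a 4-byte big-endian int, a 1-byte boolean, then the email); GroupingDemo and its methods are hypothetical names, and the way the email is encoded here is an assumption that does not affect the comparison.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: keys that differ only in the email compare as equal. */
final class GroupingDemo {
    /** Serializes a key as int timezone, one flag byte, then the email bytes (assumed layout). */
    static byte[] serialize(int timezone, boolean overtime, String email) {
        byte[] emailBytes = email.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(5 + emailBytes.length)
                .putInt(timezone)
                .put((byte) (overtime ? 1 : 0))
                .put(emailBytes)
                .array();
    }

    /** Compares the timezone first, then the overtime flag; later bytes are never read. */
    static int compareFirstTwo(byte[] a, byte[] b) {
        int byTimezone = Integer.compare(ByteBuffer.wrap(a).getInt(0), ByteBuffer.wrap(b).getInt(0));
        if (byTimezone != 0) {
            return byTimezone;
        }
        return Boolean.compare(a[4] == 1, b[4] == 1);
    }
}
```

Because two keys with the same timezone and overtime flag compare as 0, all authors in that group reach a single reduce call.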
Reducer
// Called once per (timezone, overtime) group; each value is one author's commit count.
public void reduce(TripleKey key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
    long sum = 0;   // total commits in the group
    long size = 0;  // number of authors in the group
    for (LongWritable val : values) {
        sum += val.get();
        size += 1;
    }
    result.setCommitNumber(sum);
    result.setAuthorNumber(size);
    result_key.setTimezone(key.getTimezone());
    result_key.setOvertime(key.getOvertime());
    context.write(result_key, result);
}
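The same aggregation can be simulated on the text output of the first job. This is a sketch in plain Java rather than the reducer itself; AuthorAndCommitTotals is a hypothetical name, and it assumes each input line has the form key, whitespace, count.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical simulation of the second job on lines such as "-8:0:cnauroth@apache.org 10". */
final class AuthorAndCommitTotals {
    /** Maps "timezone:overtime" to a two-element array {authors, commits}. */
    static Map<String, long[]> summarize(List<String> firstJobOutput) {
        Map<String, long[]> totals = new TreeMap<>();
        for (String line : firstJobOutput) {
            String[] keyAndCount = line.trim().split("\\s+");
            // Limit 3 keeps the email intact even if it contains a colon.
            String[] keyParts = keyAndCount[0].split(":", 3);
            String group = keyParts[0] + ":" + keyParts[1];
            long[] t = totals.computeIfAbsent(group, g -> new long[2]);
            t[0] += 1;                              // one line per distinct author
            t[1] += Long.parseLong(keyAndCount[1]); // that author's commit count
        }
        return totals;
    }
}
```

For the two sample lines from the first job, the group -8:0 ends up with 2 authors and 55 commits.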
The final output is as follows. The numbers in each line are timezone, overtime, author_number and commit_number.
-10:0 4:6
-10:1 2:8
-8:0 654:5574
-8:1 357:2177
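Before charting, each of these lines has to be split back into its four fields. A minimal parser might look like the sketch below; OutputRow is a hypothetical name, and it assumes that an overtime value of 1 means true.

```java
/** Hypothetical holder for one line of the final output, e.g. "-8:1 357:2177". */
final class OutputRow {
    final int timezone;
    final boolean overtime;
    final long authors;
    final long commits;

    private OutputRow(int timezone, boolean overtime, long authors, long commits) {
        this.timezone = timezone;
        this.overtime = overtime;
        this.authors = authors;
        this.commits = commits;
    }

    /** Parses "timezone:overtime author_number:commit_number". */
    static OutputRow parse(String line) {
        String[] halves = line.trim().split("\\s+");
        String[] key = halves[0].split(":");
        String[] value = halves[1].split(":");
        return new OutputRow(
                Integer.parseInt(key[0]),
                key[1].equals("1"), // assumption: 1 marks overtime
                Long.parseLong(value[0]),
                Long.parseLong(value[1]));
    }
}
```

Parsing the last sample line gives timezone -8, overtime true, 357 authors and 2177 commits.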
Post Processing
highchart.js is used to collect the data and draw the pie charts and bar charts automatically.
Charts