MapReduce: Contributions to Hot GitHub Projects by Timezone

When I decided to do some small projects as Hadoop learning exercises, I almost immediately had the idea of analyzing git contributions, not only because git log data is publicly available and easy to access, but also because the result could be interesting and useful to someone.

In this post, I'll do two simple analyses with two MapReduce jobs. The first reveals each timezone's share of the contributions, and the second shows, per timezone, how many of these contributions were made in overtime.

Input Data

I manually collected data for 8 popular repos related to big data (as I'm learning big data): activemq, hadoop, hbase, hive, kafka, spark, storm and zookeeper. The analysis is run on each repo separately and on all of them combined.

I did not use the GitHub public API to fetch commit history in batch, as I don't have a powerful cluster for processing really large data. So although this code could be used on large data sets, I have not tried it.

I use git log > xx.log to get the commit history of each repo, and put these files into a folder named input. From each commit record I need the author email, the date and time, and the timezone. The whole analysis rests on the assumption that all contributors set their timezone correctly. This is a typical commit record:


commit adf1cdf3d5abed0dae76a4967129ce64f2e16c2f
Author: Chris Nauroth <cnauroth@apache.org>
Date:   Thu Mar 10 14:49:08 2016 -0800

HADOOP-12899. External distribution stitching scripts do not work correctly on Windows.
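A minimal sketch of how the three fields could be pulled out of such a record, assuming the default git log layout shown above. The class name CommitRecordParser and the regular expressions are illustrative and not taken from the repo's code. Truncating the offset to whole hours (so +0530 becomes 5) is also an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the parser used in the actual jobs.
final class CommitRecordParser {

    /** The three fields the analysis needs from one commit record. */
    static final class ParsedCommit {
        final String email;
        final int hour;      // local hour of day, 0-23
        final int timezone;  // UTC offset truncated to whole hours, e.g. -8

        ParsedCommit(String email, int hour, int timezone) {
            this.email = email;
            this.hour = hour;
            this.timezone = timezone;
        }
    }

    // Email between angle brackets on the "Author:" line.
    private static final Pattern AUTHOR =
        Pattern.compile("(?m)^Author:.*<([^>]*)>");

    // Hour of day and signed offset (such as -0800) on the "Date:" line.
    private static final Pattern DATE = Pattern.compile(
        "(?m)^Date:\\s+\\w{3} \\w{3} +\\d{1,2} (\\d{2}):\\d{2}:\\d{2} \\d{4} ([+-])(\\d{2})(\\d{2})");

    /** Returns the parsed fields, or null when either header line is missing. */
    static ParsedCommit parse(String record) {
        Matcher author = AUTHOR.matcher(record);
        Matcher date = DATE.matcher(record);
        if (!author.find() || !date.find()) {
            return null;
        }
        int hour = Integer.parseInt(date.group(1));
        int offsetHours = Integer.parseInt(date.group(3));
        int timezone = "-".equals(date.group(2)) ? -offsetHours : offsetHours;
        return new ParsedCommit(author.group(1), hour, timezone);
    }

    public static void main(String[] args) {
        String record = "commit adf1cdf3d5abed0dae76a4967129ce64f2e16c2f\n"
            + "Author: Chris Nauroth <cnauroth@apache.org>\n"
            + "Date:   Thu Mar 10 14:49:08 2016 -0800\n"
            + "\n"
            + "    HADOOP-12899. External distribution stitching scripts ...\n";
        ParsedCommit c = parse(record);
        // prints "-8 14 cnauroth@apache.org"
        System.out.println(c.timezone + " " + c.hour + " " + c.email);
    }
}
```

Returning null for records without both header lines lets a caller skip malformed input instead of failing the whole job.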

First MapReduce

Custom Delimiter

A commit record spans multiple lines and the number of lines varies, but luckily each record starts on a new line with "commit". So I set a custom record delimiter:


Configuration conf = new Configuration();
// Split the input on "\ncommit " so that each record is one whole commit.
conf.set("textinputformat.record.delimiter", "\ncommit ");
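To see what each record looks like after splitting, here is a small plain-Java imitation of delimiter-based splitting. It is not Hadoop's record reader; DelimiterDemo and splitRecords are made-up names, and the sketch assumes that the delimiter itself is dropped from the records, which is my understanding of how the custom delimiter behaves. One consequence worth noticing: the first record still begins with "commit ", while every later record begins directly with the hash.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics splitting a log on a delimiter string.
final class DelimiterDemo {

    /** Splits the log on the delimiter and drops the delimiter itself. */
    static List<String> splitRecords(String log, String delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = log.indexOf(delimiter, start)) >= 0) {
            records.add(log.substring(start, idx));
            start = idx + delimiter.length();
        }
        if (start < log.length()) {
            records.add(log.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        String log = "commit aaa\nAuthor: X <x@a.org>\n\n    first message\n"
            + "\ncommit bbb\nAuthor: Y <y@b.org>\n\n    second message\n";
        List<String> records = splitRecords(log, "\ncommit ");
        for (String record : records) {
            // Show only the first line of each record.
            System.out.println("[" + record.split("\n", 2)[0] + "]");
        }
    }
}
```

Because the Author and Date lines are matched independently of the first line, the missing "commit " prefix on later records does not affect the fields the mapper needs.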

Mapper

The mapper simply reads a commit record, writes a tuple (timezone, overtime, email) as the key, and writes 1 as the value. I define 8:00 to 19:00 as normal work time, even on weekends; any other time is considered overtime. This is a sample map output:

-8:0:cnauroth@apache.org 1
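The overtime rule can be sketched as a small function. The names OvertimeRule, isOvertime and mapOutputKey are hypothetical, and treating 8:00 as inclusive and 19:00 as exclusive is an assumption, since the rule above does not say which side the boundaries fall on.

```java
// Hypothetical sketch of the work-time rule; boundary handling is an assumption.
final class OvertimeRule {
    static final int WORK_START_HOUR = 8;   // 8:00 counts as work time
    static final int WORK_END_HOUR = 19;    // 19:00 and later counts as overtime

    /** True when the local commit hour (0-23) falls outside 8:00 to 19:00. */
    static boolean isOvertime(int hour) {
        return hour < WORK_START_HOUR || hour >= WORK_END_HOUR;
    }

    /** Builds the textual form of the map output key: timezone:overtime:email. */
    static String mapOutputKey(int timezone, int hour, String email) {
        return timezone + ":" + (isOvertime(hour) ? 1 : 0) + ":" + email;
    }

    public static void main(String[] args) {
        // The sample commit above was made at 14:49 in timezone -8.
        // prints "-8:0:cnauroth@apache.org"
        System.out.println(mapOutputKey(-8, 14, "cnauroth@apache.org"));
    }
}
```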

Custom Partitioning

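Each timezone gets two partitions, one for work time and the other for overtime, so there are 48 partitions in total.


public class TimezonePartitioner extends Partitioner<TripleKey, LongWritable> {

    @Override
    public int getPartition(TripleKey key, LongWritable value, int numReduceTasks) {
        // Shift the timezone by 12 so the lowest offset (-12) maps to index 0,
        // then reserve two consecutive partitions per timezone.
        if(key.getOvertime()){
            // Even index: overtime commits.
            return (key.getTimezone() + 12) * 2;
        }
        else{
            // Odd index: work-time commits.
            return (key.getTimezone() + 12) * 2 + 1;
        }
    }
}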
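As a quick check of the index arithmetic, the same formula can be run outside Hadoop. PartitionIndexDemo is a made-up name and the value 48 is taken from the text above. The sketch also shows that an offset of +12 would produce index 48, which falls outside the range 0 to 47, so commits from such timezones would presumably need separate handling.

```java
// Stand-alone copy of the partition formula, for illustration only.
final class PartitionIndexDemo {
    static final int NUM_PARTITIONS = 48;  // two per timezone, as described above

    /** Same arithmetic as the partitioner: even = overtime, odd = work time. */
    static int partitionFor(int timezone, boolean overtime) {
        int base = (timezone + 12) * 2;
        return overtime ? base : base + 1;
    }

    /** True when the index is valid for NUM_PARTITIONS reducers. */
    static boolean inRange(int partition) {
        return partition >= 0 && partition < NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        int[] timezones = {-12, -8, 0, 11, 12};
        for (int tz : timezones) {
            int overtime = partitionFor(tz, true);
            int work = partitionFor(tz, false);
            System.out.println("tz " + tz
                + " -> overtime " + overtime + " (valid: " + inRange(overtime) + ")"
                + ", work " + work + " (valid: " + inRange(work) + ")");
        }
    }
}
```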

Reducer

The reducer sums the values of each key to get that author's total number of commits. Sample output:


-8:0:cnauroth@apache.org 10
-8:0:txfe@gmail.com 45

Second MapReduce

The output of the first MapReduce job is used as the input.

Mapper

The mapper simply reads each input key/value pair and outputs it unchanged.

Custom Grouping

Although the key is a tuple of three, the grouping needs to be based on the pair (timezone, overtime), so that both the number of authors and the number of commits can be calculated in the reducer.


public static class FirstTwoOnlyComparator extends WritableComparator {
    public FirstTwoOnlyComparator() {
        super(TripleKey.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // The first 4 bytes of each serialized key hold the timezone.
        int i1 = readInt(b1, s1);
        int i2 = readInt(b2, s2);

        int compare1 = Integer.compare(i1, i2);
        if(compare1 != 0){
            return compare1;
        }

        // The next byte holds the overtime flag; the email that follows is ignored.
        boolean first1 = b1[s1+4] == 1;
        boolean first2 = b2[s2+4] == 1;
        return Boolean.compare(first1, first2);
    }
}

job.setGroupingComparatorClass(TripleKey.FirstTwoOnlyComparator.class);
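The raw comparison only works if the key is serialized as a 4-byte int, then a 1-byte boolean, then the email. The sketch below assumes that layout (it is implied by the offsets in the comparator, but the TripleKey write method is not shown here) and reproduces the comparison with plain java.io and java.nio. RawKeyCompareDemo and its methods are illustrative names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Illustration of comparing only the first two fields of a serialized key.
final class RawKeyCompareDemo {

    /** Assumed layout: int timezone, boolean overtime, then the email. */
    static byte[] serialize(int timezone, boolean overtime, String email) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(timezone);      // 4 bytes, big-endian
        out.writeBoolean(overtime);  // 1 byte, 1 for true and 0 for false
        out.writeUTF(email);         // never read by the comparison below
        out.flush();
        return bytes.toByteArray();
    }

    /** Compares timezone first, then the overtime flag; ignores the email. */
    static int compareFirstTwo(byte[] b1, int s1, byte[] b2, int s2) {
        int tz1 = ByteBuffer.wrap(b1, s1, 4).getInt();
        int tz2 = ByteBuffer.wrap(b2, s2, 4).getInt();
        int byTimezone = Integer.compare(tz1, tz2);
        if (byTimezone != 0) {
            return byTimezone;
        }
        return Boolean.compare(b1[s1 + 4] == 1, b2[s2 + 4] == 1);
    }

    public static void main(String[] args) throws IOException {
        byte[] a = serialize(-8, false, "cnauroth@apache.org");
        byte[] b = serialize(-8, false, "txfe@gmail.com");
        byte[] c = serialize(-8, true, "txfe@gmail.com");
        // Different emails, same timezone and flag: treated as equal (prints 0).
        System.out.println(compareFirstTwo(a, 0, b, 0));
        // Same timezone, work time sorts before overtime (prints -1).
        System.out.println(compareFirstTwo(b, 0, c, 0));
    }
}
```

Keys that compare as equal here end up in the same reduce call, which is what lets one call see every author of a (timezone, overtime) pair.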

Reducer

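The reducer sums the grouped values to get the total number of commits. The number of grouped values is the total number of authors, because the first job emits exactly one record per author for each (timezone, overtime) pair.


public void reduce(TripleKey key, Iterable<LongWritable> values, Context context)
throws IOException, InterruptedException {
    long sum = 0;   // total commits in this (timezone, overtime) group
    long size = 0;  // number of records in the group, one per author
    for (LongWritable val : values) {
        sum += val.get();
        size += 1;
    }
    result.setCommitNumber(sum);
    result.setAuthorNumber(size);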

    result_key.setTimezone(key.getTimezone());
    result_key.setOvertime(key.getOvertime());

    context.write(result_key, result);
}
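The effect of grouping plus this reducer can be imitated in plain Java on the textual output of the first job. This is only a sketch of the logic, not the job itself. SecondJobSketch and summarize are made-up names, and the sketch assumes each line has the form timezone:overtime:email followed by whitespace and a count, with each (timezone, overtime, email) key appearing once.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java imitation of the second job's grouping and reduce step.
final class SecondJobSketch {

    /** Maps "timezone:overtime" to {authorNumber, commitNumber}. */
    static Map<String, long[]> summarize(String[] firstJobLines) {
        Map<String, long[]> totals = new LinkedHashMap<>();
        for (String line : firstJobLines) {
            String[] keyAndCount = line.trim().split("\\s+");
            String[] keyParts = keyAndCount[0].split(":", 3);
            // Group on the first two key fields only, as the comparator does.
            String group = keyParts[0] + ":" + keyParts[1];
            long[] t = totals.computeIfAbsent(group, k -> new long[2]);
            t[0] += 1;                               // one more author
            t[1] += Long.parseLong(keyAndCount[1]);  // that author's commits
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] lines = {
            "-8:0:cnauroth@apache.org 10",
            "-8:0:txfe@gmail.com 45",
        };
        for (Map.Entry<String, long[]> e : summarize(lines).entrySet()) {
            // prints "-8:0" then a tab then "2:55"
            System.out.println(e.getKey() + "\t" + e.getValue()[0] + ":" + e.getValue()[1]);
        }
    }
}
```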

The final output is as follows. The numbers on each line are timezone, overtime, author_number and commit_number.


-10:0	4:6
-10:1	2:8
-8:0	654:5574
-8:1	357:2177
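From lines like these, the overtime share of each timezone is overtime commits divided by all commits in that timezone. A minimal sketch of that arithmetic follows. The class OvertimeShare is a made-up name for illustration and is not part of the charting code.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative post-processing: overtime share of commits per timezone.
final class OvertimeShare {

    /** Maps each timezone to overtimeCommits / (workCommits + overtimeCommits). */
    static Map<Integer, Double> overtimeShare(String[] finalOutputLines) {
        // Per timezone: index 0 holds work-time commits, index 1 overtime commits.
        Map<Integer, long[]> commits = new TreeMap<>();
        for (String line : finalOutputLines) {
            String[] fields = line.trim().split("\\s+");
            String[] key = fields[0].split(":");    // timezone, overtime flag
            String[] value = fields[1].split(":");  // author_number, commit_number
            int timezone = Integer.parseInt(key[0]);
            int overtimeFlag = Integer.parseInt(key[1]);
            long[] perZone = commits.computeIfAbsent(timezone, k -> new long[2]);
            perZone[overtimeFlag] += Long.parseLong(value[1]);
        }
        Map<Integer, Double> share = new TreeMap<>();
        for (Map.Entry<Integer, long[]> e : commits.entrySet()) {
            long work = e.getValue()[0];
            long overtime = e.getValue()[1];
            share.put(e.getKey(), (double) overtime / (work + overtime));
        }
        return share;
    }

    public static void main(String[] args) {
        String[] lines = {
            "-10:0\t4:6",
            "-10:1\t2:8",
            "-8:0\t654:5574",
            "-8:1\t357:2177",
        };
        for (Map.Entry<Integer, Double> e : overtimeShare(lines).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

For the four sample lines this gives 8 / 14 for timezone -10 and 2177 / 7751 for timezone -8.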

Post Processing

I wrote a shell script to run the 8 single-repo calculations and 1 combined calculation, and used Highcharts to collect the data and draw the pie charts and bar charts automatically.

And to give each timezone a more readable name, I referred to this MSDN page to map the timezone integer to major cities.

Charts

Conclusion

The three biggest contributing timezones are +00, -07 and -08, which correspond to Western Europe and North America.

Much of this work is done in overtime, especially in the +00 timezone. Thanks for all these contributions.

People in timezone +09, which corresponds to Japan/Korea, are not among the biggest contributors, but they're really hard-working.

I need to be more diligent to be one of the contributors.

The second conclusion is hard to believe, so maybe there's something wrong with the whole method. You're welcome to point out any mistakes, or share your ideas about git contribution analysis.

Source code is in this repo https://github.com/AlunYou/AlunYou.github.io.