Tuesday, December 6, 2016

Stack Overflow in the City

When I taught research methods, I required my students to write a research proposal. Nearly all of them chose methods that involved collecting primary data (that is, collecting the data yourself to answer your question), usually directly from participants in the form of surveys or measures. I encouraged them to consider alternative ways to answer their questions, including more indirect measures (such as observation) and even secondary data (data collected by someone else). Science is, first and foremost, about answering questions and testing hypotheses. Nothing says you have to be directly involved in every aspect of the study, as long as you design the study rigorously and in a way that allows you to answer your question or test your hypothesis. In fact, there are times when it is far more justified to use data and tools that have already been developed.

One of my favorite blogs, Variance Explained, does a lot of interesting work using secondary data, such as an examination of Trump's tweets. A few days ago, its author posted yet another secondary data analysis, one that examines whether software developers in different cities use different technologies and programming languages. Sure, he could have sent out surveys to programmers in different locations. But instead, he used Stack Overflow traffic data to answer his question. (I should note that he works at Stack Overflow, and so has access to data that we do not, but he does share the code he used to generate the data.)

He examines data from the four metropolitan areas accounting for the most Stack Overflow traffic: San Francisco, Bangalore, London, and New York (where he is based). First, he compares the two US cities.

One clear difference: New York has a larger share of Microsoft developers. Many tags important in the Microsoft technology stack, such as C#, .NET, SQL Server, and VB.NET, had about twice as much traffic in New York as in San Francisco. This may be because many banks and financial firms, which are much more common in NY than in SF, use these technologies.

There are also patterns in the technologies that are more common in the San Francisco area, especially languages developed by Apple (Cocoa, Objective-C, OSX) and Google (Go, Android). We can also see several influential open source projects, especially ones associated with Apache (Hive, Hadoop, Spark).

When he expands his analysis to include all four cities, he finds that London has the highest proportion of developers using the Microsoft stack, New York has a higher proportion using data science tools (such as pandas for Python, and R, which I also use), and Bangalore leads in Android development. Even after bringing in the other two cities, San Francisco still leads in the same technologies listed above, except Android.
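
For anyone curious what a comparison like "about twice as much traffic in New York as in San Francisco" looks like in practice, here is a minimal sketch in Python. It is not the code from his post, and the visit counts are numbers I made up purely for illustration; the function names (tag_shares, share_ratios) are my own as well. The idea is to convert each city's raw counts into shares of that city's total traffic, then divide one city's share by the other's.

```python
from math import log2

# HYPOTHETICAL visit counts per tag, invented for illustration only.
# These are NOT real Stack Overflow traffic figures.
visits = {
    "New York": {"c#": 200, "python": 300, "objective-c": 50, "java": 450},
    "San Francisco": {"c#": 100, "python": 400, "objective-c": 100, "java": 400},
}


def tag_shares(counts):
    """Convert raw visit counts into each tag's share of the city's total."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def share_ratios(visits, city_a, city_b):
    """For tags seen in both cities, return share in city_a / share in city_b."""
    shares_a = tag_shares(visits[city_a])
    shares_b = tag_shares(visits[city_b])
    common_tags = shares_a.keys() & shares_b.keys()
    return {tag: shares_a[tag] / shares_b[tag] for tag in common_tags}


ratios = share_ratios(visits, "New York", "San Francisco")

# Print tags from "most New York" to "most San Francisco".
for tag, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag:12s} {ratio:.2f}x  (log2 = {log2(ratio):+.2f})")
```

A ratio above 1 means the tag takes up a larger share of traffic in the first city, and a ratio below 1 means the opposite. Using shares rather than raw counts matters because the cities differ in overall traffic; the log2 column just puts "twice as much" and "half as much" on a symmetric scale.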
