Thursday, December 2, 2010

Google AI Challenge: Languages Used by the Best Programmers





The Google AI Challenge recently wrapped up with a Lisp developer from Hungary as the winner. The competition challenges contestants to create bots that push the limits of AI and game theory. These bots compete against one another, and a complete ranking of competitors is available. The big story today is that the winner (Gábor Melis) used Lisp to beat more than 4000 other contestants from around the world, who used a host of different programming languages.




Paul Graham has stated that Java was designed for "average" programmers while other languages (like Lisp) are for good programmers.  The fact that the winner of the competition wrote in Lisp seems to support this assertion.  Or should we see Mr. Melis as an anomaly who happened to use Lisp for this task?



Programming Language Usage


Java, C++, Python and C# were heavily used overall.

     language count(*)
1        Java     1634
2         C++     1232
3      Python      948
4          C#      485
5         PHP       80
6        Ruby       55
7     Haskell       51
8        Perl       42
9        Lisp       33
10 Javascript       19
11          C       18
12      OCaml       12
13         Go        6
14      Scala        4
15     Groovy        1

In the Top 200
     language count(*)
1        Java       70
2         C++       64
3      Python       34
4          C#       17
5           C        4
6     Haskell        3
7         PHP        3
8        Ruby        2
9  Javascript        1
10       Lisp        1
11      OCaml        1


In the Top 100
  language count(*)
1     Java       33
2      C++       32
3   Python       20
4       C#        9
5        C        3
6  Haskell        1
7     Lisp        1
8    OCaml        1

Top 10
  language count(*)
1     Java        4
2      C++        3
3       C#        2
4     Lisp        1


The plot above is a bit difficult to read because of the number of languages represented (and the similarity in colors). So here is a breakdown by language.

Lisp does appear to be skewed towards higher ranking.  But even more striking are the C hippies:

The functional crowd represented with Haskell also ranked on the higher end:


How about Java?  There is a trend towards the average - but a significantly larger number of entrants used Java.  It is also a language taught in many colleges, so the distribution might reflect greater student participation in this language (although MIT did focus on Lisp back in the day...).

How about representatives from Microsoft?  Einstein and Elvis showed up - Mort was not interested.

I can post charts of other languages if anyone asks - otherwise, download the files for yourself and draw your own conclusions.  And congratulations to Gábor Melis - I am again feeling the inspiration to delve into the mysteries of Lisp and meander among mountains of parentheses...




Methodology Used
No need to proceed further unless you are interested in how the results listed above were derived.

Basically, I used Ruby to scrape the results from the Google AI Rankings site.  The results were then read into R, and the ggplot2 and sqldf libraries were used to analyze them.

Get the Data into R
So to find out more... I whipped up a Ruby script to create a delimited file from the 47-page listing online.  (Feel free to get these from their GitHub location and do some additional validation/analysis of your own.)  Read this file into R:


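# Read the semicolon-delimited scrape; the file has no header row
df <- read.csv('googleAI2010.csv', sep=';', header=FALSE)
# Drop the unused seventh column, then give the rest meaningful names
df$V7 <- NULL
names(df) <- c('rank', 'username', 'country', 'organization', 'language', 'elo_score')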
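
Why is there a V7 to drop at all? A likely explanation (an assumption on my part, since the scraper's output is not reproduced in this post) is that each line of the file ends with a trailing semicolon, which read.csv interprets as an extra, empty seventh column. Here is a minimal sketch with two made-up rows; the usernames and scores are hypothetical and are not contest data:

```r
# Hypothetical sample rows, NOT the real contest data.
# Assumption: each line ends with a trailing ';'.
sample_text <- "1;alice;Hungary;Other;Lisp;3700;\n2;bob;Poland;Other;Java;3400;"

sample_df <- read.csv(text = sample_text, sep = ';', header = FALSE)
ncol(sample_df)        # 7: the trailing ';' yields an empty column V7

# Same cleanup as for the real file
sample_df$V7 <- NULL
names(sample_df) <- c('rank', 'username', 'country',
                      'organization', 'language', 'elo_score')
str(sample_df)
```

If your copy of the file has no trailing delimiter, there will be no V7 and the `df$V7 <- NULL` line is simply a no-op.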


Sanity Check
Most of this work can be done in idiomatic R (which has some significant Lisp influences) - which might be a better way to honor the winner.  However, I find myself using sqlite more and more these days - particularly in mobile development.  So I used the sqldf library which uses this database behind the scenes.

Country rankings are available online, and the following emulates these results.  Specifically, the number of entrants in the top 200 ranked contestants from each country can be derived as follows:




library('sqldf')

top200=df[df$rank <= 200,]
sqldf('select country, count(*) from top200 group by country order by 2 desc')


Organization rankings are similar, representing the top organizations within the top 100.  There are some anomalies here: the highest-ranking entry, "Other", is not shown in the online version for obvious reasons, and most of these organizations have only one entrant in the top 100 and are listed in an arbitrary order.  However, the results are otherwise the same in R.




top100=df[df$rank <= 100,]
sqldf('select organization, count(*) from top100 group by organization order by 2 desc')
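
For anyone who would rather skip SQL, the same kind of grouped count can be sketched in base R with table() and sort(). The data frame below is a made-up stand-in for df (the names toy and top4 and all of its values are hypothetical), so treat this as an illustration rather than the code used for the results above:

```r
# Toy stand-in for df (hypothetical values, not the contest data)
toy <- data.frame(
  rank    = 1:6,
  country = c('Hungary', 'Poland', 'USA', 'USA', 'Poland', 'USA'),
  stringsAsFactors = FALSE
)

# Same subsetting idiom as top200 / top100 above
top4 <- toy[toy$rank <= 4, ]

# Rough equivalent of:
#   select country, count(*) from top4 group by country order by 2 desc
counts <- sort(table(top4$country), decreasing = TRUE)
counts

# As a data frame, closer in shape to what sqldf returns
as.data.frame(counts, stringsAsFactors = FALSE)
```

Swapping `country` for `organization` or `language` gives the other breakdowns in the same way.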




R Code
The following are additional snippets of R code used to generate the results above.


# Language Usage

sqldf('select language, count(*) from df group by language order by 2 desc')


sqldf('select language, count(*) from top200 group by language order by 2 desc')
sqldf('select language, count(*) from top100 group by language order by 2 desc')



top10=df[df$rank <= 10,]
sqldf('select language, count(*) from top10 group by language order by 2 desc')



 If you fiddle enough with the bucket size for histograms, you might be able to draw some conclusions... but the density plot seemed like a nicer option.  


library('ggplot2')

# Substitute your favorite language of those available for Lisp below
# Note: opts(title=...) has been removed from current ggplot2; ggtitle() replaces it
qplot(data=df[df$language=='Lisp',], x=rank, geom='histogram', binwidth=1000) + ggtitle('Lisp')





# The density plot at the top of this posting:

ggplot(data=df, aes(rank, fill=language)) +
  geom_density(alpha = 0.2) +
  xlim(0,5000) +
  ggtitle('2010 Google AI Challenge Rankings')  # ggtitle() replaces the removed opts(title=...)


ggsave('program_language_density_plot.png')


# Breakdown by language:

ggplot(data=df[df$language=='Scala',], aes(rank, fill=language)) + geom_density(alpha = 0.2) + xlim(0,5000) + ggtitle('Scala')


Update:  I have been keeping up with the comments - and sketched out some other ways of looking at the data in another post.


6 comments:

  1. Kevin W, Dec 3, 2010 06:55 AM
    The first graph illustrates a severe weakness with HSV color palettes. It is barely possible to distinguish 6 different colors, but more than that is almost impossible.
  2. Brandon, Dec 4, 2010 10:20 AM
    The axis scaling for all the charts should probably be the same.
  3. Mohammad Elsheimy, Dec 5, 2010 10:04 AM
    Java, C++, then C#. Not surprisingly, the graphs say that:
    http://wismuth.com/lang/languages.html
    http://langpop.com/
  4. Andrei Sosnin, Dec 6, 2010 04:36 AM
    I believe, it would be also interesting to see statistics based on the points gained by the contest entrants, not just their ranking positions.

    I've made these charts here based on the data you've published (in Excel though), but using the points instead of rankings:

    http://0000b.blogspot.com/2010/12/my-google-ai-challenge-participation.html

    Of course, it assumes, that points gained are somehow representative of the actual performance of the bot (I believe they are to some extent, at least). One of the charts also demonstrates the inequalities among the points distribution (1st and 2nd places have 300 points of difference, while in the middle the points are distributed much more closely).
  5. Andrei Sosnin, Dec 6, 2010 05:35 AM
    It would also be much more useful to draw the first chart in your post as a cumulative. This way it would be much easier to see the distribution of languages across the ranking positions.
  6. Thomas Levine, Jun 19, 2011 11:24 PM
    #Some other metrics that take into account the rank and number of people using the language

    #Median ranks by language (Low numbers indicate that programmers using that language were good.)
    > sort(sapply(levels(df$language),function(a) median(df$rank[df$language==a])));plot(.Last.value)
             C      Scala Javascript    Haskell       Lisp      OCaml        PHP
        1123.5     1155.0     1196.0     1493.0     1748.0     1799.5     2086.0
          Perl       Ruby       Java        C++         C#     Python         Go
        2128.0     2164.0     2226.0     2420.0     2438.0     2520.0     3242.5
        Groovy
        4250.0
    #Chance that a particular entry is in the top 100 assuming that it depends only on language (High numbers indicate that programmers using that language were good.)
    > a<-table(df$language[df$rank<=100])/table(df$language);sort(round(a[a>0],2));plot(.Last.value)

         C# Haskell    Java  Python     C++    Lisp   OCaml       C
       0.02    0.02    0.02    0.02    0.03    0.03    0.08    0.17