R-Chart: Google AI Challenge: Languages Used by the Best Programmers

Thursday, December 2, 2010

Google AI Challenge: Languages Used by the Best Programmers

The Google AI Challenge recently wrapped up with a Lisp developer from Hungary as the winner. The competition challenges contestants to create bots that push the limits of AI and game theory. These bots compete against one another, and a complete ranking of competitors is available. The big story today is that the winner (Gábor Melis) used Lisp to beat out over 4000 other contestants around the world using a host of different programming languages.

Paul Graham has stated that Java was designed for "average" programmers while other languages (like Lisp) are for good programmers. The fact that the winner of the competition wrote in Lisp seems to support this assertion. Or should we see Mr. Melis as an anomaly who happened to use Lisp for this task?

Programming Languages Usage

Java, C++, Python and C# were heavily used overall.

language count(*)

1 Java 1634
2 C++ 1232
3 Python 948
4 C# 485
5 PHP 80
6 Ruby 55
7 Haskell 51
8 Perl 42
9 Lisp 33
10 Javascript 19
11 C 18
12 OCaml 12
13 Go 6
14 Scala 4
15 Groovy 1

In the Top 200

language count(*)

1 Java 70
2 C++ 64
3 Python 34
4 C# 17
5 C 4
6 Haskell 3
7 PHP 3
8 Ruby 2
9 Javascript 1
10 Lisp 1
11 OCaml 1

Top 100

language count(*)

1 Java 33
2 C++ 32
3 Python 20
4 C# 9
5 C 3
6 Haskell 1
7 Lisp 1
8 OCaml 1

Top 10

language count(*)

1 Java 4
2 C++ 3
3 C# 2
4 Lisp 1

The plot above is a bit difficult to discern due to the number of languages represented (and similarity in colors). So here is a breakdown by language.

Lisp does appear to be skewed towards higher ranking. But even more striking are the C hippies:

The functional crowd represented with Haskell also ranked on the higher end:

How about Java? There is a trend towards the average - but a significantly larger number of entrants used Java. It is also a language taught in many colleges, so its numbers might reflect greater student participation (although MIT did focus on Lisp back in the day...).

How about representatives from Microsoft? Einstein and Elvis showed up - Mort was not interested.

I can post charts of other languages if anyone asks - otherwise, download the files for yourself and draw your own conclusions. And congratulations to Gábor Melis - I am again feeling the inspiration to delve into the mysteries of Lisp and meander among mountains of parentheses...

Methodology Used
No need to proceed further unless you are interested in how the results listed above were derived.

Basically, I used Ruby to scrape the results from the Google AI Rankings site. The results were read into R, and the ggplot2 and sqldf libraries were used to analyze them.

Get the Data into R
To find out more, I whipped up a Ruby script to create a delimited file from the 47-page listing online. (Feel free to get the script and data from their GitHub location and do some additional validation/analysis of your own.) Read this file into R:

df<- read.csv('googleAI2010.csv',sep=';',header=FALSE)
df$V7 <- NULL
names(df)<- c('rank', 'username','country','organization','language','elo_score')
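
The `df$V7 <- NULL` line suggests that `read.csv` produced a seventh, empty column, which is what happens when every row ends with a trailing delimiter. That is an assumption about the file layout, not something confirmed here. A minimal sketch with two made-up rows (the usernames and scores are hypothetical) shows the effect:

```r
# Two invented rows in the assumed layout: six fields plus a trailing ';'.
sample_text <- paste(
  "1;user_a;Hungary;Other;Lisp;3700;",
  "2;user_b;USA;Other;Java;3500;",
  sep = "\n"
)

demo <- read.csv(text = sample_text, sep = ";", header = FALSE)
ncol(demo)            # the trailing ';' yields an extra column, V7
all(is.na(demo$V7))   # V7 holds no data

demo$V7 <- NULL       # drop the empty column, as in the post
names(demo) <- c("rank", "username", "country",
                 "organization", "language", "elo_score")
str(demo)
```

If the real file has no trailing delimiter, there is no `V7` column and that line can simply be omitted.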

Sanity Check
Most of this work can be done in idiomatic R (which has some significant Lisp influences) - which might be a better way to honor the winner. However, I find myself using sqlite more and more these days - particularly in mobile development. So I used the sqldf library which uses this database behind the scenes.

Country rankings are available online, and the following emulates these results. Specifically, the number of entrants in the top 200 ranked contestants from each country can be derived as follows:

library('sqldf')

top200=df[df$rank <= 200,]

sqldf('select country, count(*) from top200 group by country order by 2 desc')
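
For readers who prefer to stay in base R, the same group-and-count can probably be done with `table()`. The sketch below uses a tiny invented data frame (`toy` and its values are hypothetical stand-ins for the scraped `df`) so that it runs on its own:

```r
# Hypothetical stand-in for the scraped data; the real df comes from the CSV.
toy <- data.frame(
  rank    = 1:6,
  country = c("Hungary", "USA", "USA", "Germany", "USA", "Germany"),
  stringsAsFactors = FALSE
)

# Keep the top-ranked rows, then count entrants per country, largest first.
top3   <- toy[toy$rank <= 3, ]
counts <- sort(table(top3$country), decreasing = TRUE)

# Convert to a data frame shaped like the sqldf result.
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("country", "n")
counts_df
```

Swapping `toy` for `df` and the cutoff of 3 for 200 should give the same counts as the sqldf query, though rows with equal counts may come out in a different order.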

Organization rankings are similar, representing the top organizations within the top 100. There are some anomalies here: the highest-ranking organization, "Other", is not shown in the online version for obvious reasons, and most of the rest have only one entrant in the top 100 and are listed in an arbitrary order. However, the results are otherwise the same in R.

top100=df[df$rank <= 100,]
sqldf('select organization, count(*) from top100 group by organization order by 2 desc')

R Code
The following are additional snippets of R code used to generate the results above.

# Language Usage

sqldf('select language, count(*) from df group by language order by 2 desc')

sqldf('select language, count(*) from top200 group by language order by 2 desc')
sqldf('select language, count(*) from top100 group by language order by 2 desc')

top10=df[df$rank <= 10,]
sqldf('select language, count(*) from top10 group by language order by 2 desc')
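
Raw counts favor the most popular languages, so it may help to look at the fraction of each language's entrants that reached the top 200. The sketch below is one way to do that; it uses the counts transcribed from the tables in this post rather than the CSV, and the variable names are only illustrative:

```r
# Counts transcribed from the tables in this post (all entrants vs. top 200).
all_counts <- c("Java" = 1634, "C++" = 1232, "Python" = 948, "C#" = 485,
                "PHP" = 80, "Ruby" = 55, "Haskell" = 51, "Perl" = 42,
                "Lisp" = 33, "Javascript" = 19, "C" = 18, "OCaml" = 12,
                "Go" = 6, "Scala" = 4, "Groovy" = 1)
top200_counts <- c("Java" = 70, "C++" = 64, "Python" = 34, "C#" = 17,
                   "C" = 4, "Haskell" = 3, "PHP" = 3, "Ruby" = 2,
                   "Javascript" = 1, "Lisp" = 1, "OCaml" = 1)

# Languages absent from the top 200 get a count of zero.
in_top200 <- setNames(numeric(length(all_counts)), names(all_counts))
in_top200[names(top200_counts)] <- top200_counts

# Fraction of each language's entrants that finished in the top 200.
share <- sort(in_top200 / all_counts, decreasing = TRUE)
round(share, 3)

# Overall rate for comparison: 200 of all entrants.
sum(in_top200) / sum(all_counts)
```

On these counts C has the highest share (4 of its 18 entrants), while Lisp's single top-200 entry out of 33 falls below the overall rate. With groups this small, a single entrant moves the share a lot, so treat the ordering with caution.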

If you fiddle enough with the bucket size for histograms, you might be able to draw some conclusions... but the density plot seemed like a nicer option.

library('ggplot2')

# Substitute your favorite language of those available for Lisp below
# ggtitle() is used here because current ggplot2 releases no longer provide opts()
qplot(data=df[df$language=='Lisp',], x=rank, geom='histogram', binwidth=1000) + ggtitle('Lisp')

# The density plot at the top of this posting:

ggplot(data=df, aes(rank, fill=language)) +
geom_density(alpha = 0.2) +
xlim(0,5000) +
ggtitle('2010 Google AI Challenge Rankings')

ggsave('program_language_density_plot.png')

# Breakdown by language:

ggplot(data=df[df$language=='Scala',], aes(rank, fill=language)) + geom_density(alpha = 0.2) + xlim(0,5000) + ggtitle('Scala')

Update: I have been keeping up with the comments - and sketched out some other ways of looking at the data in another post.

6 comments:

Kevin W, Dec 3, 2010 06:55 AM
The first graph illustrates a severe weakness with HSV color palettes. It is barely possible to distinguish 6 different colors, but more than that is almost impossible.
Brandon, Dec 4, 2010 10:20 AM
The axis scaling for all the charts should probably be the same.
Mohammad Elsheimy, Dec 5, 2010 10:04 AM
Java, C++, then C#. Not surprisingly, the graphs say that:
http://wismuth.com/lang/languages.html
http://langpop.com/
Andrei Sosnin, Dec 6, 2010 04:36 AM
I believe, it would be also interesting to see statistics based on the points gained by the contest entrants, not just their ranking positions.

I've made these charts here based on the data you've published (in Excel though), but using the points instead of rankings:

http://0000b.blogspot.com/2010/12/my-google-ai-challenge-participation.html

Of course, it assumes, that points gained are somehow representative of the actual performance of the bot (I believe they are to some extent, at least). One of the charts also demonstrates the inequalities among the points distribution (1st and 2nd places have 300 points of difference, while in the middle the points are distributed much more closely).
Andrei Sosnin, Dec 6, 2010 05:35 AM
It would also be much more useful to draw the first chart in your post as a cumulative. This way it would be much easier to see the distribution of languages across the ranking positions.
Thomas Levine, Jun 19, 2011 11:24 PM
#Some other metrics that take into account the rank and number of people using the language

#Median ranks by language (Low numbers indicate that programmers using that language were good.)
> sort(sapply(levels(df$language),function(a) median(df$rank[df$language==a])));plot(.Last.value)
          C      Scala Javascript    Haskell       Lisp      OCaml        PHP
     1123.5     1155.0     1196.0     1493.0     1748.0     1799.5     2086.0
       Perl       Ruby       Java        C++         C#     Python         Go
     2128.0     2164.0     2226.0     2420.0     2438.0     2520.0     3242.5
     Groovy
     4250.0
#Chance that a particular entry is in the top 100 assuming that it depends only on language (High numbers indicate that programmers using that language were good.)
> a<-table(df$language[df$rank<=100])/table(df$language);sort(round(a[a>0],2));plot(.Last.value)
      C# Haskell    Java  Python     C++    Lisp   OCaml       C
    0.02    0.02    0.02    0.02    0.03    0.03    0.08    0.17