1  
And it begins... =) –  gnostradamus Jun 4 '09 at 14:51
8  
thanks to your question, I came to know about the existence of the dump. –  Sathya Jun 4 '09 at 14:55
11  
Not programming related and in the very least, this should be community wiki. –  Robert S. Jun 4 '09 at 16:29
 
This should definitely be CW. I'm not going to vote to close though. –  Zifre Jun 4 '09 at 20:58
1  
I'm not sure why this should be CommunityWiki'd? I want to encourage people to do contribute - rep is a good way of doing this, and CW question == CW answers (I don't personally care since I've now got more than enough :P).. This isn't a "Jon Skeet facts" or FAQ question - reasonable answers require a lot of effort (and programming!).. This answer on the CW sofaq pretty much sums up my view stackoverflow.com/questions/128434/… –  dbr Jun 4 '09 at 22:09
4  
@dbr: it's SO-related, not programming-related as such. Indeed, this has long been one of the categories where CW has proved most useful, by encouraging editing over posting, aggregating useful information over verbose navel-gazing. And FWIW, i find it hard to understand how you feel "penalized" by anyone here... –  Shog9 Jun 4 '09 at 22:42
 
I disagree, but it's community wiki'd –  dbr Jun 4 '09 at 22:48
 
'i find it hard to understand how you feel "penalized" by anyone here...' - I do not, rather I want to "reward" others posting answers (by ~giving them rep) –  dbr Jun 4 '09 at 22:53
6  
What are you guys using to read/analyze these files? Are there any programs/libraries out there that won't choke on an 800MB XML file? I've been working on a program to parse the data and insert it into a SQLite database, but I'm wondering if there are any more direct approaches I could take instead. –  Kyle Cronin Jun 6 '09 at 3:55
6  
@nobody_: any decent stream parser (SAX-based, etc) should be able to handle 800MB of XML easily. Just avoid DOM-based parsers! –  Shog9 Jun 6 '09 at 18:23
 
@dbr: CW posts still get badges, which is a reward. –  Eddie Jun 7 '09 at 4:12
 
@nobody_ Xalan should do it. –  Kuroki Kaze Jun 8 '09 at 17:09
 
Now that accept rate percentages are displayed I'm assuming that the rates for the most engaged users are going up as they accept answers on old questions. I'd like to see what questions haven't been given an accepted answer in that initial surge. e.g. a list of questions with no accepted answer from users with 95-99% accept rate. –  Sam Hasler Aug 27 '09 at 18:01
add comment

migrated from stackoverflow.com Jul 21 '09 at 7:04

This question came from our site for professional and enthusiast programmers.

11 Answers

I wrote a tutorial on importing the Stack Overflow XML files into SQL Server that includes how to bring it into SQL Server, what the tables and fields mean, and how to query it.

I've also built spwho2.com, a front-end for some of the slicing and dicing I'm doing. I've got reports like:

I'm working on reports about users, answers, badges and comments next.

Update 2009/6/9 - User pages are completely online, although folks with wacko display names are going to have a hard time finding theirs. You can check out Jon Skeet as an example, or go to the home page at http://spwho2.com and click on the first letter of your display name. I'll build a search-box style lookup later. On the individual user stat pages, sometimes the statistics are out of order, and I haven't vetted the statistics yet, so if you see anything odd let me know.

Update 2009/6/11 - On user pages, I've added percentiles so you can see where you rank in relation to other users overall, or for a specific tag. For example, on Jon Skeet's page, you can see that he scored in the 99.89% percentile for Submitted Answers on the .net tag, so you might wonder who beat him. Click on the link for Submitted Answers for .net, and you can see the top 100 submitters for that tag. If you go back to his page, another interesting statistic is that he averaged 0.31 hours to answer .net questions, which might sound pretty good - but he only scored 37.25% there. Click on the link for Avg Hours to Answer .NET questions and you'll see that users are averaging as low as .05. The tricky part to that is that they may have answered a lot less questions, so their average looks artificially low. I still haven't vetted most of these, just having fun building pages out. There's bugs with some tags that have # signs in them, for example.

Update 2009/6/14 - How much would you pay for a web site that does all this? $19.95? $29.95? But wait - there's more! I've added:

  • A user & tag search
  • On the tag pages, I'm now showing the top 10 related tags for each tag (the tags that also seem to be attached most often to questions)
  • On the tag pages, I'm showing the 50 oldest unanswered questions if you want to swoop in and find easy questions to answer. Keep in mind that it's first come first serve - the pages will only be updated when a new database export comes out, so by the time you visit your favorite tag, some of the questions may have been answered by someone else.

Update 2009/6/26 - I'd always had a lot of links in the site back to Stack Overflow, but I've added even more after reading Jeff's post about attribution. The user names are now linked back to their Stack Overflow profile, plus there's a footer on every page that links both to Stack Overflow and to the SO blog. I already have every question linked straight back to the Stack Overflow question - I don't want to include the body of the questions (or the answers) on SPWho2 itself, because I'm not after the Google traffic. I haven't included the questioner names next to the questions yet, but I'll do that in the next round of updates when the next database dump comes out.

Update 2009/7/13 - I just updated it with the June data dump. I haven't QA'd the data due to an upcoming vacation, but hey, that's what you're for, ha ha ho ho.

Update 2010/02/13 - I set up a public-facing SQL Server with the Server Fault, Stack Overflow, Super User, and Meta Stack Overflow data dumps. You can connect to it with SQL Server Management Studio 2008 or Toad for SQL Server. Connection info is at Querying the Stack Overflow Data Dump.

share|improve this answer
 
nice functionality Brent, thanks for doing a bunch of the grunt work for us. –  Nathan Koop Jun 8 '09 at 19:44
 
It's kind of interesting that you see mac, linux, vista, and osx in a row in the most closed tags... –  Zifre Jun 10 '09 at 0:09
1  
I think that's because they're typically PC help questions, not programming questions. They get closed when they're not programming related. One of the things I'm working on is tag correlations - seeing which tags show up in combination most often. I bet we'll see "not-programming-related" show up around those. –  Brent Ozar Jun 10 '09 at 3:14
1  
Brent, your user pages don't work for all users - I get a 404 when I put in my user ID. –  Kyle Cronin Jun 10 '09 at 4:47
 
Yeah, that's fixed in the latest update. –  Brent Ozar Jun 11 '09 at 16:59
 
This is great stuff! Who knew "unicode" was so popular?? –  JP Alioto Jun 16 '09 at 5:16
 
@brent have a look at my github tool (linked from the question) it imports the data in a fraction of time it takes the xml openrowset stuff. –  waffles Jun 16 '09 at 5:17
 
Sam - looks interesting, but is there a compiled version? I don't have Visual Studio, and wouldn't know where to begin with that. –  Brent Ozar Jun 16 '09 at 12:04
 
@Brent here you go github.com/sambo99/So-Slow/downloads –  waffles Jun 16 '09 at 22:34
 
I am really confused; what exactly does the percentile mean? –  Paolo Bergantino Jun 27 '09 at 9:47
 
The percentile means where you rank for that particular statistic. If your percentile is 86%, then that means you've performed better than 86% of the other people out there. If your percentile is 100%, then you've performed better than everybody. If your percentile is 15%, then that means you've got a lot of work to do, because 85% of the users are outperforming you on that statistic. –  Brent Ozar Jun 27 '09 at 14:03
 
Mm, I get it. It just wasn't making sense under a few of the categories but I guess it does. Thanks and nicely done. –  Paolo Bergantino Jun 27 '09 at 21:30
 
Thanks! I'll add an explanation about it in the site too. –  Brent Ozar Jun 27 '09 at 22:03
 
fyi, your import procedures no longer work for the posts table. The table is too big to have all in one xml variable/cell at one time. –  Joel Coehoorn Oct 29 '09 at 23:00
add comment

I used Wordle to visualise common substrings in titles (the bigger the more common and/or longer recurring phrases). I add some more fuzziness to my algorithm because there's awfully many synonyms in there.

share|improve this answer
3  
That's one of the prettier visualisations of SO data I've seen so far! I agree it'd be nice to have phrases like "What is the best way" and "What's the best way" combined –  dbr Jul 6 '09 at 14:40
1  
Very cool! Can you post a high-res version of this? –  Andrew Feb 14 '10 at 15:08
1  
+1 who knew "what is the best way" to ask a question? LOL –  Question-Failed Jul 27 '10 at 10:14
 
I find it ironic that now StackOverflow is essentially banning "what is the best" questions, despite them being the most sought after ones. –  Dan Dascalescu 2 mins ago
add comment

These are based on the so-export-2009-06 data-dump..

Boring stuff..

  • Uncompressed there is 1.3GB of XML data (206.1MB compressed using .7z)

Total number of..

.. users: 88,558 [1]

.. reputation points: 16,199,960 (average of ~182 rep per user) [2]

.. badges awarded: 234,599 [5]

.. questions: 182,742 [3]

.. answers: 698,923 [3]

.. votes: 2,379,537 [4]

UpMod                 1915441
DownMod                178300
AcceptedByOriginator   109549
Deletion                22107
Undeletion               1660
Close                    1492
BountyStart              1449
BountyClose              1379
Offensive                 365
Reopen                    121
InformModerator           112
Spam                       98

Top Ten..

..badges

Teacher           31416
Student           29526
Supporter         25362
Scholar           24150
Editor            23450
Nice Answer       22978
Autobiographer    12751
Critic            11609
Commentator        9831
Popular Question   8721

Also, each of the following badges was awarded only once: [6]

  • asp.net-mvc
  • best-practices
  • cocoa
  • django
  • eclipse
  • f#
  • iphone
  • jquery
  • language-agnostic
  • performance

.. viewed questions [7]

  1. What is the best comment in source code you have ever encountered? - 297,365 views
    • What's your favorite "programmer" cartoon? - 140,287 views
    • Programmer Jokes -- what's your best one? - 107,065 views
    • What real life bad habits has programming given you? - 100,058 views
    • Great programming quotes - 54,046 views
    • What is your favorite "programmer" t-shirt? - 52,027 views
    • Hidden Features of C# - 49,927 views
    • The Coolest Server Names - 48,448 views
    • Jon Skeet Facts? - 35,868 views
    • How Does Stackoverflow Work? (The Official FAQ) - 34,614 views

.. voted questions [8]

  1. What's your favorite "programmer" cartoon? - 682 votes
    • How Does Stackoverflow Work? (The Official FAQ) - 621 votes
    • Hidden Features of C# - 585 votes
    • Could we please be a bit nicer to the noobs? - 451 votes
    • Programmer Jokes -- what's your best one? - 385 votes
    • What is the best comment in source code you have ever encountered? - 359 votes
    • Using what I've learned from stackoverflow. (HTML Scraper) - 352 votes
    • What is the single most influential book every programmer should read? - 340 votes
    • What do you use to keep notes as a developer? - 299 votes
    • What real life bad habits has programming given you? - 269 votes

.. voted answers [9]

.. most answered question [10]

  1. "What is the best comment in source code you have ever encountered?" [184618] - 533 points

.. most comments on posts [11]

  1. "Since SQL Server doesn't have packages, what do programmers do to get around it?" [770300] - 107 comments

Graph..

.. age vs reputation

.. age vs account age

.. rep vs number of questions

.. rep vs number of answers

.. question-to-answer ratio

Queries

Some of the above data was gathered using SQL queries on this database. For future reference, here are the queries used to generate the data:

 1: select count(id) from users;
 2: select sum(reputation), sum(reputation)/count(id) from users;
 3: select posttypeid,count(id) from posts group by posttypeid;
 4: select votetypeid, count(id) from votes group by votetypeid;
 5: select name, count(name) from badges group by name order by count(name) desc limit 10;
 6: select name from badges group by name having count(name) = 1;
 7: select title, viewcount from posts order by viewcount desc limit 10;
 8: select title, score from posts where posttypeid = 1 order by score desc limit 10;
 9: select id, score from posts where posttypeid = 2 order by score desc limit 10;
10: select id, title, answercount from posts order by answercount desc limit 10;
11: select id, title, commentcount from posts order by commentcount desc limit 10;
share|improve this answer
 
You state the average rep per user as 184; John D. Cook is 364. Thoughts? –  Gavin Miller Jun 4 '09 at 16:27
3  
One thing that I want to do is verify the information in the dump before analyzing it. For example, the dump says my age is 22, but it's actually 21 (SO has it correct). I've checked a few other ages and they're all one more than they should be in the dump. –  Kyle Cronin Jun 4 '09 at 16:48
1  
@nobody_ there is a bug in the dump. All the ages are one year older. –  Zifre Jun 4 '09 at 20:59
2  
@Zifre: Is it confirmed that all ages are 1 more than they should be? I thought it might be a careless subtraction of the years - i.e. 2009-1987 = 22, but my birthday doesn't come until September so I'm still 21. However, there's a simple test - can someone that already had their birthday this year check their age and see if it's accurate or still +1? –  Kyle Cronin Jun 5 '09 at 12:10
 
@nobody_: I checked and mine is right, but it's not really a reference, as my birthday was last week... –  fretje Jun 6 '09 at 18:57
 
Thanks nobody_, both for filling in a bunch of the headings, and the sqlite DB, which makes things.. slightly easier than parsing XML files! –  dbr Jun 10 '09 at 23:49
 
..opps, queries 8-10 are currently based on the 2009-05 database, only just noticed the June data-dump! I'll update them when the download completes –  dbr Jun 11 '09 at 0:19
add comment

There are 72 questions on SO that have an Answer marked correct that is both lower in score than the highest scoring answer AND are scored lower than -1.

Here is the list of bad, marked correct, answers:

SQL:

select Id, 
    (select max(a.Score) from Answers a where a.ParentId = q.Id ) as MaxScore , 
    (select a.Score from Answers a where a.Id = q.AcceptedAnswerId) as  SelectedScore
into #t
from Questions q
where AcceptedAnswerId is not null 

select '- http://stackoverflow.com/questions/' + cast(Id as varchar) 
, SelectedScore
, MaxScore
from #t where SelectedScore < MaxScore
and SelectedScore < -1
order by selectedscore asc
share|improve this answer
1  
I wonder what's caused the formatting to break - I looked at the source and it seems fine. –  Kyle Cronin Aug 13 '09 at 14:41
 
@waffles: the first one in the list, stackoverflow.com/questions/168427, does not seem to have an answer with a score of -25. Is it deleted, only visible to users with more than 10000 reputation? –  Peter Mortensen Feb 14 '10 at 20:39
 
@Peter, this list is months old, it needs updating –  waffles Feb 14 '10 at 21:53
add comment

No matter what Jeff says, Fastest Gun in the West is still a massive problem:

If you are the first to post an answer you have an almost 50% chance of it getting accepted, if you post the 5th answer you only have a 2% chance of getting your answer accepted.

Full results are below:

0   50477	46.305
1   26680	24.475
2   14516	13.316
3   7609	6.980
4   4023	3.691
5   2136	1.959
6   1244	1.141
7   790	0.725
8   521	0.478
9   285	0.261
10  193	0.177
11  125	0.115
12  116	0.106
13  66	0.061
14  43	0.039
15  40	0.037
16  28	0.026
17  25	0.023
18  15	0.014
19  14	0.013
20  8	0.007
21  4	0.004
22  2	0.002
23  5	0.005
24  8	0.007
25  1	0.001
select q.Id , 
    (  select count(*) from Answers a1 
    		where a1.CreationDate < (select a2.CreationDate from Answers a2 where a2.Id = q.AcceptedAnswerId)
        		and a1.ParentId = q.Id
    ) as AcceptedAnswerRank 
into #t
from Questions q
where AcceptedAnswerId is not null


select AcceptedAnswerRank, count(*), cast(((count(*) + 0.0) / (select count(*) + 0.0 from #t)) * 100.0 as Numeric(6,3))   from #t
group by AcceptedAnswerRank
order by AcceptedAnswerRank asc
share|improve this answer
6  
How much do the stats change if you filter out questions with only one answer? (from a comment on meta.stackoverflow.com/questions/73/… ) –  dbr Jun 29 '09 at 1:57
 
@waffles: very nice stats, it's almost a geometrical progression :D –  Răzvan Panda Mar 16 '12 at 10:47
 
@waffles: Not sure, but it seems the 6th answer has a 2% chance of getting accepted. –  Răzvan Panda Mar 16 '12 at 10:48
add comment

Well I've been mucking around with the data.

And came up with (a probably meaningless) stat that calculates the odds that when you post an answer it becomes accepted. Unfortunately there seems to be no way to filter out community wiki stuff (Jeff let me know this will be in the next dump). So that is mixed in.

For me when I post an answer about 21% percent of the time it will get accepted. The person with the highest ratio, is not Jon Skeet :) The prize goes to rq, who 73% of the time posts the accepted answer to a question. (I excluded people with less than 1000 reputation.)

Top 16 are:

rq                  4596    33  45  0.733333333 http://quirkygba.blogspot.com
Daniel LeCheminant  62055   113 189 0.597883598 
NULL                17637   16  27  0.592592593 NULL
Jb Evain            36702   19  34  0.558823529 http://evain.net/blog/
Todd White          30833   16  29  0.551724138 http://code.logos.com
Rafael              80720   24  44  0.545454545 http://kukawski.pl
Eric Rosenberger    41624   30  56  0.535714286 
Paolo Bergantino    16417   350 679 0.515463918 http://www.rootspot.com
Alex Koshelev       19772   17  33  0.515151515 http://webnewage.org/
Ben Gottlieb        6694    108 212 0.509433962 http://www.standalone.com
JeniT               6739    11  22  0.5         http://www.jenitennison.com/
Bittercoder         4843    18  36  0.5         http://blog.bittercoder.com
John Siracusa       164     19  38  0.5         http://arstechnica.com/staff/fatbits/
Ayman               40005   76  152 0.5         http://aymanh.com
Gdeglin             83491   33  66  0.5         http://www.inigral.com
Miles               64474   34  68  0.5

select OwnerUserId, case 
    when exists (select 1 from Posts p2 where p.ParentId = p2.Id and p2.AcceptedAnswerId = p.Id) then 1
        else 0 
    end as ItsRight
into #t
from Posts p 
where PostTypeId = 2


select OwnerUserId, sum(ItsRight) as [Accepted Answers], count(*) as [Total Answers],  
  (cast (sum(ItsRight) as float) / cast(count(*) as float)) as Ratio 
into #UserRatios
from #t
group by  OwnerUserId
having sum(ItsRight) > 0 and count(*) > 20 
order by cast (sum(ItsRight) as float) / cast(count(*) as float) desc

select DisplayName, r.*, WebsiteUrl, Reputation from #UserRatios r
join Users on Id = OwnerUserId
where Reputation > 1000
order by Ratio desc
share|improve this answer
add comment

I wrote a small program to import the data in a PostgreSQL database. My first analysis was on the interval between an article and the votes:

  • 54 % of the votes occur the same day as the post ("fastest gun in West" syndrome)
  • but you still have 27 % of the votes which occur more than a week after the post

I also studied the interval between a question and an accepted answer.

Graphs and technical details can be found in my original article (in French).

share|improve this answer
add comment

Here are some things I found from analyzing data that Jeff gave me before making the data generally available.

Question statistics

Reputation statistics

Voting patters

share|improve this answer
 
You state the average rep per user as 364; dbr's below is 184. Thoughts? –  Gavin Miller Jun 4 '09 at 16:26
4  
John's analysis happened months ago; there have presumably been more users joining since. –  Paolo Bergantino Jun 4 '09 at 18:15
 
Also, I imagine the data-dump will include more users than John's - the paragraph after only describes 61,213 users, the data-dump seems to contain 86,110 –  dbr Jun 4 '09 at 22:13
2  
You should probably make this CW now that the question is. –  Zifre Jun 6 '09 at 17:27
add comment

I added it all up in my head, and found that Jon Skeet's reputation velocity is a significant fraction of the speed of light.

Which is good, because photons could use the competition!

share|improve this answer
add comment

Very interesting stats. I have imported the XML into a MySQL database and the data dump is available here http://stackoverflow.com/questions/1008141/stackoverflow-data-for-mysql-closed

Happy data mining!

share|improve this answer
 
That question has been deleted. –  balpha Feb 14 '10 at 19:50
add comment

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .