We decided early on that all user-generated content on Stack Overflow would be under a Creative Commons license.

All those great Stack Overflow questions, answers, and comments, so generously contributed by all of you, are licensed under cc-wiki:


cc-wiki license

You are free

  • to Share — to copy, distribute, and transmit the work
  • to Remix — to adapt the work

Under the following conditions

  • Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
  • Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.

The community has selflessly provided all this content in the spirit of sharing and helping each other. In that very same spirit, we are happy to return the favor by providing a database dump of public data.

We always intended to give the contributed content back to the community as a whole. Our primary concern was making sure we didn’t have an AOL-style “incident” where we accidentally release personally identifying information in so-called “sanitized” data. Stack Overflow user Greg Hewgill was kind enough to help us beta test several iterations of the data dump, ensuring that we didn’t release anything except content that is visible on the public website. He also suggested several improvements to the data dump, so that it contains as much useful public information as possible.

Cheers, Greg! Also, thanks to Stack Overflow Valued Associate #00003, Geoff Dalgas, who patiently worked through many iterations of this to get it together on our end.

The current anonymized public data dump is 205 megabytes, 7zipped, and contains these files:

  1. badges.xml
  2. comments.xml
  3. posts.xml
  4. users.xml
  5. votes.xml

Updated 06/08/09: the following are fixed in the June (06-09) dump

  1. Slightly more data (May dump was taken at the end of May)
  2. ParentID is present for Answers (PostTypeId = 2)
  3. AcceptedAnswerID is present for Questions (PostTypeId = 1)
  4. Fixed any invalid XML data in all files
  5. Named the file .7z so people better understand what compression to use
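
With ParentID and AcceptedAnswerID in place, answers can be grouped under their questions. What follows is only a minimal Python sketch: it assumes each post is a `row` element and that the attributes are spelled as in the list above (the exact spelling in the files may differ, so the name is a parameter). The sample XML is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample in the assumed shape of posts.xml: one <row> per post.
SAMPLE = b"""<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerID="3" />
<row Id="2" PostTypeId="2" ParentID="1" />
<row Id="3" PostTypeId="2" ParentID="1" />
<row Id="4" PostTypeId="1" />
</posts>"""


def answers_by_question(source, parent_attr="ParentID"):
    """Map each question Id to the list of Ids of its answers."""
    grouped = defaultdict(list)
    # iterparse yields elements as they are parsed; clearing each one
    # afterwards keeps memory use low on a large file.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "2":
            grouped[elem.get(parent_attr)].append(elem.get("Id"))
        elem.clear()
    return dict(grouped)


result = answers_by_question(io.BytesIO(SAMPLE))
print(result)
```

To run it against the real file, pass the path to posts.xml instead of the in-memory sample.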

Download the Stack Overflow Creative Commons Data Dump via BitTorrent

Our plan is to create a new data dump every month, reflecting all data in the system up to that month. We will seed the latest and greatest dump (at a low bitrate) as long as we can, ideally permanently.

And yes, it’s still fun to say “data dump”. We look forward to seeing what the community can do with this data!


84 Responses

  1. Mathias Bynens says:


  2. Marc-Andre Lureau says:

    Thank you.

  3. jaskirat says:


  4. Jon Skeet says:

    Fantastic news. Looking forward to analysing some of this :)


  5. Mats Lindh says:

    It’s already been said, but: Awesome.

  6. Konrad says:

    Thanks SO team (and Greg), this rocks. Like you, I’m really looking forward to seeing how this data gets used.

  7. Cletus says:

    Awesome! GJ guys!

  8. Nick Berardi says:

    Wonderful news.

  9. Khaja Minhajuddin says:

    Simply Awesome !!
    You guys may say that I am being a pessimist, But I never thought that Jeff and Joel would release these “data dumps”. Now all I have to do is simply create a new SORIP.COM and upload the data to it ;)

  10. Greg Hewgill says:

    Excellent! I’ve downloaded the torrent and will be seeding for as long as my server holds up.

    There’s one bit of data that probably needs explanation. The VoteTypeId field in votes.xml can be one of the following values:

    1 AcceptedByOriginator
    2 UpMod
    3 DownMod
    4 Offensive
    5 Favorite
    6 Close
    7 Reopen
    8 BountyStart
    9 BountyClose
    10 Deletion
    11 Undeletion
    12 Spam
    13 InformModerator
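
The table above can be turned into a lookup for anyone tallying votes.xml. This is a small Python sketch rather than anything official: the enum values and names come straight from the list, while the `tally` helper and its sample input are hypothetical.

```python
from collections import Counter
from enum import IntEnum


class VoteType(IntEnum):
    """VoteTypeId values as listed in the comment above."""
    AcceptedByOriginator = 1
    UpMod = 2
    DownMod = 3
    Offensive = 4
    Favorite = 5
    Close = 6
    Reopen = 7
    BountyStart = 8
    BountyClose = 9
    Deletion = 10
    Undeletion = 11
    Spam = 12
    InformModerator = 13


def tally(vote_type_ids):
    """Count votes by name; ids arrive as strings when read from XML."""
    return Counter(VoteType(int(v)).name for v in vote_type_ids)


counts = tally(["2", "2", "3", "5"])  # hypothetical sample values
print(counts)
```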

  11. VonC says:

    From Rhino (Bolt - http://thephotoshopper.blogspot.com/2009/05/bolt-out-of-blue.html )

    > You’re beyond awesome! You’re… be-awesome!

    Still from Rhino:

    > the impossible can become possible if you’re awesome!

  12. Brent Ozar says:

    Really cool. Up to 17 seeds already, and I’ll leave mine up permanently too. Setting up to do a quick video on how to data mine it with SQL Server Analysis Services this morning.

  13. tinkertim says:

    Hmm… code_swarm here we come :) Now just to get this into a format that will work for that. May try converting to wikimedia style, then use wiki_swarm.

    If I get it working, I’ll post a link.

  14. Lior sion says:


    Sounds like good news, but I’m wondering about the share-alike part. How can companies endorse people using Stack Overflow, given the danger of using code they were not supposed to, since very few commercial companies can actually abide by this rule?

  15. tweakt says:


    One small request: Please use the extension .7z so it’s clear what the file format is.

  16. nobody_ says:

    Awesome, I can definitely put this to good use. *Almost* as good as an API, so I’ll take what I can get. Will there be a way to automatically retrieve the latest dump when it arrives? Perhaps an RSS feed?


    Oh, and as I’m looking over the data, there appears to be a small bug: everyone is a year older in the dump than on the site.

  17. wishi says:

    Any apps there to search through this? Could be an awesome KB.

  18. Mark says:


    What’s to stop your competitors (i.e. the hyphen one) from uploading this data into their DBs?

  19. Peter Cooper says:

    You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

    Has this manner been specified anywhere? That is, what is the expected way to attribute the use of this data?

  20. Jake says:

    1.3Mb/s with only 18 seeds… and you wonder why anyone holding IP rights is effed.

  21. Jake says:

    Whoops, make that 1.3MB/s

  22. Patrick Johnmeyer says:

    And yes, it’s still fun to say “data dump”.

    In the spirit of Jeff’s comment…

    Would you consider this an excremental improvement to the system?

  23. Daniel says:

    Thanks for the data dump, this is awesome.

    Ditto with tweakt. I didn’t read the entry close enough and was confused why the file wasn’t exploding until I re-read that it was 7zipped.

  24. Darren Kopp says:

    awesome. i’ll get started on the “which porn star are you?” app that correlates your SO score to the popularity of porn stars.

    rev 1 will even come with a badge! yay internets!

  25. Roel says:

    So I’m probably missing something, but how can you license content under a CC-license without the original authors’ consent? Is there a EULA agreement when users sign up that says they relinquish rights to their UGC, to the extent permitted by local law, to SO (whoever that may legally be)? I read the post in the first link - that post does not address what the anchor text suggests it does, though. Where do users give up their rights on the content they submit? The faq doesn’t seem to address this, nor can I find a ToU on stackoverflow.com (I would expect a link to it in the footer but I can’t find it anywhere else, either).

  26. Brent Ozar says:

    I just posted a blog entry and tutorial video on how to data mine this with Excel and SQL Server Analysis Services:


  27. Bremen says:

    Mm, 2.2MB/s.

    I’m also a little confused as to how the license pertains to code people get help with on the site — would using code from the site therefore require the same license?

  28. Daniel says:

    Maybe there’s something bad with my download, but posts.xml isn’t well-formed for me. In particular, Python’s xml.parsers.expat is giving me this error:

    xml.parsers.expat.ExpatError: reference to invalid character number: line 1, column 7602956

    Am I the only one having issues with posts.xml, or should I use a parser which is more accepting of bad XML on it? (lxml didn’t work either, actually, but it doesn’t give a column number for failure so its output wasn’t particularly interesting.)

    Also, I’m finding it really annoying that there’s no data definition, leaving us all to figure out the format of the XML files ourselves; I wish I knew what attributes to expect (and what format they are in) ahead of time.

  29. Brent Ozar says:

    In case the SQLServerPedia site is unavailable (as it already seems to be) you can watch the video tutorial on my site (which, hilariously, has a bigger server) -


  30. Derek says:

    > anonymized public data dump

    Why bother with the anonymization? It’s got to be trivial to reverse. If you want to keep voting records and such secret, you should just not include them in the dump. Remember when AOL released “anonymous” data?

  31. Assaf says:

    This is wonderful. I can’t wait for all the web apps that are bound to pop up and analyze this in all sorts of ways. Thanks, SO dudes.

  32. dbr says:

    I’ve posted a slightly-meta’ish-question on Stackoverflow about interesting statistics found in the data..


  33. BCS says:

    What should people like me do who CAN NOT run torrents? I don’t have anywhere I run a computer where I would be allowed to serve out stuff as part of a torrent.

  34. Jameel says:

    Awesome stuff guys , thanks

  35. Wade says:

    Here’s an interesting essay by Bruce Schneier on anonymous data:

    Why “Anonymous” Data Sometimes Isn’t

    and a more recent blog post:
    Identifying People using Anonymous Social Networking Data

    Pretty interesting stuff!

  36. John Topley says:

    An XSD would be nice.

  37. mgb says:

    > So I’m probably missing something, but how can you license content under a CC-license without the original authors’ consent?
    The CC logo and the FAQ clearly explain that content is licensed under CC. You are not relinquishing any rights over your content (IANAL) you still own it and are free to write the same question/answer somewhere else. All you are doing is giving SO the right to use your content under CC.

    Code fragments posted in answers would probably be small enough to be used freely in your own code. Me saying use std::stringstream with .str() to convert a long into a string in an answer doesn’t exactly make your software package a derived work.

  38. SA says:

    So can’t someone take this dump and create a competing site?

  39. mgb says:

    Yes in theory, but they would have to make it more attractive than SO for visitors, which means better Google rank and there wouldn’t be much point unless you also ran more ads - which would turn visitors away.
    Microsoft is doing something like this with their new ‘bing’ search site, they are repackaging wikipedia pages as reference.

  40. Greg Hewgill says:

    > So can’t someone take this dump and create a competing site?

    Absolutely. I remember when Wikipedia was new, clone sites came up all the time. Searching for things in Google was annoying because at the time, many of the knockoff clone sites actually had higher pagerank (that didn’t last long).

    Although they thought these clone sites were competing in some sense, there was simply no comparison. The live interaction and freshness of content is what makes Wikipedia what it is. It’s the same with Stack Overflow. If somebody were to take the Stack Overflow data, repackage it in almost any way at all (that of course serves their own interest in some way, like with more ads), then it will still ultimately be less interesting than the original site itself. And if you could even ask questions on such a clone site, who would answer them?

  41. BobbyShaftoe says:

    Thanks, this is a good contribution to the community.

  42. Markus Jevring says:

    I think this is a very good initiative, but I also believe that this will create a lot of crap sites that basically just use your data. Today, a search on any major search engine for some concept or somesuch will return a link to a wikipedia page. The same search will also return hundreds of results to other ad-ridden sites that contain basically the same content.

    Now, with free content, just like open source in a sense, this isn’t necessarily bad, but it does increase the clutter on the internet, which makes “real” results harder to find. Arguably, if they use the same data as you, the results they deliver are “real” in the same sense as yours. It’s like with refactoring code in a sense. Try to avoid duplicate code that does the same.

    Anyway, nice initiative. =)

  43. Phil says:


    Like Daniel (#comment-24179) I am having some trouble parsing this posts.xml.

    I have tried the .Net XmlReader and XmlTextReader.

    I am a noob when it comes to reading XML but have also tried the former with XmlReaderSettings.CheckCharacters turned off. It gets further but still fails.

    The 2 errors I get with XmlReader are:

    ‘ ‘, hexadecimal value 0x1F, is an invalid character. Line 1, position 7602959.

    ‘<’, hexadecimal value 0x3C, is an invalid attribute character. Line 1, position 44308159.

  44. .jpg says:

    really wish you hadnt used torrent - its blocked on our corporate network.

    And i might be a stick in the mud but dont allow torrent software on my home pcs either.

    Any chance of simple FTP?

  45. thijs says:

    It would be nice if you could publish an XSD for this as well, that would make life much easier.

  46. Christopher Galpin says:

    @BCS @.jpg

    This torrent appears to be small enough for Torrent Relay: http://lifehacker.com/395857/torrentrelay-downloads-any-torrent-through-your-browser

    From my understanding, free users can retrieve files of up to 800 MB at a maximum download speed of 200 KBps, but it won’t seed past completion.

  47. Abdu says:

    What would be a good use for the data?
    If I want to search it, I would use SO.

    The only benefit I can think of is to create a better search functionality than SO’s. Like with sort options and better filters.

  48. sth says:

    For those that really can’t use bittorrent:

  49. Simucal says:

    @jpg, why wouldn’t you “allow” bittorrent software on your home computer? Simple FTP would require potentially costly amounts of bandwidth on the part of Stackoverflow. Using torrents they can utilize the bandwidth of all others who choose to seed and download the data and greatly reduce the strain on their servers.

    Even WoW patches are distributed via torrent.

    @Abdu, One major possibility I see is the ability to have an “offline” version of Stackoverflow’s content. There are certain instances and places where internet isn’t possible and with this data dump they have access to a very nice programming resource without the need for a connection.

  50. Jay Stevens says:

    I’m also having problems with invalid characters in posts.xml (using a couple different parsers — including Excel and a Xerces-based interface). Any hints on this?

    Also, if you happen to have some data analysis insights but don’t happen to have 750 rep yet, you can’t share those insights on SO. :(

  51. Jon says:

    I have put a view of the file schema on my blog.

    StackOverflow Download Data Schema

    Hope it helps!


  52. nobody_ says:

    If anyone is having trouble dealing with the XML files, I’ve copied the data to a sqlite3 database, which can be found here:


    The only changes are in the format of times - the date/time literal has been replaced with the equivalent unix timestamp in most places; in the votes table, only the date was provided, so 00:00 was used.

    This is an extremely large database (1.0GB uncompressed), and without indexing, queries can take minutes to complete. The file index.sql contains sqlite statements that will create indexes on each integer field in the database, as well as the badges table name field, as it might be helpful to select or group by the name of the badge.

    As indexing noticeably adds to the size of the database, to save bandwidth no indexing has been done beyond what sqlite3 automatically does for integer primary keys. However, you can add indexing by following the directions in the README file.
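
For readers who have not added an index to a SQLite file before, here is a rough Python sketch of the idea. The table, column, and index names are hypothetical stand-ins (the real ones are whatever index.sql and the README in the download use), and it runs against a tiny in-memory database so it is safe to try.

```python
import sqlite3

# In-memory stand-in for the downloaded database; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, owneruserid INTEGER, score INTEGER)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, 658, 3), (2, 22656, 10), (3, 22656, 7)],
)

# Index an integer column that queries filter on.
conn.execute("CREATE INDEX IF NOT EXISTS idx_posts_owneruserid ON posts (owneruserid)")

# EXPLAIN QUERY PLAN reports how SQLite intends to run the query.
query = "SELECT count(*) FROM posts WHERE owneruserid = ?"
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (22656,)):
    print(row)

count = conn.execute(query, (22656,)).fetchone()[0]
index_names = [
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(count, index_names)
```

On the real file, the index takes a while to build and makes the database larger, which is the trade-off described above.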

  53. Luke Venediger says:

    Thanks @nobody for the prepped sqlite db! How did you get around the xml error in posts.xml? I got this when parsing with cElementTree:

    SyntaxError: reference to invalid character number: line 1, column 7602956


  54. Alex Martelli says:

    I just posted at http://stackoverflow.com/questions/960020/how-can-i-know-the-average-reputation-of-the-users-in-so/ a Python 2.5 script that (on my Mac) parses the .xml files (with cElementTree) without problems — they’re Unicode with a byte-order mark at the start, not sure what underlying parser you have that can’t deal with that. Maybe you could try downloading and installing lxml…?
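
The script linked above is not reproduced here. As a rough idea of the approach, the following is a separate Python 3 sketch (not Alex's Python 2.5 code) that streams a users file and averages one numeric attribute. The `row` element and `Reputation` attribute names are assumptions about the file layout, and the sample data, including its leading byte-order mark, is made up.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample: UTF-8 byte-order mark, XML declaration, one <row> per user.
SAMPLE = (
    b"\xef\xbb\xbf"
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<users><row Id="1" Reputation="100" /><row Id="2" Reputation="50" /></users>'
)


def average_reputation(source, attr="Reputation"):
    """Average an integer attribute over all <row> elements."""
    total = 0
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        value = elem.get(attr) if elem.tag == "row" else None
        if value is not None:
            total += int(value)
            count += 1
        elem.clear()  # drop the element once it has been counted
    return total / count if count else 0.0


avg = average_reputation(io.BytesIO(SAMPLE))
print(avg)
```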

  55. nobody_ says:

    @Luke Venediger

    I wasn’t aware of cElementTree when I wrote my import script, which is more or less regular expressions and generating the appropriate “insert into” statements. It basically did a low-level match for column="value" on a row by row basis, so it’s very possible for corrupted or invalid data to creep in. In any case, I’ll release it if there’s demand, but it looks like I might be able to get cleaner code and better performance if I go with a library that’s meant for parsing XML, provided I can work around the error that you stumbled across.

  56. Jon Skeet says:

    @Alex: My guess is that the parser your Python script is using just isn’t as strict as the .NET one. If there really *is* a character U+001F in the data, it’s undeniably invalid XML.

    I’m going to have a look later today, hopefully (and convert the files to Protocol Buffers if possible, partly to see the difference in size).

  57. Brent Ozar says:

    Yeah, it’s got invalid XML in it. I emailed w/Geoff yesterday. There are comments and posts that use strings that, when output to XML, are hosing things up. I don’t know enough (anything) about SQL-to-XML conversions to help on that one.

  58. Jason Alexander says:

    Dang, yeah, I ran into the same problem while generating XSD’s for these. Looking forward to getting the corrected data.

  59. nobody_ says:

    Hmm.. I just went into posts.xml with a hex editor, and it looks like the character at position 7,602,956 is an “s” (0x73). It’s in the body of post id 139921 for future reference. Based on Jon Skeet’s post, I also did a search for 0x1f in the file, but I could not find any occurrences. In short, I can’t find anything wrong with the file, but I am concerned about the integrity of the SQLite DB that I’m distributing, so further help in nailing down the source of the invalid XML would be appreciated.

  60. Jon Skeet says:

    @nobody_: The “position” is likely to be a position in terms of XML characters, not raw bytes. It’s also possible that there’s an entity reference of 0x1F rather than the character appearing directly. It’s unfortunately quite tricky to analyze the problem when there’s so much data, and when it’s all on one line :(

    If I can work out exactly where the file has gone wrong and how to fix it, I’ll put up a small patch program. In the meantime, I’m just indexing your SQL database :)

    It may well be that *using* the database, it’s a lot easier to find the problems in the XML files…

  61. Geoff Dalgas says:

    We are getting close to a new export file for the month of June which will resolve the data issues. Look for an update on this blog where we will post a new torrent for all to download.

  62. Jon Skeet says:

    Before the new data dump arrives, however, I believe I’ve found all the problems. (Not sure about posts.xml yet, as I haven’t converted that to protobuf format, but the rest work.)

    For both comments.xml and posts.xml, open up the files in your favourite “large file hex editor” (I used “HHD Hex Editor Neo”) and use a regular expression (treating the binary data as ASCII) to replace “&#x0[12345678BCEF];” with “&#x3F;”; ditto for “&#x1.;”. This will get rid of all the entity references which lead to invalid XML characters.

    Then find “<pre>” (that’s the “pre” tag in case it’s stripped here) in posts.xml at offset 0x2a45112. Do what you like with this - I changed it to “[pre]”. That should be all it takes.
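
The character-reference replacement described in this comment can also be scripted instead of done in a hex editor. The Python sketch below is one possible reading of it, not Jon's actual procedure: the pattern mirrors the two expressions quoted above (with the second narrowed to hex digits), the sample row is made up, and the stray “pre” tag is not handled.

```python
import re
import xml.etree.ElementTree as ET

# References to control characters that XML 1.0 forbids, following the two
# patterns quoted above: &#x01;-&#x08;, &#x0B;, &#x0C;, &#x0E;, &#x0F;, &#x10;-&#x1F;
INVALID_REF = re.compile(rb"&#x0[1-8BCEF];|&#x1[0-9A-F];", re.IGNORECASE)


def sanitize(data: bytes) -> bytes:
    """Replace each invalid character reference with &#x3F; (a question mark)."""
    return INVALID_REF.sub(b"&#x3F;", data)


# Made-up row: two invalid references and one valid one (&#x0A; is a newline).
broken = b'<posts><row Id="1" Body="a&#x1F;b&#x0B;c&#x0A;d" /></posts>'
fixed = sanitize(broken)

row = ET.fromstring(fixed)[0]
body = row.get("Body")
print(fixed)
print(repr(body))
```

This reads the whole input into memory, so for a file the size of posts.xml it would need to be applied piece by piece.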


  63. Greg Hewgill says:

    There are also some left angle brackets (U+3008) in http://stackoverflow.com/questions/151744/are-you-using-ascii-art-to-decorate-your-code/151757#151757 that got converted from their Unicode representations to an ASCII left angle bracket instead of their UTF-8 representation. Of course the literal left angle bracket is invalid inside an XML attribute.

    I’ve got a very hacky Python script that attempts to sanitise the XML input; it at least makes it possible to parse with the standard Python SAX parser. If there is interest I’ll post the script.

  64. Jason Alexander says:

    @greg - I’d love to grab that py script from you, if at all possible. I’m getting impatient waiting, and was about to go your route. :)

  65. Phil H says:

    I managed to clean up the posts.xml file so the .Net XmlReader can parse it. Wrote a little program to add a newline before each row element (it’s all one line otherwise), and then used a text editor to check out the problems.

    I then hacked a little util together to produce each “row” as a separate Microsoft Word document. (I am doing this to produce a large set of documents for testing the performance of another app.) Many hours later I aborted it after 340,000 rows and 3.2 GB of Word documents produced. (I say this because if anyone else is interested in this sort of usage, let me know and I can look at sharing stuff)

    Anyway - Posts.xml has a PostTypeId attribute which seems to contain “1” or “2” which I assume represent “question” and “answer” respectively. What isn’t immediately obvious to me is how the answers are mapped to their question. What am I missing?
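
The newline trick mentioned at the start of this comment is easy to reproduce. Phil's program was not posted, so this is just a guess at the idea in Python: it assumes every record starts with the literal bytes `<row ` and works on data already in memory.

```python
import re

# Match "<row " only when it is not already at the start of a line.
ROW_START = re.compile(rb"(?<!\n)<row ")


def one_row_per_line(data: bytes) -> bytes:
    """Insert a newline before each <row ...> element."""
    return ROW_START.sub(b"\n<row ", data)


# Made-up single-line sample in the assumed shape of the dump files.
sample = b'<posts><row Id="1" /><row Id="2" /></posts>'
split = one_row_per_line(sample)
lines = split.split(b"\n")
print(split.decode("ascii"))
```

A file as large as posts.xml would have to be processed in chunks, taking care of a `<row ` marker that straddles a chunk boundary.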

  66. Jeff Atwood says:

    OK, the new June dump is seeded as a torrent now:

    - Slightly more data (May dump was taken at the end of May)
    - ParentID is present for Answers (PostTypeId = 2)
    - AcceptedAnswerID is present for Questions (PostTypeId = 1)
    - Fixed any invalid XML data in all files
    - Named the file .7z so people better understand what compression to use

  67. Jon Skeet says:

    I haven’t checked the new dump yet, but would it be possible to make future ones create one line per row, instead of the whole XML document being on a single line? On text editors which can cope with large files, that would make it a lot saner to deal with.

    I’d like to suggest we create a wiki somewhere with the most interesting queries. (I don’t think an SO question would really be appropriate…)

    Oh, and I didn’t *spot* anything to indicate whether a post was community wiki or not. I don’t have the database in front of me at the minute, so I can’t check… but if it’s not there now, could you include it fairly easily?

  68. Brent Ozar says:

    I just wrote a blog entry explaining how to import the data into SQL Server and what the different fields mean:


    Jon - if a Post has OwnerUserId = 1, that’s the Community user account, so it’s a wiki.

  69. Jon Skeet says:

    @Brent: Ah, cool. So by counting OwnerUserId=22656 I really was only counting my non-CW posts. Fun.

    It’s not often I wish I had better SQL skills, but this kind of data does it. I might generate a Protocol Buffer version without any text in, which should be easily loadable into memory… then I could use LINQ to Objects, which I’m much more comfortable with :)

    I wonder if LINQPad has some easy way of loading in data that I could use… otherwise I could just munge Snippy a bit. So sad that I have about 101 other things I really should be doing…

  70. Brent Ozar says:

    Jon - great idea about the wiki. I slapped together a section over at SQLServerPedia:


    I’ve got my XML import queries, schema notes and a couple of queries there, and I’ll add more after I get done with my next pet project.

  71. nobody_ says:

    @Brent Ozar:

    I don’t have the latest dump, so things might have changed, but all my posts have a OwnerUserID of 658, even the ones that are CW. Also, the user id of Community is -1 (Jeff has id 1) and does not own any posts.

    @Jon Skeet:

    Regarding the Wiki idea, I suggest you contribute to this question dbr set up:


    I know you said you thought it wasn’t appropriate for a SO question, but I disagree, and I think this is what CW is made for. Besides, it gets the most coverage and visibility if it stays on SO.

  72. Jon Skeet says:

    Fair enough, as there’s already a question there I’ll do what I can with that (when I have time to do anything useful, admittedly - probably not tonight). Unfortunately I’m more likely to be a consumer than a producer on this one.


  73. Greg Hewgill says:

    It looks like the June dump cleaned up all the remaining XML formatting problems, so that script I mentioned yesterday is no longer needed.

    Looking at the latest dump, there appear to be no posts where OwnerUserId="-1", so I’m still not certain how to identify community wiki posts.

    On another note, the number of questions on a day last week reached 1307, which for the first time exceeds the number of questions per day on launch week (which was 1301). Here’s some simple graphs: http://hewgill.com/~greg/stackoverflow/stats.html

  74. Brent Ozar says:

    Greg - Sorry, it’s not negative one - that was supposed to be just one. The community ID is 1. If you query the users table it’ll make sense. Nice graphs! I’ve put together a little site showing some of the metrics at:


  75. nobody_ says:

    @Brent Ozar:

    Jeff’s ID is 1, Community is -1:

    sqlite> select id, displayname from users where id = 1 OR id = -1;
    1|Jeff Atwood

    capcha: stack 240

  76. BobbyShaftoe says:

    This is great that you all did this. It’s good from a transparency and following the CC license sort of standpoint. My question is for users. Judging by the comments, this is a very sought after feature.

    What are you going to do with it? Unless you are trying to create a doomed-to-fail ripoff of SO or something like bigresource.com (this pops up in my search results way too much), I don’t really see the use. You can already get the data on SO. I think a full SO API would be more interesting. But that’s just me. I am interested to know what the uses will be though.

  77. Phil says:

    Thanks for the update Jeff!

  78. nobody_ says:

    I just created a new sqlite3 file with the June 2009 data:


    This time I decided to include indexing directly, so the file’s a bit larger (~500MB gzipped, 1.6GB uncompressed).

    Also, please disregard README~, Emacs can be a bit overzealous with its autosaving. Thanks.

  79. Glitz says:

    Is it possible to upload the tag db table(s) as well next time? I understand if that’s too much though. Great stuff regardless!

  80. nobody_ says:


    All the tag data is there, you can find it in the tags column in the posts table, so you can normalize it into a separate table (or two) if you want. However, I do with that the tags were separated by maybe a space instead of >< - it would definitely make it easier to read at a glance.
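
Splitting that tags column into separate values takes only a few lines. Here is a small Python sketch, assuming the column holds tags wrapped in angle brackets back to back (for example `<c#><.net>`) once any XML escaping has been decoded. The helper names and sample rows are hypothetical.

```python
import re

# One tag is whatever sits between a "<" and the next ">".
TAG = re.compile(r"<([^<>]+)>")


def split_tags(raw):
    """Turn '<c#><.net>' into ['c#', '.net']; empty or missing input gives []."""
    return TAG.findall(raw or "")


def normalize(rows):
    """Yield (post_id, tag) pairs suitable for a separate post-tags table."""
    for post_id, raw in rows:
        for tag in split_tags(raw):
            yield post_id, tag


tags = split_tags("<c#><.net><sql-server>")
pairs = list(normalize([(1, "<python><xml>"), (2, None)]))
print(tags, pairs)
```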

  81. nobody_ says:

    with -> wish

    and the >< is supposed to be & gt;& lt;

  82. Stu Thompson says:

    Is there somewhere else where folks are discussing this? E.g.: I see that the community user (-1) has ~11k of down votes. Why?

    (This is just so fracking fun!)

  83. mmyers says:

    @Stu Thompson:

    The community user “Own[s] downvotes on spam/evil posts that get permanently deleted.”

    From http://stackoverflow.com/users/-1/community .

  84. Stu Thompson says:

    Ah, cool. Thank you mmyers.

    I’ve posted my first take on the stats, which looks at up vs. down votes over time, at http://lanai.dietpizza.ch/geekomatic/2009/06/09/1244565360000.html

    Hours of entertainment! I’m so glad this data is available now. :)
