The latest version of the Stack Overflow Creative Commons Data Dump is now available. This reflects all public data in Stack Overflow up to October 2009.

Download the Stack Overflow Creative Commons Data Dump via BitTorrent

Please note that the Stack Overflow data dumps are now hosted at LegalTorrents! You can subscribe via RSS and be notified every time a new dump is available.

Have fun remixing and reusing; all we ask is for proper attribution.

7 Responses

  1. Jeff Atwood says:

    Oh, and if there are any remaining issues with the data dump, PLEASE let us know now. We’re planning to dump Super User and Server Fault next month as well, and I want this to be the final data format.

  2. Joel Coehoorn says:

    Ahh! It’s finally here, and of course it arrives just after I reinstall my operating system. For anyone who cares, I have a significant update planned for StackQL, but it might be Tuesday before it’s ready now.

  3. Alexey says:

    How large of a dump do we have here?

  4. Sung Meister says:

    Bah, after considering the possibility of what can be done with this data, I ran into a problem of *how* to ETL this to my database ;)

    Thank you for the data dump, as always, Jeff

  5. Kyle Cronin says:

    Nice, it’s good to know there’s a roadmap for dumps of the other sites. I just have a few questions:

    1. How will the dumps of the different sites be handled? Separate torrents? Separate files in the same torrent?

    2. How will you address data that spans sites, such as the question migration information and account association?

    3. You mention ServerFault and SuperUser – will a dump also be available for Meta?

  6. Joel Coehoorn says:

    One question on the data format that might be worth addressing before setting the format in stone:

    On the posts table there is both a LastEditUserId and a LastEditDisplayName column. Do we really want both? Especially as there is now only the OwnerUserID column, and the Owner is probably more important most of the time. Also, wasn’t OwnerDisplayName part of the data at one time?

    Nothing too serious, but worth a mention at least.

  7. Jesse Hartwick says:

    I’d expect that the best way to handle migrations is to only include the data in the dump of the site to which the data was migrated.

    The dumps only contain the most recent edit of a question/answer/etc., and the logical extension of that is for the dump to have the most recent edit across the entire tetralogy. The ‘new edit’ on the site to which the data was migrated would supersede the ‘previous edit’ on the site from which the data was migrated. Just as no previous edits are included in the dump, no migrated data should be included in the dump for the site from which it was migrated.
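
The migration rule proposed in the last comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Post` and `Migration` records, their fields, and the `posts_for_dump` function are hypothetical names invented for this example and are not taken from the actual dump format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    # Hypothetical record: one row that could appear in a site's dump.
    site: str
    post_id: int
    title: str


@dataclass(frozen=True)
class Migration:
    # Hypothetical record: a post moved from one site to another.
    from_site: str
    from_post_id: int
    to_site: str
    to_post_id: int


def posts_for_dump(site, posts, migrations):
    """Return the posts to include in the dump for `site`.

    A post that was migrated away from `site` is left out, so it
    appears only in the dump of the site it was migrated to.
    """
    migrated_away = {
        m.from_post_id for m in migrations if m.from_site == site
    }
    return [
        p for p in posts
        if p.site == site and p.post_id not in migrated_away
    ]


# Made-up example data: post 1 moved from Stack Overflow to Server Fault.
posts = [
    Post("stackoverflow", 1, "How do I configure RAID?"),
    Post("stackoverflow", 2, "How do I reverse a string?"),
    Post("serverfault", 10, "How do I configure RAID?"),
]
migrations = [Migration("stackoverflow", 1, "serverfault", 10)]

so_ids = [p.post_id for p in posts_for_dump("stackoverflow", posts, migrations)]
sf_ids = [p.post_id for p in posts_for_dump("serverfault", posts, migrations)]
```

With this made-up data, the migrated question is dropped from the Stack Overflow result and kept only in the Server Fault result, which mirrors the "most recent edit wins" reasoning in the comment.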

