Free, Public Data Sets (jacquesmattheij.com)
262 points by iisbum 109 days ago | comments


bravura 109 days ago | link

get.theinfo is the best way to find data sets. They are a bunch of data hoarders who can help you: http://groups.google.com/group/get-theinfo/?pli=1

I always ask there if I can't find what I'm looking for.

Here are many more data sets. These are general data sets. Email me if you have a specific kind of data set in mind (e.g. web-as-corpus, spam, images, social, reviews, etc.). I have a big file of information.

    http://theinfo.org/
    http://infochimps.org/datasets
    http://ckan.org [Comprehensive Knowledge Archive Network]
    http://www.datawrangling.com/some-datasets-available-on-the-web.html
    http://del.icio.us/pskomoroch/dataset
    http://www.reddit.com/r/datasets/
    http://news.ycombinator.com/item?id=1242029
    http://www.reddit.com/r/opendata
    http://www.trustlet.org/wiki/Repositories_of_datasets
    http://www.daniel-lemire.com/blog/data-for-data-mining/
    http://www.quantlet.org/mdbase/
    http://datamob.org/
    http://freebase.com/
    http://infochimp.info/ics/data/ripd/www-personal.umich.edu/~mejn/netdata/
    http://www.archive-it.org/public/all_collections

    Large:
        http://www.ckan.net/tag/read/size-large
        http://www.diggingintodata.org/Repositories/tabid/167/Default.aspx
Web as corpus:

    Good instructions:
        http://corpus.leeds.ac.uk/internet.html#description
    http://sslmit.unibo.it/~baroni/bootcat.html
    http://www.drni.de/wac-tk/index.php/Documentation
    http://cleaneval.sigwac.org.uk/
    http://liste.sslmit.unibo.it/pipermail/sigwac/2007-November/...
    http://wacky.sslmit.unibo.it/doku.php?id=
    http://clic.cimec.unitn.it/marco/research.html

etc. Email me if you need more.

-----

seancron 109 days ago | link

Here's some more links to data sets:

http://radar.oreilly.com/2010/03/open-data-pointers.html

http://www.datawrangling.com/some-datasets-available-on-the-...

http://del.icio.us/pskomoroch/dataset

http://infochimps.com/collections/datamob (and the other collections on the site)

http://www.data.gov/

-----

mindcrime 109 days ago | link

If anyone is looking for more datasets, see:

http://datasets.reddit.com

http://opendata.reddit.com

and

http://www.quora.com/Where-can-I-get-large-datasets-open-to-...

for some good lists of available stuff.

-----

zipdog 109 days ago | link

The wikipedia dump is great, but I've started using http://wiki.dbpedia.org/ which has an API to query the dumps.

Thanks for these, iisbum. I wish more public data were available in db, XML or similar structures - too often I find myself scraping government sites or PDFs to get the tables I need.
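
For anyone who wants to try the DBpedia route mentioned above, here is a minimal sketch rather than a definitive client. It assumes the public SPARQL endpoint at http://dbpedia.org/sparql, assumes that endpoint accepts a `format` parameter asking for JSON, and uses an illustrative query over `dbo:City`. The helper names `build_query_url` and `rows_from_results` are made up for this example. Nothing is sent over the network here: the code only builds the request URL and shows how a standard SPARQL JSON result could be flattened.

```python
import json
from urllib.parse import urlencode

# Assumption: DBpedia's public SPARQL service lives at this address.
DBPEDIA_SPARQL = "http://dbpedia.org/sparql"


def build_query_url(sparql, endpoint=DBPEDIA_SPARQL):
    """Return a GET URL that asks the endpoint for JSON results."""
    # Assumption: the endpoint honours a "format" parameter.
    params = {"query": sparql, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def rows_from_results(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(payload)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


# Illustrative query only: a handful of resources typed as dbo:City.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city WHERE { ?city a dbo:City } LIMIT 5
"""

url = build_query_url(QUERY)
```

Fetching `url` with any HTTP client and passing the response body to `rows_from_results` should give one dict per result row, provided the endpoint behaves as assumed.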

-----

sosuke 109 days ago | link

Heh, a day after he leaves HN he makes the first page. He will still be here whether he visits the site or not.

-----

cstuder 109 days ago | link

And I recently discovered Google Refine, for cleaning up messy datasets.

http://code.google.com/p/google-refine/

-----

LiveTheDream 109 days ago | link

née Freebase GridWorks http://blog.freebase.com/2010/11/10/google-refine-previously...

-----

adw 109 days ago | link

We've got quite a lot of public economic data: http://timetric.com/.

If you're up to something in the economic data space we'd love to talk. Happy to take this to email (andrew@timetric.com) if anyone's interested.

-----

hessenwolf 108 days ago | link

I looked at the site, and I see some data, but I didn't find what I had hoped for. I couldn't find yield curves or historical exchange rates up to today (available on the ECB site in XML format). I would certainly have thought yield curves were a front-page item.

Things that would be very cool: 1. Financial statements in a database format. I know you can scrape these, but I don't know if they are available legitimately. 2. Historical implied volatilities and historical observed volatilities.
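
As a rough illustration of working with exchange rates published as XML, the sketch below parses a small embedded sample. The nested `Cube` layout and the namespaces are an assumption about how the ECB feed is structured, the date and rate values in `SAMPLE` are made up for the example, and `parse_rates` is a hypothetical helper name. The parser deliberately looks only at attributes, so it does not depend on the exact namespace.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: the layout is assumed and the values are made up.
SAMPLE = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
                 xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
  <Cube>
    <Cube time="2011-01-03">
      <Cube currency="USD" rate="1.3348"/>
      <Cube currency="GBP" rate="0.8610"/>
    </Cube>
  </Cube>
</gesmes:Envelope>
"""


def parse_rates(xml_text):
    """Return {date: {currency: rate}} from elements carrying a "time" attribute."""
    root = ET.fromstring(xml_text)
    rates = {}
    for day in root.iter():
        date = day.get("time")
        if date is None:
            continue  # not a per-day container
        # Children of a dated element hold one currency/rate pair each.
        rates[date] = {
            cube.get("currency"): float(cube.get("rate"))
            for cube in day
            if cube.get("currency") is not None
        }
    return rates


rates = parse_rates(SAMPLE)
```

The result is a plain nested dict, which is easy to load into a database table or a data frame afterwards.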

-----

adw 102 days ago | link

http://timetric.com/dataset/exchange_rates_forex_europe/ for the exchange-rate data, at least.

-----

hessenwolf 102 days ago | link

Okay - it's there...

Is it your site? Are you going to add yield curves?

-----

agentultra 109 days ago | link

What about http://ckan.org/ ?

The Comprehensive Knowledge Archive Network! Pretty sweet resource really.

-----

djsun 106 days ago | link

The CKAN software is a platform for hosting data and metadata, but as far as I see, http://ckan.org does not actually list data sets.

-----

pudo 104 days ago | link

try http://ckan.net for the data, http://ckan.org is for the software behind it :)

-----

hvs 109 days ago | link

Don't forget the Lahman Baseball Database with information from 1871-2010

http://baseball1.com/statistics/

-----

jamwt 109 days ago | link

And, for very detailed play-by-play data for decades of games, check out retrosheet: http://www.retrosheet.org/game.htm

-----

gtani 109 days ago | link

http://www.kdnuggets.com/datasets/index.html

http://lib.stat.cmu.edu/datasets/

http://datamob.org/

-----

steveklabnik 109 days ago | link

Don't forget Stack Overflow! http://data.stackexchange.com/

-----

dmpayton 109 days ago | link

Kinda surprised no one has mentioned Factual. I'm using some of their diabetes data for my side-startup.

http://www.factual.com/

-----

casperc 109 days ago | link

They write that most of the data is available for download. I can't find it anywhere though, only the various APIs. Have they removed the option of downloading the data?

-----

damoncali 109 days ago | link

http://infochimps.com also has a bunch.

-----

balakk 109 days ago | link

https://datamarket.azure.com/

Some free, some paid.

-----

svag 109 days ago | link

There is also the IMDB database, provided in various formats by IMDB itself here: http://www.imdb.com/interfaces

Edit: Although the use of this database is not free, I believe it is fine to download and experiment with it for personal use...

-----

llimllib 109 days ago | link

http://www.gapminder.org/data/

-----

Perceval 109 days ago | link

For international relations data, Correlates of War hosts a number of data sets: http://www.correlatesofwar.org/Datasets.htm

-----

toisanji 109 days ago | link

anyone know of a dataset that has dates for when companies registered or were announced in the news? For example, I would like to see the date Hacker News was launched.

-----

jmtame 109 days ago | link

i've had trouble finding geographical boundaries on neighborhoods in U.S. cities (e.g. downtown areas and residential neighborhoods). anyone know where i can find this?

-----

timr 108 days ago | link

http://www.zillow.com/howto/api/neighborhood-boundaries.htm

-----

kfranken 108 days ago | link

It's not exactly neighborhoods, but the US Census TIGER database has block and blockgroup boundaries with associated demographic data. You could probably synthesize that into "neighborhood" definitions. http://www.census.gov/geo/www/tiger/tgrshp2010/tgrshp2010.ht...

-----

ericwaller 109 days ago | link

It's not in dump format, but you should take a look at simplegeo's (free) api: http://simplegeo.com/

-----

tszming 109 days ago | link

Open Directory RDF Dump: http://rdf.dmoz.org/

-----

agbell 109 days ago | link

Non-Free Google data:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=...

This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts.
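
For readers unfamiliar with the term, the sketch below shows what a table of word n-grams and their frequency counts looks like on a toy sentence. It assumes naive lowercase whitespace tokenization, which is almost certainly not how the Google data was built, and `ngram_counts` is a hypothetical helper name.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in text."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    # zip over n shifted copies of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


counts = ngram_counts("the cat sat on the mat and the cat slept", 2)
```

Here `counts[("the", "cat")]` is 2 because that bigram appears twice in the sentence; every other bigram appears once.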

-----

LiveTheDream 109 days ago | link

I track datasets that I come across at http://www.delicious.com/tobym/dataset

-----

jcr 108 days ago | link

United Nations stats (lots of goodies)

http://unstats.un.org

some free, some paid

http://infochimps.com/

AIS Data (Marine Traffic)

http://www.aishub.net/

http://www.marinetraffic.com/ais/

And there's a great list of sources on Quora

http://www.quora.com/Where-can-I-get-large-datasets-open-to-...

-----

lkozma 109 days ago | link

http://news.ycombinator.com/item?id=1493768

-----

WildUtah 109 days ago | link

Does anybody have precinct-level election results for the USA? A set for recent elections would be great for public access redistricting apps that will become relevant this year.

-----

joubert 109 days ago | link

I have links to a few govt.-provided data sets at http://elev.at

-----

l0nwlf 108 days ago | link

OpenStreetMap data : http://wiki.openstreetmap.org/wiki/Planet.osm

Geonames : http://download.geonames.org/export/dump/

OS Open Data (UK Specific) : http://www.ordnancesurvey.co.uk/oswebsite/opendata/

-----

random42 107 days ago | link

I'd prepared (based on other datasets) a smallish movie tweet dataset. You may find it useful, if working with tweets and/or reviews.

https://github.com/mohitranka/TwitterSentimentCorpora

-----

eli 108 days ago | link

Some US Gov't data sites no one else mentioned:

http://data.govloop.com/ has data and lots of pointers to local government data.

Also I'm surprised no one mentioned Carl Malamud's site: http://public.resource.org/ - Lots of US gov't and legal data in friendly formats.

-----

fedd 109 days ago | link

do all of them have some uniform API? that would be great, ideally: query and cache all of them on demand from your own app without additional programming.

bookmarked and shared this thread.
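
The query-and-cache idea above can be sketched as a thin wrapper, under the assumption that each source can be reduced to a function taking a query string and returning JSON-serializable data. `CachedSource` and `fake_fetch` are hypothetical names for this illustration, not an existing library, and the fetcher is a stand-in so that nothing touches the network.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class CachedSource:
    """Wrap any fetch function behind one query() call with an on-disk cache."""

    def __init__(self, name, fetch, cache_dir):
        self.name = name
        self.fetch = fetch  # callable: query string -> JSON-serializable data
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, query):
        # One cache file per (source name, query) pair.
        key = f"{self.name}:{query}".encode("utf-8")
        return self.cache_dir / (hashlib.sha256(key).hexdigest() + ".json")

    def query(self, query):
        path = self._path(query)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        data = self.fetch(query)  # cache miss: go to the real source
        path.write_text(json.dumps(data), encoding="utf-8")
        return data


calls = []


def fake_fetch(query):
    """Stand-in for a real data-set API; records each call it receives."""
    calls.append(query)
    return {"query": query, "rows": [1, 2, 3]}


source = CachedSource("demo", fake_fetch, tempfile.mkdtemp())
first = source.query("population")   # fetched from the source
second = source.query("population")  # served from the cache
```

Each real data set would still need its own small fetch function, so this only gives a uniform calling convention, not a uniform schema.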

-----

random42 107 days ago | link

Is there anyplace I can find a _small_ free web spam dataset? (for commercial use, sorry :( )

All the datasets I found on the web are huge (double-digit GBs).

-----

pwenzel 109 days ago | link

For those interested in transit data, check out the GTFS Data Exchange, a directory of many agencies' scheduling and map data, following the Google Transit Feed Specification.

http://www.gtfs-data-exchange.com/

-----

nico_h 108 days ago | link

http://www.naturalearthdata.com/ From the website : Natural Earth is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales as tightly integrated vector and raster data ...

-----

kaffeinecoma 109 days ago | link

This is a real treasure to come across. I hope we'll keep seeing jacquesm's blog postings here.

Anyone know of any publicly available song lyric databases?

-----

nivertech 108 days ago | link

I'm looking for free, public-domain, large, high-resolution imaging datasets.

Something like satellite imagery, medical imaging, semiconductor mask and wafer photos or CAD files, etc.

Any pointers?

-----

brainid 108 days ago | link

Here are medical imaging datasets I am aware of:

Neuroimaging (see http://www.nitrc.org for others)

OASIS http://www.oasis-brains.org/

ADNI http://adni.loni.ucla.edu/ (huge dataset, requires application)

OpenfMRI http://openfmri.org/

EEG http://eeg.pl/epi

Some other applications, for example CT colonography: http://www.acrin.org/

-----

mcauser 108 days ago | link

Heaps of useful info: http://www.nationmaster.com

-----

wladimir 109 days ago | link

Wow, useful stuff. This thread goes into my bookmarks.

-----

youknow 108 days ago | link

CIA World Factbook (demographics, geography, communications, government, economy, military stats of countries):

https://www.cia.gov/library/publications/download/

-----

_topher 109 days ago | link

Thank you all for posting links and links to links to datasets, I have an unrelenting interest in data aggregation and machine learning, and didn't even know where to start. So helpful, and I am no longer stuck. :)

-----



