The Internet Archive discovers and captures web pages through many different web crawls.
At any given time several distinct crawls are running; some run for months, and others run every day or at longer intervals.
View the web archive through the Wayback Machine.
I always ask there if I can't find what I'm looking for.
Here are more and more data sets. These are general data sets. Email me if you have a specific data set in mind (e.g. web-as-corpus, spam, images, social, reviews, etc.). I have a big file of information.
I looked at the site and I see some data, but I didn't find what I had hoped for. I couldn't find yield curves or historical exchange rates up to today (available on the ECB site in XML format). I certainly would have thought yield curves were a front-page item.
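For anyone who wants to pull the exchange rates out of the ECB XML themselves, here is a minimal sketch using only the Python standard library. The nested `Cube` layout and the namespace URI are my assumptions about how the ECB file is structured, `parse_rates` is a hypothetical helper name, and the rates in `SAMPLE` are made-up placeholders rather than real quotes.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the ECB reference-rate file; check it against the real download.
ECB_NS = {"fx": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref"}

# Tiny stand-in for the downloaded file. The values are invented for illustration.
SAMPLE = """
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
                 xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
  <Cube>
    <Cube time="2000-01-04">
      <Cube currency="USD" rate="1.5"/>
      <Cube currency="JPY" rate="100.0"/>
    </Cube>
    <Cube time="2000-01-03">
      <Cube currency="USD" rate="1.25"/>
    </Cube>
  </Cube>
</gesmes:Envelope>
"""


def parse_rates(xml_text):
    """Return {date: {currency: rate}} from ECB-style XML text."""
    root = ET.fromstring(xml_text.strip())
    rates = {}
    # Each dated Cube holds one Cube element per currency.
    for day in root.findall("./fx:Cube/fx:Cube[@time]", ECB_NS):
        rates[day.get("time")] = {
            cube.get("currency"): float(cube.get("rate"))
            for cube in day.findall("fx:Cube", ECB_NS)
        }
    return rates


if __name__ == "__main__":
    for date, by_currency in sorted(parse_rates(SAMPLE).items()):
        print(date, by_currency)
```

To use it on the real data, read the downloaded file into a string and pass that to `parse_rates` instead of `SAMPLE`.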
Things that would be very cool:
1. Financial statements in a database format. I know you can scrape these, but I don't know whether they are available legitimately.
2. Historical implied volatilities and historical observed volatilities.
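Until someone posts a ready-made volatility data set, observed volatility at least can be computed from any price history. A rough sketch, assuming the common definition (sample standard deviation of log returns, scaled to a year) and assuming 252 trading periods per year; `observed_volatility` is a hypothetical helper name. Implied volatility is a different matter, since it needs option prices.

```python
import math
import statistics


def observed_volatility(prices, periods_per_year=252):
    """Annualized sample standard deviation of log returns.

    prices: closing prices in chronological order.
    periods_per_year: 252 is an assumed trading-day count; adjust for your data.
    """
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    if len(log_returns) < 2:
        raise ValueError("need at least three prices")
    return statistics.stdev(log_returns) * math.sqrt(periods_per_year)


if __name__ == "__main__":
    # Placeholder prices, not real market data.
    print(observed_volatility([100.0, 101.0, 99.5, 100.5, 102.0]))
```

A series that grows by the same factor every period has zero observed volatility under this definition, which is a handy sanity check.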
Thank you all for posting links, and links to links, to datasets. I have an unrelenting interest in data aggregation and machine learning, and I didn't even know where to start. So helpful, and I am no longer stuck. :)