Climate e-mail hackers ‘aimed to maximise harm to Copenhagen summit’
Ben Webster, Environment Editor
E-mails alleged to undermine climate change science were held back for weeks after being stolen so that their release would cause maximum damage to the Copenhagen climate conference, according to a source close to the investigation of the theft.
Climate change sceptics obtained the e-mails by hacking into a computer at the University of East Anglia. Professor Phil Jones, director of the university’s Climatic Research Unit (CRU), has agreed to stand down during an independent review of the affair.
The first hack was in October or earlier, the source said. The e-mails were not leaked until mid-November. Sceptics allege that Professor Jones’s e-mails show that climate change data was manipulated and that scientists discussed how to suppress alternative views. The leader, terms of reference and timing of the review are expected to be announced today or tomorrow. The university has received thousands of international media calls and is concerned that the row is distracting attention from the key issues due to be discussed at Copenhagen.
Sceptics, including Lord Lawson of Blaby, the former Conservative Chancellor, have seized on the e-mails as evidence that man-made global warming has been exaggerated.
The e-mails, which were sent over a 15-year period ending on November 12, first appeared on websites run by sceptics on November 20. They were posted with a message, apparently from the hacker, which said: “We feel that climate science is, in the current situation, too important to be kept under wraps. We hereby release a random selection of correspondence, code, and documents. Hopefully it will give some insight into the science and the people behind it.”
The computer was hacked repeatedly, the source close to the investigation said: “It was hacked into in October and possibly earlier. Then they gained access again in mid-November.” By not releasing the e-mails until two weeks before Copenhagen, the hacker ensured that the debate about them would rage during the summit. Very few of the e-mails are recent. One, in which Professor Jones mentions a “trick” which could “hide the decline” in temperatures, was sent in 1999.
Bob Ward, director of policy at the Grantham Research Institute on Climate Change, based at the London School of Economics, said: “From the timing of the release of the e-mails, it seems that the intention was not just to inform the public but to undermine mainstream climate researchers and influence the process in Copenhagen.”
The Met Office Hadley Centre, which uses CRU data, said the same warming trend had been detected by two other completely independent sets of data held in the US, at the Goddard Institute for Space Studies, which is part of Nasa, and the National Oceanic and Atmospheric Administration.
Dr Peter Stott, of the Met Office, wrote in a briefing on its website that the three data sets agree that “global-average temperature has increased over the past century and this warming has been particularly rapid since the 1970s”.