A global network of experts archiving the Web for future generations.
The Spanish Web Archive is maintained by the National Library of Spain with the collaboration of regional libraries. Its purpose is to become the most complete Spanish Web repository. It contains all web sites hosted on the Spanish domain .es and also Spanish web sites hosted elsewhere (.com, .org, .edu, etc.). In order to collect the Spanish web as completely as possible, three different approaches have been followed:
1) Domain .es crawls, harvested annually from 2009 to 2013 in collaboration with the Internet Archive, using Heritrix and the Wayback Machine. Since 2016, one domain crawl per year has been run, using NetarchiveSuite.
2) Selective crawls since 2012. Since 2015, about 30 news media sites have been crawled every day.
3) Event crawls since 2011, covering events of general interest at the national level, such as general and local elections, the royal transition, etc.
Harvesting has been carried out since 2009 under the general Legal Deposit Law. Since 26 October 2015, the Royal Decree regulating the legal deposit of online publications has allowed the National Library of Spain and the regional libraries to collect Spanish websites as part of the legal deposit and make them available to the public, observing the terms of copyright law. The collection currently amounts to more than 117 TB (December 2015). The archive has not yet been launched publicly; on-site access is planned in the short to medium term.
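The domain-plus-selective approach above can be pictured as a scope rule: crawl everything under .es, plus a curated allowlist of Spanish sites hosted on other TLDs. A minimal sketch follows; the allowlist hosts are hypothetical placeholders, since the real selection is editorial rather than purely rule-based.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of curator-selected Spanish sites outside .es.
CURATED_HOSTS = {"elpais.com", "cervantesvirtual.com"}

def in_scope(url: str) -> bool:
    """Return True if a URL falls inside a .es-plus-allowlist crawl scope."""
    host = urlparse(url).hostname or ""
    if host == "es" or host.endswith(".es"):
        return True  # anything under the national TLD is in scope
    # otherwise, only curated hosts (and their subdomains) qualify
    return host in CURATED_HOSTS or any(
        host.endswith("." + h) for h in CURATED_HOSTS
    )

print(in_scope("https://www.bne.es/inicio"))  # True: .es domain
print(in_scope("https://elpais.com/"))        # True: curated host
print(in_scope("https://example.org/"))       # False: out of scope
```

In practice a crawler like Heritrix or NetarchiveSuite expresses such rules through its own scope configuration rather than ad-hoc code.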
In 2012, BAnQ undertook the selective harvesting of Web sites from Québec. BAnQ targets sites that are representative of Québec-centric activity on the Web. Sites collected can be accessed online or in BAnQ facilities, depending on the authorizations granted by publishers.
The professionals of the Grande Bibliothèque Branch, the National Library Branch and the National Archives Branch are responsible for selection, under the coordination of the Legal Deposit and Heritage Collections Preservation Branch.
In general, the sites selected are in French or English, or in a multilingual version if a French version exists.
All types of content are subject to harvesting (video, audio, image, text, etc.), unless the web crawler is unable to collect it.
Since 2006, the BnF has shared with INA responsibility for the legal deposit of French online publications and web material. The BnF web archiving program started in 2002 with the first snapshots of election websites, then continued from 2004 with a five-year partnership with the Internet Archive, which included performing annual broad crawls of the French domain and acquiring historical collections. Today, the BnF performs both domain and selective crawls internally.
As of 2015, the BnF archive consists of ca. 668 TB of data (26 billion files) collected from 1996 until the present day. The scope of this collection is the French web (the .fr domain and all material produced in France or by a publisher based in France) and combines domain, thematic and event harvests. Special collections include a range of national, local and European election harvests, along with thematic collections such as online diaries, blogs and literary websites, and activist websites documenting the social history of the Web. Eighty-five curators contribute to the selection of seeds, forming collections in most areas of knowledge, in line with the BnF's encyclopedic heritage. In addition, the BnF works with partner institutions that select seeds for certain thematic collections, including more than twenty regional libraries in France, as well as research laboratories, associations and professional organisations.
Due to legal restrictions, the BnF web archives can only be searched and browsed by researchers within the library premises in Paris.
The Columbia University Libraries web resources collection program archives selected websites in thematic areas corresponding to the Libraries' existing collection strengths, websites produced by affiliates of Columbia University, and websites from organizations or individuals whose papers or records are held in the Libraries' physical archives.
Since 2006, the Legal Deposit Law has allowed the National Library of Estonia to collect Estonian websites as legal deposit copies and make them available to the public. The owner of a site has the right to restrict access to his or her website in the public archive, but the site remains accessible to researchers in-house.
In November 2013 there were over 1,000 records of websites in the subject catalog and two special collections (Estonians Outside Estonia and Elections 2013). The archive is accessible through Wayback and consists of 1.6 TB of data (uncompressed).
The archive has been open to the public since November 2013.
The Library and Archives of Canada Act received Royal Assent on April 22, 2004. For the purposes of preservation, it allows Library and Archives Canada (LAC) to collect a representative sample of Canadian websites. To meet its new mandate, LAC began to harvest the web domain of the Federal Government of Canada in December 2005. As resources permit, this harvesting activity is undertaken on a semi-annual basis. The harvested website data is stored in the Government of Canada Web Archive (GC WA). Client access to the content of the GC WA is provided through searching by keyword, by department name, and by URL. It is also possible to search by specific format type, e.g. .pdf. At the time of its launch in Fall 2007, approximately 100 million digital objects (over 4 terabytes) of archived Federal Government website data were made accessible via the LAC website. The GC WA currently contains over 170 million digital objects and more than 7 terabytes of data.
WAX is part of the Harvard University Library's central infrastructure for the capture, management, storage, preservation and display of web sites for long-term archiving.
The Croatian Web Archive (HAW) is a collection of resources gathered as the result of Web harvesting. The Archive's mission is to collect, preserve, and make permanently accessible Croatian web resources as part of Croatia's national heritage.
From its beginning until 2011, the Croatian Web Archive carried out selective harvesting of Croatian web resources. That year, in order to complete and improve the national collection of archived web resources, the National and University Library in Zagreb decided to harvest the Croatian national domain (.hr) annually, as well as to regularly carry out thematic harvesting projects.
The Archive may be publicly accessed on the Internet. Web resources with publisher restrictions may be accessed on-site only by one user at a time.
In February 2009, Ina started the focused and selective archiving of audiovisual media-related web sites. A core list of about 5,000 web sites is regularly updated and enriched; these sites are crawled on a daily basis. Access will shortly be available on site at the Ina consultation centre, which is hosted within the research library of the François-Mitterrand site of the BnF.
The Internet Archive is a non-profit organization that is compiling a historic database of Web sites and other digital content. IA's web archives now exceed 2 PB of data (compressed) and encompass over 150 billion captures collected from 1996 to the present, culled from every domain, over 200 million web sites and more than 40 languages. This archival database expands with the Internet, growing by nearly 100 TB (compressed) every month. Usage of IA's web collections via the Wayback Machine averages 400-500 requests per second.
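Captures on this scale are typically located through CDX index lines, the lookup format behind the Wayback Machine. Below is a minimal parsing sketch assuming the common leading field order (urlkey, timestamp, original URL, MIME type, status code, digest); real indexes often carry additional fields, and the sample line is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class CdxEntry:
    urlkey: str      # SURT-canonicalised URL key
    timestamp: str   # capture time, YYYYMMDDhhmmss
    original: str    # URL as originally crawled
    mimetype: str
    statuscode: str
    digest: str      # content hash, useful for deduplicating captures

def parse_cdx_line(line: str) -> CdxEntry:
    """Parse the first six space-separated fields of a CDX line."""
    return CdxEntry(*line.split()[:6])

# Invented example line for illustration only
line = ("org,example)/ 20060101000000 http://example.org/ "
        "text/html 200 ABCDEF123456 - - 512 1024 crawl.warc.gz")
entry = parse_cdx_line(line)
print(entry.timestamp[:4], entry.mimetype)  # 2006 text/html
```

A replay service resolves a request like "example.org as of 2006" by scanning such index entries for the closest timestamp, then fetching the referenced record from the underlying WARC file.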
Recently, the Internet Archive invested in Sun's open storage server and software technologies, specifically a Sun Modular Datacenter (Sun MD), installed at Sun's Santa Clara campus, supported by the Sun MD remote monitoring service.
The new Sun MD was installed in March 2009. It is equipped with 60 Sun Fire X4500 (Thumper) Open Storage Systems that run the Solaris 10 OS, including the Solaris ZFS file system. Sun's servers with Solaris ZFS storage pools enabled the Internet Archive to double the storage capacity of its old system while using up to 50 percent less power than other servers would use.
Sun engineers monitor power, heating and cooling, fire, smoke, and water detection, and physical access points, and dispatch repair technicians, if necessary. IA Engineers manage the repository software, archival data, and access services provided to researchers and the general public.
The Internet Memory Foundation, a non-profit institution, was established in 2004 in Amsterdam under the name European Archive Foundation, to support and develop digital archives in open access. In 2010, it changed its name to Internet Memory Foundation to express its interest in preserving web content as a new medium for current and future generations. Currently, the Internet Memory Foundation hosts hundreds of terabytes of archived websites in open access, including its own collection and collections from partner institutions such as The UK National Archives, the National Library of Ireland and CERN.
The Icelandic Web Archive contains all web sites hosted on the Icelandic domain .is and many web sites hosted elsewhere that are in Icelandic or refer directly to matters of interest to Iceland.
Access to the complete Web Archive is open to the world, except for web sites where the user must pay for access and web sites that are closed at the owner's request.
The .is domain has been harvested by the National Library since October 2004, and the policy is to harvest the complete .is domain three times a year. In addition, selected web sites are harvested at least weekly, and relevant web sites are harvested for national events such as elections. Additionally, material from the Internet Archive covering .is from 1996 to 2004 is available in the archive.
The Finnish Web Archive was launched in 2006. By 2015, the size of the web archive was over 80 TB (compressed).
Annually, the National Library of Finland collects a representative sample of web pages from web servers that 1) have .fi or .ax domain names, 2) reside physically within Finland, or 3) contain material targeted at the Finnish public. The policy is to create a representative sample of web contents over time and subjects. Domain crawls are supplemented by theme- and event-based harvesting. The contents of Finnish newspapers and news sites are harvested on a daily basis. The library may request a web publisher to grant access to its web harvester (behind paywalls, etc.), or to deposit its web publications, when harvesting is not possible.
ARCHIVE'S AVAILABILITY: The contents of the archive can only be accessed from special legal deposit workstations that are available in selected libraries within Finland (including the National Library of Finland).
Anyone can use the archive but digital copying of material from the archive is prohibited.
The National Library of Sweden started to harvest the web in 1997. One part of the archive consists of bulk harvesting of the Swedish web. This collection includes both web servers located under the Swedish top-level domain .se and servers located elsewhere; the latter are identified as Swedish using geolocation. Harvesting is done roughly twice a year. A second collection comprises about 140 newspapers with a daily issue, which are harvested every day.
The archive is open to everybody but only within the library.
The Library of Congress Web Archives (LCWA) is composed of collections of archived web sites selected by subject specialists to represent web-based information on a designated topic. It is part of a continuing effort by the Library to evaluate, select, collect, catalog, provide access to, and preserve digital materials for future generations of researchers.
The legal foundation for Netarchive.dk is the Act on Legal Deposit of Published Material of 22 December 2004. In order to collect the Danish internet as completely as possible, three different strategies are followed:
1) Bulk harvesting (snapshots) four times a year
2) Selective harvesting of 80-100 sites that are frequently updated and of special importance to society (e.g. news sites)
3) Event harvesting (e.g. national and local elections).
Access to the archive is restricted to research purposes.
The National Library of Norway (NLN) started harvesting the Norwegian top-level domain (.no) on a yearly basis in 2001. A revised version of the Norwegian Act on Legal Deposit came into force on 1 January 2016; it enables the NLN to perform full domain harvests of the Norwegian top-level domain (.no), as well as to collect websites outside the .no domain that are either owned by Norwegian institutions or individuals, or adapted to Norwegian users.
Different harvesting approaches have been followed since 2001: 1) selective harvesting of web sites from 2001 to 2004 and again from 2009; 2) domain crawls once or twice a year since 2002; 3) event harvesting since 2001, for events of national interest, such as general and local elections, royal weddings, etc.
Due to privacy protection, access to the Web Archive is restricted for the time being.
The New Zealand Web Archive forms part of the Alexander Turnbull Library's collection within the National Library of New Zealand.
Access to websites is available by searching the National Library's online catalogue and then clicking on the link to the archived copy.
"Online Archiving & Searching Internet Sources (OASIS)" is a project designed to acquire online resources, such as web sites and web documents.
PANDORA is a selective archive with a broad coverage of web materials relating to the social, cultural, political and intellectual life of Australia and Australians. It includes government sites, blogs, organisational sites, examples of commercial sites, some online newspapers and collections relating to events such as elections.
PADICAT is a repository intended to collect and preserve the entire cultural, scientific and general output of Catalonia in digital format, that is, to preserve Catalan websites and to guarantee their open and permanent access.
The Biblioteca de Catalunya (Library of Catalonia), the national library of Catalonia, initiated the Padicat project in June 2005, with the technological collaboration of the Centre de Supercomputació de Catalunya (CESCA) and the support of the Secretaria de Telecomunicacions i Societat de la Informació de la Generalitat de Catalunya.
The aim of the project is to acquire, preserve and make available for coming generations the knowledge and information published on the Internet, and to create the Web archive of Catalonia.
Following the adoption of the latest Legal Deposit Law in 2006, the National and University Library of Slovenia started archiving the Web in 2007 with a mission to collect and preserve Slovenian web heritage. Until 2014, web sites were crawled selectively only. Currently, around 1,300 sites are crawled in this way, occasionally supplemented with shorter thematic crawls. Web sites are crawled at yearly to monthly frequency and, in the case of thematic crawls, up to once a day. In 2014-2015, the national domain .si was crawled for the first time. The web archive of selective and thematic crawls is accessible online to anyone without restrictions. The .si domain crawl is currently accessible to library employees only.
Archive of UK Central Government websites
The UK Web Archive is a corpus of websites selected by leading UK institutions for their historical, social and cultural significance, for the benefit of researchers. The archive is free to view and has already collected over 5,000 selected websites since it was set up in mid-2005.
The UK Web Archive is provided by the British Library in partnership with the National Library of Wales, JISC and The Wellcome Library. It also contains records contributed by the National Archives and the National Library of Scotland.
The Web Archiving Project (WARP) has been archiving websites since 2002. The National Diet Library Law, revised in 2009 and in force since April 2010, allows the NDL to archive the websites of Japanese official institutions: the government, the Diet, the courts, local governments, independent administrative organizations, and universities. Websites of cultural and international events held in Japan, and those related to electronic magazines, are also archived with the permission of their webmasters.
The archived websites are available on the premises of the NDL. Some of them are also provided on the Internet with the permission of their webmasters.
The California Digital Library operated the Web Archiving Service from 2007 to 2015, during which time 28 WAS curators at the University of California and non-UC partner institutions built over 80 public collections totaling 113 TB. In 2015 the CDL decommissioned WAS and transitioned all WAS users to the Internet Archive’s Archive-It service. All former WAS collections are accessible in Archive-It by individual UC campus or non-UC institutional name. This transition will allow CDL to reallocate its staff towards new added-value collection, preservation, and access services that can complement and enhance Archive-It use.
The KB, as national library, is responsible for collecting, cataloguing and archiving publications issued in the Netherlands. More and more publications are published exclusively in digital form, for example websites. This digital cultural heritage is under threat of becoming inaccessible in the near future. Therefore, the KB sees it as its task to collect, archive and provide permanent access to websites.
The KB's selection of Dutch websites is based on its collection policy (Dutch history, language and culture). The selection focuses on websites containing scientific and cultural content. Another area of interest is innovative websites. A subsequent step will be to extend the selection by cooperating with other Dutch knowledge institutions.
The KB uses a selective approach to web archiving for several reasons.
The National Library of the Czech Republic has been building the archive of the Czech web since 2000. The main aim of the Webarchiv is to implement a comprehensive solution in the field of archiving the national web, i.e. bohemical online-born documents. That includes tools and methods for collecting, archiving and preserving web resources, as well as providing long-term access to them. Both large-scale automated harvesting of the entire national web and selective archiving are being carried out, including thematic "event-based" collections. Selective harvests are collections of resources with historical, scientific or cultural value, manually selected by curators. This collection is accessible online thanks to contracts with publishers. Access to the selective harvests is provided to anyone online, while the rest of the archive is available only to library patrons on-site in the library building.
The new Austrian Media Law became operative in March 2009. This amendment to the law is the legal basis for web archiving and governs the collection of online publications. In principle the webpages with the domain .at and pages that have a specific connection with Austria (for example, the Austrian Cultural Institute in New York) are to be collected. The Austrian National Library will start the web archiving with a pilot phase that will then become a permanent service in 2010. Access will be possible for anyone on site at the Austrian National Library and approximately 20 other libraries in Austria.
Web Archive Switzerland is the collection of the Swiss National Library containing websites with a bearing on Switzerland. It has been integrated into e-Helvetica, the access system of the Swiss National Library, which gives access to the entire digital collection, so full-text searching of the Web Archive is now possible. However, the archived versions of websites can only be viewed in the reading rooms of the Swiss National Library and of the partner libraries who help build the collection of Swiss websites. The metadata of the archived versions can be viewed from anywhere.
The first Web archiving pilot project of the National Library of Latvia was conducted in 2005. The Archive contains legal deposit copies of Latvian websites starting from 2008. We also archive foreign websites whose content is devoted to Latvia. By the end of 2013, we had collected more than 1,600 unique websites and 3,500 target instances. Due to legal restrictions, the archived websites are available solely within the library premises.
To improve search capabilities, descriptive metadata such as title, annotation, keywords and subject headings are assigned to each website.