Web Archiving Community

Nick Sweeting edited this page Jul 12, 2019 · 137 revisions

🔢 Just getting started and want to learn more about why Web Archiving is important?
Check out this article: On the Importance of Web Archiving.


The internet archiving community is surprisingly far-reaching and almost universally friendly!

Whether you want to learn which organizations are the big players in the web archiving space, find a specific open source tool for your web archiving needs, or just see where archivists hang out online, this is my attempt at an index of the entire web archiving community.


The Master Lists

Indexes of archiving institutions and software maintained by other people. If there's anything archivists love doing, it's making lists.


Web Archiving Projects

           

Bookmarking Services

  • Pocket Premium: Bookmarking tool that provides an archiving service in their paid version, run by Mozilla
  • Pinboard: Bookmarking tool that provides archiving in a paid version, run by a single independent developer
  • Instapaper: Bookmarking alternative to Pocket/Pinboard (with no archiving)
  • Wallabag / Wallabag.it: Self-hostable web archiving server that can import via RSS
  • Shaarli: Self-hostable bookmark tagging, archiving, and sharing service

From the Archive.org & Archive-It teams

  • Archive.org: The O.G. Wayback Machine, provided publicly by the Internet Archive
  • Archive-It: Commercial Wayback Machine solution
  • Heritrix: The king of internet archiving crawlers, powers the Wayback Machine
  • Brozzler: Headless Chrome crawler + WARC archiver maintained by Archive.org
  • WarcProx: WARC proxy recording and playback utility
  • WarcTools: Utilities for dealing with WARCs
  • Grab-Site: An easy preconfigured web crawler designed for backing up websites
  • WPull: A pure Python implementation of wget with WARC saving
  • More on their GitHub...

From the Rhizome.org/WebRecorder.io team

  • Webrecorder.io: An open-source personal archiving server that uses pywb under the hood
  • pywb: The Python Wayback Machine, the codebase forked from Archive.org that powers Webrecorder
  • warcit: Create a WARC file out of a folder full of assets
  • WebArchivePlayer: A tool for replaying web archives
  • warcio: Fast streaming asynchronous WARC reader and writer
  • node-warc: Parse and create Web ARChive (WARC) files with Node.js
  • WAIL: Web archiver GUI using Heritrix and OpenWayback
  • squidwarc: User-scriptable archival crawler using Chrome
  • More on their GitHub...

From the Old Dominion University: Web Science Team

  • ipwb: A distributed web archiving solution using pywb with IPFS for storage
  • archivenow: A tool that pushes URLs to online archive services such as Archive.is and Archive.org
  • WAIL: Electron app version of the original WAIL for creating and interacting with web archives
  • warcreate: A Chrome extension for creating WARCs from any webpage
  • More on their GitHub...

From the Archives Unleashed Team


From the IIPC team


Other Public Archiving Services


Other ArchiveBox Alternatives

  • Memex by Worldbrain.io: A beautiful, user-friendly browser extension that archives all history with full-text search, annotation support, and more
  • Hypothes.is: A web/PDF/ebook annotation tool that also archives content
  • Reminiscence: Extremely similar to ArchiveBox, uses a Django backend + UI and provides auto-tagging and summary features with NLTK
  • Shaarchiver: Very similar project that archives Firefox, Shaarli, or Delicious bookmarks and all linked media, generating a markdown/HTML index
  • Polarized: A desktop application for bookmarking, annotating, and archiving articles offline
  • Photon: A fast crawler with archiving and asset extraction support
  • ReadableWebProxy: A proxying archiver that downloads content from sites and can snapshot multiple versions of sites over time
  • Perkeep: "Perkeep lets you permanently keep your stuff, for life."
  • Fetching.io: A personal search engine/archiver that lets you search through all archived websites that you've bookmarked
  • Fossilo: A commercial archiving solution that appears to be very similar to ArchiveBox
  • Archivematica: Web GUI for institutional long-term archiving of web and other content
  • Headless Chrome Crawler: Distributed web crawler built on Puppeteer with screenshots
  • WWWOFFLE: Old proxying recorder software similar to ArchiveBox
  • Erised: Super simple CLI utility to bookmark and archive webpages
  • Zotero: Collect, organize, cite, and share research (mainly for technical/scientific papers & citations)

Smaller Utilities

Random helpful utilities for web archiving, WARC creation and replay, and more...


Reading List

A collection of blog posts and articles about internet archiving. Contact me or open an issue if you want to add a link here!


Blogs


Articles

If any of these links are dead, you can find an archived version at https://archive.sweeting.me.


ArchiveBox-Specific Posts, Tutorials, and Guides

ArchiveBox Discussions in News & Social Media


Communities

Most Active Communities


Web Archiving Communities


General Archiving Foundations, Coalitions, Initiatives, and Institutes

You can find more organizations and initiatives on these other lists:

