
ArchiveBox - Self-hosted clone of Archive.org

ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).

https://archivebox.io/
https://github.com/pirate/ArchiveBox

Live demo: https://archive.sweeting.me/

level 1
222TB · 14 points · 7 months ago

It works great, but my biggest problem with it is that it only saves one snapshot per link. Sometimes it would be really convenient to have multiple snapshots per website, just like on archive.org.

If it's somehow possible with ArchiveBox and I'm just missing that feature, it'd be great if someone could tell me how to do it.

Other than that, it's great. You can export your browser bookmarks and archive them with ArchiveBox. I use my bookmarks, among other things, to save solutions to problems I run into while programming or using certain programs. Often they're small private blogs, which might get deleted, or the blog software changes so the articles move to another URL. It's great to have a local copy in those cases.
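
A rough sketch of that bookmarks workflow (the export path is just a placeholder; this assumes the ./archive CLI mentioned elsewhere in this thread):

    # export bookmarks from your browser to an HTML file, then feed it to ArchiveBox
    ./archive ~/Downloads/bookmarks_export.html

ArchiveBox parses the export file and archives every URL it finds in it.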

level 2
25TB + Cloud · 5 points · 7 months ago

I had exactly this problem.

I set the archive output directory to a git repository and commit automatically right after each archiving run, with the start and end times as the commit message.

That way I can go back to any archiving run in my history.

Is this a good approach? I don't know... Does it work? Yes, and it's simple.

Bonus points: You can simply push your whole archive to another host with git push.
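
A minimal sketch of that approach (the directory layout and the ./archive invocation are assumptions, adapt them to your setup):

    #!/usr/bin/env bash
    # run one archive pass, then commit the result with the start/end times as the message
    cd /data/archivebox                  # hypothetical archive dir that is also a git repo
    START=$(date -u +%FT%TZ)
    cat urls.txt | ./archive             # feed the URLs to archive via stdin
    END=$(date -u +%FT%TZ)
    git add -A
    git commit -m "archive run $START -> $END"
    git push origin master               # optional: mirror the whole archive to another host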

level 2
3 points · 7 months ago · edited 7 months ago

Thanks for your comments!

I'm the ArchiveBox creator (@pirate on GitHub). We're adding multiple snapshot support in the future; you can track progress here: https://github.com/pirate/ArchiveBox/issues/179

If you absolutely need to do it right now, the hacky solution is to re-add a link with #archivedtime=2019-03-18 appended (with a new date each time you want to snapshot it). ArchiveBox will treat it as a new URL and create a new snapshot each time because of the different hash at the end.
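
A quick sketch of that hack (the URL is just an example):

    # re-adding the same page with a different hash fragment forces a fresh snapshot
    echo 'https://example.com/article#archivedtime=2019-03-18' | ./archive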

level 2

It works great, but my biggest problem with it is that it only saves one snapshot per link. Sometimes it would be really convenient to have multiple snapshots per website, just like on archive.org.

This could be accomplished with cron jobs, I think.
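
For instance, a hedged crontab sketch that combines cron with the hash trick mentioned above so each scheduled run gets a fresh, dated snapshot (paths and URL are placeholders; note that % must be escaped inside crontab):

    # m h dom mon dow  command
    0 3 * * *  cd /data/archivebox && echo "https://example.com/#archivedtime=$(date +\%F)" | ./archive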

level 1
15TB · 2 points · 7 months ago · edited 7 months ago

I've run this before but really fell down at how to actually get the links into it that I want archived. I thought it would be as simple as just feeding it a URL, but it seems you have to use bookmark files. I just don't get this: why would I want to take a URL, put it into a bookmarks file, then move that file to the server running ArchiveBox? It seems counterintuitive. Or am I approaching this incorrectly?

Edit: So I'm an idiot, you just echo the URL via stdin. It might be nice to have a web GUI to achieve the same though.

I am having problems though:

:ERROR:zygote_host_impl_linux.cc(89)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180

Edit: fixed: env CHROME_SANDBOX=False ./archive
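
Putting those two edits together, the whole flow is a one-liner (the URL is just an example):

    # pipe a URL in on stdin, with Chrome sandboxing disabled when running as root
    echo 'https://example.com' | env CHROME_SANDBOX=False ./archive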

level 2
2 points · 7 months ago · edited 7 months ago

A web GUI will be added in the future; see our roadmap here: https://github.com/pirate/ArchiveBox/wiki/Roadmap#major-long-term-changes

Also, I've updated the docs to mention the sandbox error you encountered:
https://github.com/pirate/ArchiveBox/wiki/Configuration#chrome_sandbox

level 1
45TB (55TB Raw) + 2TB · 1 point · 7 months ago

I use ArchiveBox myself. It's a great tool, but the results aren't always as good as archive.org's, and there's no easy way to do multiple snapshots per site.

level 2

It was blocked by an issue that will be closed soon, so multiple snapshot support should land in the near-ish future; you can track progress here: https://github.com/pirate/ArchiveBox/issues/179

level 1
0.25PB · 1 point · 7 months ago

I am running ArchiveBox too and save the WARC files in the cloud; ArchiveBox showing those files "on the fly" works well. But what I miss is snapshots/versioning.

And some pages that block scraping through robots.txt don't show correctly.

level 2

We started ignoring robots.txt blocking recently, so if you haven't updated in a while, pulling the latest master should fix that.

You can track progress on multiple snapshot archiving here: https://github.com/pirate/ArchiveBox/issues/179

level 1
26TB · 1 point · 7 months ago

Is it possible to make it save copies of images hosted on other servers? Like if a website has content hosted on Shopify, does this save those images or just links to them?

level 2
1 point · 7 months ago · edited 7 months ago

It saves everything needed to render the page, including assets on other servers / domains.

To only save links to 3rd-party resources without downloading them, set FETCH_WGET_REQUISITES=False. https://github.com/pirate/ArchiveBox/wiki/Configuration#FETCH_WGET_REQUISITES
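
For example, as an environment variable on a single run (a sketch; the urls.txt input file is just a placeholder):

    # skip downloading page requisites (images, CSS, JS), leaving external assets as links
    env FETCH_WGET_REQUISITES=False ./archive < urls.txt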

level 1
21tbs · -1 points · 7 months ago

!remindme1month

level 2

RemindMe! 1 month
