Internet Data Is Rotting (theconversation.com)
121 points by walterbell on May 17, 2019 | 113 comments



This needs to be solved on the protocol level. Of course, the players who have control over our protocols are exactly the people who don't want this to be solved at all.

The next best thing would be to redefine what "bookmarking" is. When I bookmark a page, I want it to be permanently stored on my local machine and full-text indexed. In fact, it's rather ridiculous that after 25 years browsers don't have anything of this sort. Unfortunately, the most popular browser in the world is controlled by the same people who control our protocols.

If I ever get the energy, I will attempt to write a browser extension for this.
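
A rough sketch of what I have in mind, assuming Firefox's MV2 WebExtension APIs (bookmarks, tabs, downloads, plus host permissions in manifest.json); real full-text indexing would take considerably more than this:

    // background.js -- sketch: save a rendered copy of a page whenever it is bookmarked.
    browser.bookmarks.onCreated.addListener(async (id, bookmark) => {
      if (!bookmark.url) return; // folders have no URL
      const [tab] = await browser.tabs.query({ url: bookmark.url });
      if (!tab) return; // page not currently open in a tab
      // Grab the rendered DOM (after JavaScript has run), not the raw server response.
      const [html] = await browser.tabs.executeScript(tab.id, {
        code: "document.documentElement.outerHTML",
      });
      const blob = new Blob([html], { type: "text/html" });
      await browser.downloads.download({
        url: URL.createObjectURL(blob),
        filename: `bookmarks/${Date.now()}.html`,
        saveAs: false,
      });
      // Full-text indexing (e.g. extracting text into IndexedDB) is left out here.
    });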


Shameless plug: I built an app for that. It's not super polished, but it is at a point where I use it to store technical documentation that doesn't ship with the software (aka docs for every JavaScript package I use). It works for moderately complex JavaScript apps as well.

https://github.com/CGamesPlay/chronicler


Check out https://blog.webmemex.org/. It serializes pages, inlining their external resources, full-text indexes them, and saves them locally. Browser extensions are available for Firefox and Chrome.


You're describing IPFS pinning.

https://docs.ipfs.io/guides/concepts/pinning/


Best use of time and money is probably http://pinboard.in/ ?

Seems like someone I can trust not to sell out and shut down.


But if they die?


...or just run out of money?

Storage must be local, physically user-controlled. All online stuff is a backup only.


On top of this there is currently a push for a mix of censorship and 'compulsory removal' on the internet. For instance, the Canadian Supreme Court ruling that Canada can force Google to remove results worldwide for something, the UK censoring increasingly arbitrary things, or various 'right to be forgotten' type laws. And of course there are also intellectual property lobbyists and special interests pushing for ever more draconian rules on that front.

Any centralized online site that aggregates other data has an, at best, unclear future ahead of it. Really though, centralized is the key word. The internet so desperately needs to migrate to decentralization but we're collectively about as inertial as an 882 foot long, 46,328 ton tanker, built in 1909.


Why can't the backup be local and user-controlled?

Revenue-wise it looks like he's doing just fine. https://blog.pinboard.in/2017/07/eight_years_of_victory/

https://news.ycombinator.com/user?id=idlewords


Saving data locally is not a protocol problem.

Chromium is almost there.

Chrome had "chrome://flags/#show-saved-copy", which would do what you want if you could make Chrome's cache arbitrarily large and persistent, but it appears to be gone now.


Check out Beaker Browser. Solves this problem on the protocol level and bookmarking automatically stores a copy.


Use the Wayback Machine browser plugin, which lets you conveniently use the "save now" function to immediately save the page to archive.org: https://archive.org/web/


There used to be a Firefox extension called Scrapbook. I think it was one of the winners of one of the first two extension competitions.

It could easily download entire sites or just a subtree, including resources. It would then rewrite URLs as necessary to point to the downloaded documents instead of pointing to absolute URLs on the Internet.

Still hoping that Mozilla will improve its extension API. I mean, I can see the reason for not allowing everyone to poke around everywhere under the hood, but the current API seems rather anemic.


Been using it for so long. Currently it's Scrapbook X under Pale Moon (that I keep around just to run that).

Looking for an alternative that would import my thousands of saved pages.


Maybe I don't understand what you are saying, but I save webpages all the time. Safari then creates a .webarchive file with everything in it. Isn't that what you want?


No. Because that's extra work. It should just be automatically done whenever I bookmark a page. In fact with disks so big now I could just dedicate a terabyte to just caching everything I visit. Make that indexed and searchable and a lot of what I use would be forever available offline. Pages could expire automatically if unvisited for some period of time, perhaps weighted by the number of times they were visited.

Perhaps I should try to resurrect my rusty and incomplete Firefox add-on skills.
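
The expiry rule could be something as simple as this sketch (the numbers are arbitrary assumptions, not recommendations):

    // Keep a cached page longer the more often it was visited.
    function shouldExpire(page, now = Date.now()) {
      const DAY = 24 * 60 * 60 * 1000;
      const baseTtl = 90 * DAY; // unvisited pages live ~90 days
      const ttl = baseTtl * Math.log2(2 + page.visitCount); // frequently visited pages live longer
      return now - page.lastVisited > ttl;
    }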


https://historio.us

AFAIK there's no self-hosted equivalent. I've been using it for years. Whenever I put a link in my docs / notes I use that site to cache a copy as well.

When I first signed up I wrote a script to convert my bookmarks to a text list so I could import them. Around 25% of the pages I’d bookmarked were gone.


Tangentially related, but the reader view of a page seems ideally suited for archival bookmarking. It's unfortunate that so many sites seem to be intentionally breaking the reader view functionality in order to maximize newsletter subscriptions and eyeball monetization. Recipe sites are the worst offenders IMO.


Pocket seems to have luck where other reader views and outlining services fail, and you can set up IFTTT to download Pocket articles as HTML files to Dropbox, but that's certainly not automatic and completely out of your way. I would have preferred it to archive as a plain text document but I couldn't figure that out through IFTTT.


Are the downloaded pages in original or Pocket/Reader view?

You can export (with some pain) your Pocket list. Fetching sources from there is ... somewhat painful. I've done it a couple of times, but it's not something I relish doing.


Pocket Premium stores the web pages on a server. If there was a way to batch export and access them locally, I'd gladly pay for it.


I’ve been telling people this for years. Bookmarking is for dynamic data. You bookmark the weather website. You don’t bookmark recipes. Personally I use Evernote and its web clipper. Unfortunately it doesn’t download images, but I wrote a little program using their API to replace the img tags with embedded resources.
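
The general idea, stripped of the Evernote specifics, is roughly this Node sketch (Node 18+ for its built-in fetch; the regex matching is a simplification of real DOM parsing):

    // inline-images.mjs -- replace <img> sources in a saved page with embedded data: URIs.
    import { readFile, writeFile } from "node:fs/promises";

    async function inlineImages(html) {
      const srcs = [...html.matchAll(/<img[^>]+src="([^"]+)"/g)].map((m) => m[1]);
      for (const src of new Set(srcs)) {
        try {
          const res = await fetch(src);
          const type = res.headers.get("content-type") ?? "application/octet-stream";
          const b64 = Buffer.from(await res.arrayBuffer()).toString("base64");
          html = html.split(src).join(`data:${type};base64,${b64}`);
        } catch {
          // If the image host is already gone, keep the original URL.
        }
      }
      return html;
    }

    // Usage: node inline-images.mjs saved-page.html self-contained.html
    const input = await readFile(process.argv[2], "utf8");
    await writeFile(process.argv[3], await inlineImages(input));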


wget -r ?


No. There are some cases where it is useful to download many pages in a batch, but what I am talking about is, effectively, partial local replication. Bookmarking I describe should create a tiny (but useful) version of the web on your computer.

It needs to be seamless. It needs to be searchable. It would be incredibly useful if it would capture relationships between pages (links) in addition to pages themselves to navigate offline.

I can see many use cases for a kind of "super bookmark mode" where the browser automatically stores all the pages you visit within a certain domain.


I would loooove this. Open one of your bookmarks and you can't tell it's the local copy unless you looked at the local path sitting in the URL bar. Maybe make an option to fetch a live copy though for sites like the front page of HN that are in constant turnover, but it would be great. Storage has never been cheaper, let me hoard!

You can even take it a step further and create a viewable timeline of all the pages you've visited, in case you have something on the tip of your tongue but need to retrace your steps and logic to get there. Browsing history is kinda lackluster for this. My last ten entries in Firefox are three Hacker News articles with "Add Comment" dispersed half a dozen times in that list. If I had something along the lines of Zotero's timeline for papers I could probably find stuff way easier.


HTTrack does most of that, minus "seamless". You could probably write an extension that automatically sent bookmarked urls to HTTrack for mirroring/archival, then they would all be browsable offline.


You can't seamlessly replicate a REST server.



> As of last fall, its Wayback Machine held over 450 billion pages in 25 petabytes of data. This would represent .0003% of the total internet.

> Universities, governments and scientific societies are struggling to preserve scientific data in a hodgepodge of archives, such as the U.K.‘s Digital Preservation Coalition, MetaArchive, or the now-disbanded collaborative Digital Preservation Network.

Like any conservation work, the benefits are incredibly easy to ignore - until something goes awry / stops getting funding and suddenly it's too late. Consequently it's easy to have a myopic view of the issue.

These organizations are doing very important work, and I hope that internet users and governments don't take them for granted.


We should aim at being able to browse the Internet by date. We're moving everything there while ignoring the fact that, as it stands, there is no built-in permanence. We've grown accustomed to permanence since Gutenberg: it wasn't that easy to lose every copy of an important document. Now it is; things disappear and we're drifting into a cultural bubble that's impossible to trace back.

The Internet Archive is doing God's work, but it's not enough. If you don't have the URL of a site that is gone, you probably won't find any reference to it after every online hyperlink to it has disappeared as well. It might then become inaccessible after a while: stored, yet gone anyway.


"Browsing by date" is pretty much Brewster Kahle's ideal, and is what the Internet Archive's Wayback Machine approximates, thanks in large part to WARC storage.

I'd also like to see a distinction between the idea of web servers, which are really publishers, and where archives are kept. Ideally not all in one single store, a/k/a the Internet Archive, but replicated fairly widely.


> We're accustomed to that after Gutenberg: it wasn't that easy losing every copy of an important document.

The effort has been made. Look at what happened with Tibet:

> Caroe pulled off a rather notorious subterfuge in order to buttress the British claim to Tawang: he published the Simla Convention for the first time in 1938 with a note misrepresenting that it had included settlement of the border (and alienation of Tawang); and he arranged for the publication of official Survey of India maps that, for the first time, showed the McMahon Line as the official boundary. To advance the narrative, he also corresponded with commercial atlas publishers to put the McMahon Line on their maps as well.

> In a telling indication of Caroe’s jiggery-pokery, to avoid the awkward question of why he was first publishing the Simla Convention twenty-four years after the fact in 1938, he instead arranged for the surreptitious printing of a spurious back-dated edition of Aitchison, deleting the original note about the Chinese government’s non-signature, and replacing it with a lengthy note stating, quite falsely, that “The [Simla] Convention included a definition of boundaries…”

> Since 1) the McMahon Line had been concluded in secret bilateral negotiations between Tibet and Great Britain outside the Convention and 2) the Chinese had officially refused to recognize any bilateral agreement, boundary or otherwise, between Tibet and Great Britain and 3) had declined to sign the Simla Convention itself and 4) had notified Great Britain in 1914 that the specific sticking point was “the boundaries” this was hoo-hah.

> The replacement copy was distributed to various libraries with instructions to withdraw and destroy the original edition.

> The subterfuge was only discovered in 1963 when J.A. Addis, a British diplomat, discovered a surviving copy of the original edition at Harvard and compared it to Caroe’s version.

( http://www.unz.com/plee/the-myth-of-the-mcmahon-line/ )

Wikipedia confirms this, if you look hard, in a shockingly non-judgmental way:

> Simla was initially rejected by the Government of India as incompatible with the 1907 Anglo-Russian Convention. The official treaty record, C.U. Aitchison's A Collection of Treaties, was published with a note stating that no binding agreement had been reached at Simla. Since the condition (agreement with China) specified by the accord was not met, the Tibetan government didn't agree with the McMahon Line.

> The Anglo-Russian Convention was renounced by Russia and Britain jointly in 1921, but the McMahon Line was forgotten until 1935, when interest was revived by civil service officer Olaf Caroe. The Survey of India published a map showing the McMahon Line as the official boundary in 1937. In 1938, the British published the Simla Convention in Aitchison's Treaties. A volume published earlier was recalled from libraries and replaced with a volume that includes the Simla Convention together with an editor's note stating that Tibet and Britain, but not China, accepted the agreement as binding. The replacement volume has a false 1929 publication date.

( https://en.wikipedia.org/wiki/Simla_Accord_(1914) )


Things have gotten far worse in China right now.

Baidu Tieba, which could be considered the Reddit of China, just made all posts from before 2017-01-01 inaccessible. A number of other online forums are doing the same thing for political reasons.


Would you elaborate on the political reasons?


If you want to rewrite history it doesn't help to have lots of old copies lying around. Any data not under government control is a threat.


Isn't Baidu under government control anyway?


Yes, but the effort to go through and scrub that much data to ensure only the ideas you want out are getting out would be massive. Seems easier to just shut the door and board up the room than to try and clean it.


If zshbleaker doesn't respond, I'd guess that over time what's acceptable and not acceptable changes (e.g. opinions about particular people who may have fallen out of favor, or Taiwan policy toggling between bellicose and conciliatory, etc.), so rather than getting in trouble for providing access to "inappropriate" content you simply ditch it.


Back in the mists of time, I used to use wwwoffle proxy. It was great for low-latency links, but also had the benefit of keeping an offline archive of whatever you'd browsed.

Project's still there, although not sure how well it does with the modern web.

http://www.gedanken.org.uk/software/wwwoffle/

There are a bunch of more modern variations too:

https://archivebox.io/ - "Your own personal internet archive"

https://getpolarized.io/ (as seen on HN previously)

https://github.com/kanishka-linux/reminiscence

https://github.com/fake-name/ReadableWebProxy


Sadly, a lot of old-school proxies (squid, privoxy) are stymied by SSL/TLS connections.

I think we're due for the idea that a proxy can be designated as a trusted intermediary, most especially if it's run on a personal basis. I'm sure this presents security issues, but it also avoids some.


> I think we're due for the idea that a proxy can be designated as a trusted intermediary, most especially if it's run on a personal basis.

We have that idea now; you designate the proxy as a trusted intermediary by accepting its certificate. The chain looks something like this:

    You: browser, take me to https://youtube.com
    Browser: proxy, get me https://youtube.com
    Proxy: YouTube, get me /
    YouTube: I'm youtube.com -- here's a certificate signed
             by the government of Egypt that proves it. And
             here are the contents of /
    Proxy (to browser): I'm youtube.com -- here's a
                        self-signed certificate attesting to
                        that. And here are the contents of /
    Browser (to user): SECURITY ALERT! SECURITY ALERT!
Configure your browser to accept that certificate, and your proxy can handle its own connection to youtube and just pretend, to your browser, that it is youtube.


Does Chrome's (and other browsers') defense against MITM attacks prevent this? That's my understanding.

https://comodosslstore.com/blog/google-chrome-63-will-warn-y...


I'm answering based mostly on having read that link. It looks like the protection applies only in the case where an error is being surfaced. The problem Chrome wants to address is that users will click past the SECURITY ALERT.

If you properly configure your own CA, then the TLS error triggering this behavior won't occur, and there is no security problem for Chrome to put its foot down on -- your proxy is providing a valid certificate for whatever domain, as far as Chrome is concerned, not an invalid one.

Compare https://support.portswigger.net/customer/portal/articles/178... .

> The Chrome browser picks up the certificate trust store from your host computer. By installing Burp's CA certificate in your computer’s built-in browser (e.g. Internet Explorer on Windows, or Safari on OS X), Chrome will automatically make use of the certificate.

> When the Burp CA certificate has been installed for your built-in browser, restart Chrome and you should be able to visit any HTTPS URL via Burp without any security warnings.


Thanks. This is something I've got some plans on.


Yeah it's annoying that links get broken. But maybe it's better this way. There's something about modern tech that has turned all of us into digital hoarders. I (we?) have backups and backups of backups and redundant RAID servers with every version of every file so that no byte shall ever perish. I still have essays that I wrote in high school nearly 20 years ago. To what end? I'm partial to material minimalism. Why not data minimalism?


My common response is that while there are real-world costs to having too much physical stuff, data is pretty small to keep around. You can fit a RAID 5 box with 30 TB of space in a shoebox for $15/month; that is enough to keep pretty much any content you ever consume, so that if you want anything it still exists. My parents and grandparents hoarded files and documents to no end, and as I go through them there is mostly garbage, but there are some real gems in the rough. Willingly disposing of the internet through our own negligence is something I don't advocate, because there is some sort of value preserved in what we save for the next generation.


There are real-world costs to information hoarding as well, when it's done by people who are not you and whose incentives are not aligned with your own.

I'm glad that data is rotting on the Internet. In fact, I'll go one step farther and say that data should rot more quickly on the Internet. There's no reason that some scummy marketing algorithm should have access to my high-school social media posts. There's no reason that governments or private individuals with a grudge ought to be able to trawl through the history of everything I've written, no matter how off-the-cuff, in order to dredge up something that makes me look like an undesirable when it's taken out of context.

If an individual wants to save a particular piece of information, that should be their choice. Otherwise, by default, information ought to disappear from the Internet.


I think that in terms of what you are talking about, bit rot is good. People should have their stuff deleted after a while, but there is a sense of permanence in internet culture, and that should change to one of fleetingness. More bit rot may actually bring that about, causing more people to back up their own stuff instead of leaving it to these companies.


Oh, that's a simple one: lack of time. I too have some very old data in an archive on some old hard drive somewhere in storage. Assuming the HD still works, I could go in there, destroy the data, and trash it. But I'm fairly certain I will never find a moment to do so.

Managing what data you keep is just too time consuming. Hoarding it all is simpler.


> Then there is also a problem of software preservation: How can people today or in the future interpret those WordPerfect or WordStar files from the 1980s, when the original software companies have stopped supporting them or gone out of business?

This issue in particular we have great solutions for (open formats / text), but they are of course less profitable than only-my-app-can-read-this formats.


FWIW those particular formats are widely understood even if they are proprietary (well, at least in WordStar's case). And as long as the software runs (be it natively or via an emulator or VM), you can always open and convert/print the files (e.g. you could use vDOS to run WordStar or whatever and use its printer emulator functionality with Windows' PDF printer to create a PDF from the WordStar files).


I read somewhere that the lifespan of the average hyperlink is only about two years.

I count myself lucky I was introduced to the HTTrack archiver program many years ago and thus have complete offline copies of many of my favorite websites of the early '00s.


Can you give some examples of these 'favorite websites'? I'm interested in knowing what kind of website would be so interesting that I would want an entire offline copy of it. (Besides maybe Wikipedia)


Mostly defunct webcomics but also some of the small personal sites that documented and collected resources for particular events or strange people. A lot of those arose out of the SomethingAwful forums. For example, there was a guy named Brian who wrote batshit insane fanfiction about himself. One of these sites archived the fiction, interviews with Brian, videos, recordings of collaborative reading skype parties, etc, all neatly on one site and now safely tucked away on my drives. Now I have a little piece of nostalgia from 2004 I can step back into.


Are you able to navigate through the sites using the original links? I notice that on the Wayback Machine, internal site links only work if that particular page was also archived.


Yes. It also lets you designate a particular number of "steps" outside the original site that it will also archive. So if I give it a site and a 1-step limit, I'll get the site plus any other individual webpage linked to somewhere on the original site. It doesn't do so well with modern sites that are full of CDN-hosted content and pages that depend on data from two dozen different domains to function properly, but it's great for old pre-Web-2.0 stuff.


The Rotten Library is the first thing that comes to mind.


I run a music review website in my free time and I'm extremely envious that you were able to archive that stuff. The indie music sites of the early '00s (pre-social media) were a goldmine of non-corporate DIY journalism and analysis that simply doesn't exist anymore.



you should upload them somewhere for posterity


Similarly, even today I use wget --mirror and the Firefox addon Save Page WE to save interesting pages (the latter works for single pages, but it is useful for blog articles, etc.).


I’m okay with internet rot and you should be too. I’m not sure where we got the idea that “our data must be preserved forever”. This can be especially harmful for teens and young adults whose indiscretions now follow them forever.

Think of the privilege you had when you were younger. You could do something stupid and nobody could whip out a high def camera to record it and make it part of your history forever.

Let it rot.


I'm not OK with it, because otherwise you are whitewashing history.

For example I have recordings of the Colbert Report going back to ~2005. Some of his skits released during that time would be classified as "hate speech" in 2019. Of course he, and mainstream broadcasting companies, would love it if you didn't think about that. There are plenty of news clips and interviews where mainstream politicians (on Left AND Right) casually dismiss gay marriage. Powerful tech influencers like Mark Zuckerberg would love it if their IMs disappeared from the Internet. The examples go on and on.


I think if the last 10 years have taught us anything, it's that preserving the past does nothing to impede the changing moral zeitgeist. More records simply mean more people to attack for holding an opinion that has simply gone out of fashion. If the past decade wasn't characterized by tribalism and moral hysteria I'd be more inclined to worry about stringent preservation.

At this point, I'm not really comfortable with what we're preserving.


Embrace humanity. It's not that nice, but pretending otherwise only puts you at a disadvantage for understanding and dealing with the world.

Our history is an inseparable part of who we are, and it should not be forgotten.


Time heals all wounds, except of course wounds preserved in digital formaldehyde.


Though I bet I largely share your feelings, I feel like there is an interesting converse in terms of distant history looking back.

Things like H.P. Lovecraft renouncing racism are only preserved because he wrote so many damn letters. It doesn't make the racist stuff he wrote less racist, but it allows a really clear view of how worldviews could change at the time. Perhaps the things we are storing will have historic or philosophical value.

I miss real privacy though.


I think those are all fair points. I just wonder: now we have the same volume of content (if not quality) from many, many more people. I don't believe there's a lot of time and patience for those people, but the 'landmines' of their unpopular writing are accessible despite this.


That's just it! We should elect people whose moral code is stronger than whatever happens to be fashionable.


It's simple. Don't say ignorant things on the internet, especially when it's tied to your real-life identity. Lots of people struggle with this simple rule, because they have a compulsion to share their own myopic opinion (usually the same viral opinion as everyone else in their echo chamber) and contribute to the noise. If I were a public figure I wouldn't be on any social media platform at all; at best it wastes your already limited and highly valuable time.

That being said, there's a lot of valuable information on the internet that is absolutely worth preserving. Scrape off the layer of social media and the internet is still a place of learning and problem solving. I do my own repairs on my car, and the number of times I've found a well-written photo essay on a particular fix in a random Honda forum, only to find the linked pictures broken, is astounding and a shame to say the least.


Who gets to decide what's ignorant? What if I'm being sensible about what's ignorant, but I can't predict how society will feel in the future?


James Burke, better known for his Connections series, also produced one called "The Day the Universe Changed". In it he explores how changing understanding, and sometimes simply changing views, of the world, humans' place in it, and what is right and wrong, change the Universe (or at least our perception of it) itself.

One segment from the final episode featured on HN a few weeks ago. It references the burning of witches, in Scotland, from earlier in the same episode. Burke's point was that a set of beliefs which had once carried legal force no longer does. That route can run in reverse, and it's quite possible for beliefs, or actions, previously legally sanctioned (or at least not criminalised) to become otherwise. Prohibitions against ex post facto laws limit this in many jurisdictions, but changes in social mores are less forgiving. I've lived long enough to see some remarkable changes along several such lines, and have witnessed social and political upheavals elsewhere that are even more pronounced.

There's the question of whether or not it's the fact of a durable record, or the abuse of power through that record, that is the real root of the problem. As Aesop noted, any excuse will do for a tyrant. But the persistent record does seem to pose certain problems. Particularly against a shifting set of judgements.


> It's simple. Don't say ignorant things on the internet, especially when it's tied into your real life identity.

Kind of like saying “People who go to prison just shouldn’t commit crimes.” I don’t believe in making a case equivalence here.

Being publicly prosecuted has just as much to do with “who” you are as “what” you did. I’m personally not fine with this distinction.


> Some of his skits released during that time would be classified as "hate speech" in 2019.

Yeah, I'm going to need some proof for that. Eddie Murphy's '80s HBO standup routines are available in full on YouTube, and some of that material is absurdly homophobic. He isn't a pariah by any means.

People do see nuance and understand that values and mores change over time, and what might have been socially acceptable at one point, is no longer so.


Eddie Murphy is black, like Kevin Hart. GLAAD and the left are very careful about calling out the homophobic views of black people; instead of calling for their resignation or "cancelation" as they would for a white person, they demand that "this be a teachable moment".

> GLAAD: Kevin Hart ‘Shouldn’t Have Stepped Down’ as Oscars Host, but Used Gig to Bring Unity

https://www.indiewire.com/2018/12/glaad-kevin-hart-oscars-ho...


You'll have to provide something more concrete than "the left lets black people off easy" when your parent comment indicates that Stephen Colbert, a white man, is supposedly getting off easy as we speak.


Why is it so important to blame someone now for something they haven't said publicly in over 13 years?


Because he said it on a public broadcast. And I believe public broadcasts are the least likely things to be deleted from the internet. Private stuff, or things said in a small circle of social-network friends, are much less important.


Personally I reserve the unalienable right to change my mind. What do you do when you’re wrong?


The scariest people are the ones who never change their mind on anything.


If you do it in a very public way then other people also have a right to very publicly call you out for that. Nothing personal or binary, just typical public character assassination


> For example I have recordings of the Colbert report going back to ~2005. Some of his skits released during that time would be classified as "hate speech" in 2019.

Sanity check for 2019:

* "hate speech" should be a link

* when I click it, it should accordion out into links to recordings of Colbert Report that you classified as "hate speech" queued up to the relevant skits

* when I click one it should start streaming the data from all the peers who care about preserving old Colbert Report videos

* Copyright law should have changed a decade ago to protect people who rebroadcast exact copies of data that the copyright owner/licensee initially publicly broadcast (where "exact" is verified by checking hashes)


That's fine only if these records are used as broad research tools that help provide context to a period in history or a public figure's life story.

If it's just going to be used as witch-hunt fodder, however, then you can fuck right off with that.


The character from the Colbert Report was a parody.


I encountered rot today when I was trying to repair a set of speakers. I found the guide on some audiophile forum, step by step, with pictures for every removed screw and multiple angles. Only, the pictures were originally hosted on Photobucket and have since been retroactively removed.

Some Reddit users are egregious about it, installing scripts that overwrite their comments after x amount of time, seemingly oblivious to the fact that every edit on Reddit can be found through archival tools. The solution is to take better care not to conflate your anonymous online persona with your real-life persona: just don't post identifying information publicly online and you will be head and shoulders above many users on the internet in terms of privacy. There's no need to purge the internet of its collective knowledge and history.

We are very lucky to have the Wayback Machine keeping this stuff from disappearing into the void, but it doesn't cache everything on the internet, especially if the forum I visited has shut down and become impossible to find in a search result.

Side note: Is there an extension or bookmarklet available to automatically pull a web archive?


Rot may actually be useful by allowing out-dated, inaccurate, or useless information to be removed or fall into obscurity.

Correct, useful, or noteworthy information is often re-posted and copied, leading to preservation.


It should still be at the discretion of the reader what is correct and valid. In school we are given the opportunity to learn how to think critically, vet sources, and seek out expert authorities on a matter (whether people practice these skills after their essay-writing days are done is another can of worms). There is a lot of out-of-date material in libraries, for instance, but there is value in archives. Plus, who is going to copy a niche forum thread and repost it elsewhere? Most people just paste a link when they share information on the web rather than clumsily copying and pasting the full content. I bet whoever originally wrote that guide over a decade ago has no idea that it has rotted into uselessness. That information is gone unless someone knowledgeable spends their time crafting another guide, repeating work that was already done a decade ago.

There is value in archiving things beyond just relying on the small % of things that get actively reposted elsewhere. A good local caching solution of everything you've viewed on the web would be extremely useful for your personal reference, especially when sites frequently close and links die.

If you really wanted to fantasize, you could seed your cached site data over BitTorrent to maintain a common web archive with everyone on the internet: free from reliance on one lumbering archival corporate entity that might disappear on the whim of a VC or shareholder, free from commercial profiteering, free from the ad men and anyone else whose job it is to pollute and dilute technology for profit, and firmly in the hands of the users of the internet.


You are potentially conflating two different issues.

> Think of the privilege you had when you were younger. You could do something stupid and nobody could whip out a high def camera to record it and make it part of your history forever.

This is an excellent point, and I agree 100%. But it's not a priori obvious that we need to embrace internet rot to solve this problem. Perhaps we can work towards a future where privacy rights and digital literacy protect individuals from offense archaeology, and at the same time, work to preserve the knowledge, ideas, and sheer human weirdness that get posted every day (as long as the creators don't want it removed. My understanding is that the Internet Archive has explicitly stated that they are not interested in preserving stuff where there has been a takedown request.)

One might argue that we have to make a choice. This is touched on in the article - redundancy presents a tradeoff between persistence benefits and security/privacy risks. But we should examine this tradeoff more closely and see if there isn't a healthy middle ground before we ask for one side wholesale.


Letting MySpace rot is fine.

As the article points out, a more concerning issue is that “universities, governments and scientific societies are struggling to preserve scientific data”.


There will never be enough time to analyze that data anyways. Scientists are struggling to analyze their own data already. I think that is a known failure of contemporary science: too many small experiments that are often replicating each other.


I just went on my Myspace account for the first time in probably 10 years. It was like going into a broken time capsule. Some pictures from a different age, others are gone.

Such is life?


I think that ship has sailed. All our most personal data is being archived forever competently by multiple parties.


Good thing we are going to die and it won't matter.

So let's think about what we can do for the generation that is about to be born.


Can you think of an even remotely feasible solution? As long as we have the technology, I don't see this stopping.


Europe seems to be going in the right way with the GDPR, no?


Among the handy tools that can be used to save and access present, at-risk, and/or rotted data are bookmarklets.

I've recently added two to my browser, "open in Wayback Machine" and "Save in Wayback Machine". Respectively:

    javascript:void(window.open('https://web.archive.org/web/*/'+location.href));

    javascript:void(window.open('https://web.archive.org/save/'+location.href));
This makes opportunistic archival and reference easy. There are also Wayback Machine / Internet Archive browser extensions.

(These are from the Internet Archive, not my work.)

For bulk archival, lists of URLs can be submitted to the IA's save address:

    https://web.archive.org/save/<URL>
(Used in the bookmarklet above as well.)

This can be automated with a simple shell script using any console or script-based HTTP agent, such as curl, wget, lynx, etc.
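
For instance, the same loop as a few lines of Node instead of shell (Node 18+ for its built-in fetch; the five-second pause is just politeness, not an official rate limit):

    // save-urls.mjs -- submit each URL in urls.txt (one per line) to the Wayback Machine.
    import { readFile } from "node:fs/promises";

    const urls = (await readFile("urls.txt", "utf8")).split("\n").filter(Boolean);
    for (const url of urls) {
      const res = await fetch("https://web.archive.org/save/" + url);
      console.log(res.status, url);
      await new Promise((r) => setTimeout(r, 5000)); // space out requests
    }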


I actually have a different problem -- not sure it is one that I can legally solve.

I have 10 years of lovingly curated YouTube video playlists which, now that I look into the older ones, are a barren wasteland of "Video removed" or "Video not available". It is heartbreaking. Is there any way I can prevent this from happening?


I’d download and store the videos locally with youtube-dl.


I concur, youtube-dl is what I’ve been using during the last few years: whenever I find a YT video I might want to watch again, I now immediately download it. Learned the need for that the hard way.

Check out its options here: https://github.com/ytdl-org/youtube-dl/blob/master/README.md...

With --add-metadata you can embed the YT video description in the video file. The downloaded video file name will contain the YT identifier so you can still match them back if needed.

There is another option to save the metadata to a separate JSON file if you prefer that.

To download your playlists, give it each playlist’s URL instead of the video url:

    youtube-dl --add-metadata --ignore-errors 'https://www.youtube.com/watch?v=8GW6sLrK40k&list=RDQMc4l8l2aQrNo'
That example URL includes a specific video from the list, but will download all of them. It works just the same if you only give it the `list` parameter, but all links to playlists I’ve seen point to one of their videos.

The option --ignore-errors will jump over the unavailable videos instead of stopping.

Edit to add: If you want to download your playlists as separate directories, with each video file name including its original index in the playlist, see these examples in youtube-dl’s documentation: https://github.com/ytdl-org/youtube-dl/blob/master/README.md...
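
If you have many playlists, a small wrapper can loop over them. A rough Node sketch (the playlists.txt file, this particular output template, and the error handling are illustrative assumptions on my part):

    // backup-playlists.mjs -- run youtube-dl over a file of playlist URLs, one per line.
    import { readFile } from "node:fs/promises";
    import { execFileSync } from "node:child_process";

    const playlists = (await readFile("playlists.txt", "utf8")).split("\n").filter(Boolean);
    for (const url of playlists) {
      try {
        execFileSync("youtube-dl", [
          "--add-metadata",
          "--ignore-errors",
          // One directory per playlist, files prefixed with their playlist index.
          "-o", "%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s",
          url,
        ], { stdio: "inherit" });
      } catch {
        // youtube-dl exits non-zero when some videos are unavailable; keep going.
      }
    }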


I think it's good that data is lost. Only items that someone gives enough of a duck about to save should be preserved. It's not as if physical paper content, which ends up recycled or in a landfill 99.999% of the time, is any different. It's true that digital formats change, but fighting that is the cost of preservation. A museum of software needs to also preserve the context in which software was run in order to save it from the mists of time, albeit temporarily.


I feel like you've never had to look up how to do something and found that the only decent source is entirely gone, or consists mostly of pictures, which are also mostly gone. Maybe somewhere, at some point in time, somebody saved it, but that copy gets lost and never makes its way back online where you can find it. A lot of information is being lost this way and I'm not sure why we should be fine with that.


This seems like a good time to plug running a storage server on your local network. You can pick up old workstations off eBay for $100. Stick a couple of drives in one, load it up with data to preserve, and then put encrypted backups in the cloud. Backblaze B2 is something like $0.001 per gigabyte.

It's a fun experiment with clear, practical use.


Not quite that cheap: $0.005 per GB. USD 6 per month for personal unlimited backup, though.

Trying the trial now.

Thanks for bringing it to my attention.


I hate to be this guy, but isn't this why printers and physical books exist?


If books are all we need, why did anyone bother creating the Internet?


Computers were a mistake.


The recent shutdown of Google+ was another case of this.

As one of the people helping coordinate information among those still using the platform and hoping to migrate off of it, discovering the Archive Team's GoogleMinus project this past January was a huge boost. That ended up being the largest archival project undertaken to date, 1.6 PB, and succeeded in capturing 98% of all G+ profiles, now stored at the Internet Archive.[1]

While it had long been obvious that the project was ill-starred, the shutdown announcement came as a surprise, and Google's tools, communications, and support for both individuals and, far more importantly, groups looking to continue their existence off the platform were abysmal.

I don't fault Google for killing the service -- I was surprised it survived as long as it had. I do fault Google for how they did so. And that episode was hardly the worst in history.

One of the lesser-known parts of G+ were its Communities. In the process of the shutdown we came to realise that there were over 8 million of these, about 50,000 with 1,000 or more members, of all descriptions. Many frivolous or worse, but many also not. And all stuck in a very hard spot by Google's actions.[2]

Even preservation of individual data does very little for groups, and is one of the issues we're considering in the post mortem of the G+ mass migration, intended to be of use to others.[3]

________________________________

Notes:

1. For those preferring not to have their content archived, the IA WBM respects DMCA requests, and as Google+ posts are all listed under the user's account, requesting removal is exceedingly straightforward.

2. Characteristics of number and size are collected here, compiled by me, based in part on data provided by Friends+Me: https://social.antefriguserat.de/index.php/Migrating_Google%...

3. Discussion at Reddit and elsewhere. Compilation at the PlexodusWiki. https://old.reddit.com/r/plexodus/comments/boa97x/g_migratio... https://social.antefriguserat.de/index.php/G%2B_Migration_Po...


Another problem of the present-day WWW is that even archiving all the data is far from enough to preserve the history! That's because the Web has a dual role: (a) as a protocol, or a medium of communication, and (b) as the software, or the user interface.

A good history preservation should allow you to somehow "browse" it, as if the historical system were still alive. How the website worked, how it was used: those are all part of the history. If old operating systems and programs are preserved, there is no reason not to preserve websites in this way.

Back in the old days, many systems were federated and/or distributed, which means the software and the protocol were two separate entities. You used a newsreader, which spoke the NNTP protocol to obtain news from a Usenet newsgroup. If you want to preserve history, you can (a) archive the newsreader program with source code, and (b) archive all the data on the NNTP server. That's exactly what has been done: if you load a Usenet archive into your newsreader, you get pretty much the same experience Larry Wall had browsing Usenet back in the late '80s; at worst you need to write a compatible "mock" server, but that's all. On the other hand, few of the early BBS systems have been preserved; once the server is gone, everything is gone.

The transformation to the web means that the platform (a web community), the protocol (the backend database format), and the user interface (HTML/CSS) are all tightly coupled together. It creates several problems:

(1) The "internal state" cannot be archived. A website is a system with constantly updating parameters, and often they are not stored. Simple examples: (a) On Hacker News, I cannot see what was shown on the frontpage yesterday retroactively, (b) A user changes his/her avatar, now we had no idea how the old avatar used to look like, and (c) an early user has been banned from the forum, now his/her personal profile is inaccessible, (d) on some social media platforms, sometimes a old post may be raised from the dead by renewed interests (look, how stupid this comment was!), and now suddenly it's flooded by new posts, leaving no trace of how it used to look like.

(2) The "reader/user interface" cannot be archived. You must have seen something like this: You changed the website frontend, superficially, lots of "conservative" users complained, but the point is: now the old frontend and its "look-and-feel" is lost. If it was a simple CSS file, there are chances to bring it back, but if it was a major rewrite of frontend code, now history is gone forever. And in the lifetime of a website, the design and architecture is likely to be changed many times.

As a result, even if a website and all its content are still alive, it may long since have become a shadow of its past, never mind preserving it! And currently there are two ways to archive the web, both flawed:

(1) Preserve the HTML at the surface. It's good for single pages, but you cannot browse a website this way at all. None of the buttons on the website would work.

(2) Preserve the database, for example by using the API to save posts, or by dumping the database; the frontend and reader are not preserved. Using Hacker News as an example, every single post would be archived, but it's far from the full experience; at the least you should be able to click someone's username and see all their posts.

Now that more and more websites are powered by JavaScript, the problem is even worse. You are literally running a program on your computer without any control over it. Once the platform is gone, no archive can save you.

What is the solution? I guess there's no full solution, but there are some possibilities:

(1) Wikipedia-like websites already have built-in version control, but it's very difficult to browse a historical version of the entire website. Systems like this could improve the frontend / user interface to allow a user to "lock onto" a historical date.

(2) When building an all-JavaScript website, spend some energy building a plain HTML version as well; it may help avoid the coming digital dark age.

(3) If you are going to close a website, it may be a good idea to make your internal database and codebase backups from different years publicly available, with sensitive information removed, and allow everyone to set up and run a replicated version. It's infeasible for a big website, but it may be a workable idea for a small community.

And I can imagine archaeologists in the 22nd century digging into old Reddit backup tapes and attempting to rerun the system.

But ultimately, it's a problem that needs to be addressed by protocols and software designed with archiving and preservation in mind.

---

BTW: a few weeks ago I wrote a lengthy comment on the fundamental conflict between history preservation and personal privacy, using Usenet as an example; you may find it interesting.

* https://news.ycombinator.com/item?id=19562650


And that's a great thing! There is no reason to maintain everything; in fact, the entire function of our brains is to filter useful information out of a deluge of sensory input. The internet figures out what to keep and what to throw away; the hard thing seems to be willingly making it forget stuff.


If it is rotting, then it will be a great time for scavengers and carrion feeders--maybe even the 'era of'.


I don't care. In fact, I prefer it this way. Death in species is an evolutionary advantage. So too with culture. We mustn't let it ossify.

For the whitewashing thing, it will happen anyway. Only vigilance can protect against rewriting. Websites can be altered. There is no provenance.

I'm not convinced infinite recall is useful.


I don't think it's for us to decide.

We've been mourning the loss of the Library of Alexandria for circa 1,500 years, and I can't say we've gotten radically wiser during the last 25.

Letting it all go is definitely the cheaper and more convenient attitude... for us. But we might be leaving nothing for future generations to build upon. They should have the same opportunity that we have to ignore what they want.


And this is an important era in the history of our species. We just invented an almost-free medium capable of communication across the globe. This is as big as the discovery of fire.




