To be fair, some parts of the Wayback Machine/Archive.org are still offline - initially you couldn't access the Wayback Machine at all - so it may be that this content exists (or will exist) and they just haven't restored access to that part yet.
That said, the Internet Archive is an inherently unreliable archival system because it honours robots.txt.
A domain owner can upload a "robots.txt" file telling the archive bot to omit part or all of a site from crawling (so pages a real user could browse to might be missing from the archive). Archive.org has even gone the extra step of honouring this retroactively - so the domain owner can use this to get past content purged from the archive, even if that content was archived before it was excluded in robots.txt.
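For anyone who hasn't seen the mechanism, here's a rough sketch of how a robots.txt rule plays out for a crawler that chooses to obey it, using Python's standard `urllib.robotparser`. The `ia_archiver` user-agent token, the `example.com` URLs and the `allowed` helper are assumptions for illustration only, not a description of how Archive.org's crawler is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a domain owner might publish.
# "ia_archiver" is assumed here as the archive crawler's user-agent token.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a robots.txt-obeying client may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/post"
    # The archive bot is shut out of the whole site...
    print("ia_archiver:", allowed("ia_archiver", page))
    # ...while any other client falls under the "*" rules and can fetch the page.
    print("SomeBrowser:", allowed("SomeBrowser", page))
```

The file is purely advisory: nothing stops a client from fetching the page anyway, so whether the content ends up archived depends entirely on the crawler choosing to comply.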
A true archival system would ignore robots.txt and use every possible trick to detect and bypass bot prevention measures, capturing the complete, accurate internet as a real user would see it.
The moment this recent "attack" ran on for several days, I could tell they're either willfully irresponsible with their data or something fishy is going on. Neither option is good.
Incompetence/budget/etc. is always the safer bet, no nefarious fed scheming required.
Look at the CrowdStrike outage earlier this year. They broke every customer simultaneously - they must have been YOLOing updates for a while now, and they got away with it until one bad update took it all down. Compare with Microsoft using previews, staging, etc. - even if a bad update affects some customers, the rollout halts automatically well before everyone breaks.
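To make the contrast concrete, here's a minimal sketch of a staged rollout that halts itself, as opposed to pushing to everyone at once. It is not Microsoft's or CrowdStrike's actual process; the function name, the stage percentages, the failure threshold and the `apply_update`/`is_healthy` callbacks are all made up for illustration.

```python
def staged_rollout(hosts, apply_update, is_healthy,
                   stages_percent=(1, 10, 50, 100),
                   max_failure_rate=0.05):
    """Push an update in growing waves and stop at the first bad wave.

    apply_update(host) and is_healthy(host) are hypothetical callbacks
    supplied by the caller; the thresholds are arbitrary example values.
    """
    done = 0  # number of hosts updated so far
    for percent in stages_percent:
        # Cumulative number of hosts that should be updated after this wave.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[done:target]
        if not batch:
            continue
        for host in batch:
            apply_update(host)
        failures = sum(1 for host in batch if not is_healthy(host))
        done = target
        if failures / len(batch) > max_failure_rate:
            # Halt: the rest of the fleet never receives the bad update.
            return {"halted": True, "updated": done, "total": len(hosts)}
    return {"halted": False, "updated": done, "total": len(hosts)}


if __name__ == "__main__":
    fleet = list(range(1000))
    # Simulate an update that breaks every host it touches.
    print(staged_rollout(fleet, lambda h: None, lambda h: False))
```

With a fleet of 1000 and an update that breaks everything, the first wave takes out 10 hosts and the other 990 are never touched, which is the whole point of staging.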
Another fun one I remember is [MGM Resorts](https://ww.reddit.com/r/sysadmin/comments/16owgsl/anyone_wanna_help_rebuild_mgm_resorts/), where they went down so badly they decided to start over:
> Anyone Wanna Help Rebuild MGM Resorts? They're looking for a Red Hat admin who will work every day until Humpty Dumpty is rebuilt from scratch.
> 10 hour shifts 7 days a week with people swapping in and out [...] For $100/hr