MrBaptist on scored.co
1 year ago, 9 points (+0/-0, +9 score on mirror)
To be fair, some parts of the Wayback Machine/Archive.org are still offline (initially you couldn't access the Wayback Machine at all), so it may be that this content exists (or will exist) and they just haven't restored access to that part yet.
That said, the Internet Archive is an inherently unreliable archival system because it honours robots.txt.
A domain owner can publish a "robots.txt" file telling the archive's crawler to skip part or all of a site (so pages a real user could browse to might be missing from the archive). Archive.org has even gone the extra step of honouring this retroactively, so a domain owner can use it to get previously archived content removed from the archive, even if that content was captured before robots.txt excluded it.
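For anyone who hasn't seen one, here's a rough sketch of how a robots.txt exclusion gets evaluated, using Python's standard `urllib.robotparser`. The user-agent string `ia_archiver`, the `example.com` URLs and the `/private/` path are assumptions for illustration only; I'm not claiming this is how the archive's own crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the archive crawler (assumed here to identify
# itself as "ia_archiver") is told to skip /private/, everyone else is
# allowed everywhere.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honours robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ia_archiver", "https://example.com/private/page.html"),
        ("ia_archiver", "https://example.com/public/page.html"),
        ("SomeBrowser", "https://example.com/private/page.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The point is that the check is purely voluntary: a human with a browser still gets the `/private/` page, while a crawler that chooses to honour the file never captures it.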
A true archival system would ignore robots.txt and use every possible trick to detect and bypass bot prevention measures, capturing the complete, accurate internet as a real user would see it.