Poster: Nemo_bis | Date: Jul 7, 2014 10:37am | Forum: faqs | Subject: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
There have been countless discussions of this, unsurprisingly, but as a simple volunteer I'm surprised by how little constructive criticism they contained.
The matter is well known:
https://archive.org/about/faqs.php#2
https://archive.org/about/exclude.php
http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html
The Internet Archive doesn't run for free; it has huge costs. Surprisingly low for the level of service it provides, but still huge. When you ask for more access, have you first asked yourself whether *you* would pay for the additional legal costs it may cause?
Shouldn't we instead be happy that resources have been invested in removing the six-month embargo and in allowing on-demand archival of URLs, so that now we can immediately enjoy crawls *and* request our own?
Until the Oakland Archive Policy is superseded, the Internet Archive is not going to change its policy. Is there an alternative standard one could adopt? If not, who's going to create one? Probably netpreserve.org and IFLA would need to be involved, at least.
If you don't like the current policy, work to create one that will serve the public better while providing a legal defense strong enough to safeguard the Internet Archive...
Some more links for further reading:
https://archive.org/post/407088/honoring-present-instead-of-past-robotstxt-is-illogical
https://archive.org/post/1009682/archived-pages-should-be-unaffected-by-robotstxt-changes
https://archive.org/post/1001794/retroactive-and-permanent
https://archive.org/post/433848/domain-resellers-blocking-waybackmachine
https://archive.org/post/225623/retroactive-robotstxt
https://archive.org/post/188806/retroactive-robotstxt-and-domain-squatters
https://archive.org/post/184024/robotstxt-policy-is-a-failure
https://archive.org/post/62230/retroactive-robotstxt-exclusion-different-domain-owner
https://archive.org/post/8920/cybersquatters-copyright-ownership
https://archive.org/post/602721/remove-archived-webpages-when-domain-was-in-hands-of-previous-owner
https://archive.org/post/557165/will-past-crawls-stay-removed-after-removing-robotstxt
https://archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains
https://archive.org/post/401162/parked-domains-robotstxt-disallows-viewing-of-past-content
https://archive.org/post/406315/archived-sites-being-made-no-longer-available-due-to-current-robotstxt
https://archive.org/post/280486/domain-name-re-sold-robots-problem
Poster: metaeducation | Date: Mar 24, 2016 11:23am | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
> When you ask more access, have you first asked
> yourself if *you* would pay for additional legal costs
> it may happen to cause?
There are various entities I'd hope would be willing to join the fight if someone were to sue (the EFF, to name one).
Either way, it would seem there should be a way to irrevocably greenlight the Internet Archive on content. A license on the content can already do this.
For instance a Creative Commons license: if my blog is entirely CC-BY-SA content, then shouldn't the archive be able to keep it up regardless of some hypothetical later state of robots.txt? There could also be something more selective, an "Internet Archive License", so that even otherwise copyrighted sites could greenlight the archive keeping a copy.
If it has to be an opt-in process, then that's unfortunate. But I'd certainly prefer to be able to "opt-in to future domain squatters not being able to erase my existence" over having no choice at all...
Poster: Hjulle | Date: Mar 4, 2015 12:50am | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
None of the documents you refer to says that a new owner should be allowed to remove the old owner's content from the Internet Archive. Allowing that makes no sense, but it's still the way it works right now.
This will also become a growing problem as more and more webmasters die (or otherwise become unable to pay for their domains). If a domain changes owner, the new owner should not have any power over the old owner's content.
Poster: Hjulle | Date: Mar 4, 2015 12:54am | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
This page
http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html
doesn't even say that the bot obeys "User-agent: *" (it requires "User-agent: ia_archiver"). So that is a second way in which the current practice is more restrictive than the Oakland Archive Policy requires.
A reasonable compromise would be to make "User-agent: *" affect only current crawling, and make "User-agent: ia_archiver" retroactive. That way, you wouldn't remove history by mistake, but you could still remove it just as easily, and you wouldn't have to change any of the policy documents.
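A sketch of what such a robots.txt could look like under the proposed semantics (the directives themselves are standard; only the retroactive reading of the ia_archiver section is the hypothetical part of the compromise):

```
# Wildcard rule: under the proposed compromise, this would only stop
# *future* crawling, leaving already archived snapshots visible.
User-agent: *
Disallow: /

# Explicit rule for the Wayback Machine's crawler: only this section
# would additionally hide previously archived snapshots.
User-agent: ia_archiver
Disallow: /
```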
Also note that "The Robot Exclusion Standard does not mention anything about the '*' character in the Disallow: statement." -
https://en.wikipedia.org/wiki/Robots_exclusion_standard#Universal_.22.2A.22_match
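For illustration, Python's standard urllib.robotparser demonstrates the customary matching rules under discussion: a specific "User-agent: ia_archiver" group takes precedence over the "*" wildcard, and the wildcard applies to ia_archiver only when no specific group exists. (This is just a sketch of how robots.txt groups are matched in general; it says nothing about the Wayback Machine's retroactive interpretation.)

```python
from urllib.robotparser import RobotFileParser

# A robots.txt with only a wildcard rule.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])
# With no explicit ia_archiver group, the wildcard applies to it too.
print(rp.can_fetch("ia_archiver", "http://example.com/page"))  # False

# A robots.txt with an explicit group for ia_archiver.
rp2 = RobotFileParser()
rp2.parse([
    "User-agent: ia_archiver",
    "Allow: /",
    "User-agent: *",
    "Disallow: /",
])
# The specific group takes precedence over the wildcard.
print(rp2.can_fetch("ia_archiver", "http://example.com/page"))  # True
```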
Poster: CogDogBlog | Date: Jun 28, 2016 11:55am | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
Fourteen years' worth of my early web work in education (1993-2006) has vanished from the archive, reportedly because of robots.txt. However, it's not an inclusion or exclusion problem: some IT person mangled a DNS forwarding entry, and the domain for the archive no longer resolves to anything.
So if robots.txt is not found at all, the IA wipes it out? Hardly archival, to my simple mind. The full story:
http://cogdogblog.com/2016/06/dont-archive/
Poster: Nemo_bis | Date: Mar 4, 2015 1:33am | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
Your interpretation that "*" does not (or should not) imply "ia_archiver" for the purposes of the Oakland Archive Policy is an interesting one, but let me say it's a bit adventurous. It might be a way out legally speaking, but it's not self-evident.
Just think of all the emails and support requests that might come from webmasters confused by the (non-)interpretation of "*": increasing the workload like that would defeat the purpose. I can understand why the IA prefers a conservative (customary?) interpretation for now, and I trust them to switch to a less defensive interpretation whenever that's more sustainable than the opposite.
Poster: Menelmacar | Date: Apr 2, 2015 4:02pm | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
> Just think of all the emails and support requests that might come from
> webmasters confused by the (non-)interpretation of "*": increasing the
> workload like that would defeat the purpose. I can understand why the IA
> prefers a conservative (customary?) interpretation for now, and I trust
> them to switch to a less defensive interpretation whenever that's more
> sustainable than the opposite.
That's the thing: there's nothing customary about it. The robots.txt standard was invented to affect the *current* behavior of crawlers. Stopping or limiting current crawling was all it was ever drafted to do. As far as I've seen, it was never proposed that compliant robots would be expected to perform actions elsewhere, such as modifying existing databases.
See:
http://www.robotstxt.org/orig.html
http://www.robotstxt.org/norobots-rfc.txt
http://en.wikipedia.org/wiki/Robots.txt
The "Oakland Archive Policy" that the IA defers to (http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html) tries to use robots.txt for a purpose it was never designed for. It's a Band-Aid for the fact that there never was (and likely never will be, given the legal tangles involved) a dedicated mechanism for sites to declare whether it's OK for archiving sites to retain permanent copies.
For its part, robots.txt was never even approved as a standard by a major standards body. It's only a de facto one, which one would think (note: IANAL) might make its use in a legal context even more problematic.
It's unfortunate that there hasn't (to my knowledge) been enshrined into law a protection similar to what exists for temporary caching (http://en.wikipedia.org/wiki/Online_Copyright_Infringement_Liability_Limitation_Act#Other_safe_harbor_provisions) for cases where Internet archiving is provided to the public in an essentially unmodified form for no profit. Given the immense value of a resource like the IA to society, ideally something would be worked out to put a site like the IA on safer footing.
I think the long and the short of the problem is that the IA doesn't have the legal staff, legislated liability protection, or access to standardized authorization protocols that would put it on safer legal ground, nor enough staff to handle enormous volumes of takedown requests, so they feel they have to go to enormous lengths to be cautious.
I do wish they could at least correlate requests against WHOIS records, though. My heart sinks any time this happens. It'll definitely become a worse and worse problem as time goes on.
*Sigh* One more reason to loathe %*&^$*ing domain squatting. (Sorry, "domain parking". Ugh.)
Poster: Nemo_bis | Date: Apr 2, 2015 11:32pm | Forum: faqs | Subject: Re: Customary syntax and liability
As for customary, I *only* meant the usage of "*" as a wildcard.
As for legal protection, you're quite right. I wonder if https://www.manilaprinciples.org/ would help.