Poster: Nemo_bis | Date: Jul 7, 2014 10:37am | Forum: faqs | Subject: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
There have been countless discussions of this, unsurprisingly, but as a simple volunteer I'm surprised by how little constructive criticism they contained.
The matter is well known:
https://archive.org/about/faqs.php#2
https://archive.org/about/exclude.php
http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html
The Internet Archive doesn't run for free; it has huge costs. Surprisingly low for the level of service it provides, but still huge. When you ask for more access, have you first asked yourself whether *you* would pay for the additional legal costs it may cause?
Shouldn't we instead be happy that resources have been invested in removing the six-month embargo and in allowing on-demand archival of URLs, so that now we can immediately enjoy crawls *and* request our own?
Until the Oakland Archive Policy is superseded, the Internet Archive is not going to change its policy. Is there an alternative standard one could adopt? If not, who's going to create one? Probably netpreserve.org and IFLA would need to be involved, at least.
If you don't like the current policy, work to create one that will serve the public better while providing a legal defense strong enough to safeguard the Internet Archive...
Some more links for further reading:
https://archive.org/post/407088/honoring-present-instead-of-past-robotstxt-is-illogical
https://archive.org/post/1009682/archived-pages-should-be-unaffected-by-robotstxt-changes
https://archive.org/post/1001794/retroactive-and-permanent
https://archive.org/post/433848/domain-resellers-blocking-waybackmachine
https://archive.org/post/225623/retroactive-robotstxt
https://archive.org/post/188806/retroactive-robotstxt-and-domain-squatters
https://archive.org/post/184024/robotstxt-policy-is-a-failure
https://archive.org/post/62230/retroactive-robotstxt-exclusion-different-domain-owner
https://archive.org/post/8920/cybersquatters-copyright-ownership
https://archive.org/post/602721/remove-archived-webpages-when-domain-was-in-hands-of-previous-owner
https://archive.org/post/557165/will-past-crawls-stay-removed-after-removing-robotstxt
https://archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains
https://archive.org/post/401162/parked-domains-robotstxt-disallows-viewing-of-past-content
https://archive.org/post/406315/archived-sites-being-made-no-longer-available-due-to-current-robotstxt
https://archive.org/post/280486/domain-name-re-sold-robots-problem
Poster: metaeducation | Date: Mar 24, 2016 11:23am | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
> When you ask more access, have you first asked
> yourself if *you* would pay for additional legal costs
> it may happen to cause?
There are various entities I'd hope would be willing to join the fight if someone were to sue (the EFF, to name one).
Either way, it would seem there should be a way to irrevocably greenlight the Internet Archive on content. A license on the content can already do this.
For instance a Creative Commons license: if my blog is entirely CC-BY-SA content, then shouldn't the archive be able to keep it up regardless of some hypothetical later state of robots.txt? There could also be something more selective, an "Internet Archive License", so that even otherwise copyrighted sites could greenlight the archive keeping a copy.
If it has to be an opt-in process, then that's unfortunate. But I'd certainly prefer to be able to "opt-in to future domain squatters not being able to erase my existence" over having no choice at all...
Poster: Hjulle | Date: Mar 4, 2015 12:50am | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
None of the documents you refer to says that a new owner should be allowed to remove the old owner's content from the Internet Archive. Allowing that makes no sense, but it's still the way it works right now.
This will also become a growing problem as more and more webmasters die (or otherwise become unable to pay for their domains). If a domain changes owner, the new owner should not have any power over the old owner's content.
Poster: Hjulle | Date: Mar 4, 2015 12:54am | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
This page
http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html
doesn't even say that the bot obeys "User-agent: *" (it requires "User-agent: ia_archiver"). So that is a second way in which the current practice is more restrictive than the Oakland Archive Policy requires.
A reasonable compromise would be to make "User-agent: *" affect only current crawling, and make "User-agent: ia_archiver" retroactive. That way, you wouldn't remove history by mistake, but you could still remove it just as easily, and you wouldn't have to change any of the policy documents.
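A sketch of what such a robots.txt could look like under the proposed semantics (the directives themselves are standard; only the retroactive reading of the ia_archiver section is the hypothetical part of the compromise):

```
# Wildcard rule: under the proposed compromise, this would only stop
# *future* crawling, leaving already archived snapshots visible.
User-agent: *
Disallow: /

# Explicit rule for the Wayback Machine's crawler: only this section
# would additionally hide previously archived snapshots.
User-agent: ia_archiver
Disallow: /
```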
Also note that "The Robot Exclusion Standard does not mention anything about the '*' character in the Disallow: statement." -
https://en.wikipedia.org/wiki/Robots_exclusion_standard#Universal_.22.2A.22_match
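For illustration, Python's standard urllib.robotparser demonstrates the customary matching rules under discussion: a specific "User-agent: ia_archiver" group takes precedence over the "*" wildcard, and the wildcard applies to ia_archiver only when no specific group exists. (This is just a sketch of how robots.txt groups are matched in general; it says nothing about the Wayback Machine's retroactive interpretation.)

```python
from urllib.robotparser import RobotFileParser

# A robots.txt with only a wildcard rule.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])
# With no explicit ia_archiver group, the wildcard applies to it too.
print(rp.can_fetch("ia_archiver", "http://example.com/page"))  # False

# A robots.txt with an explicit group for ia_archiver.
rp2 = RobotFileParser()
rp2.parse([
    "User-agent: ia_archiver",
    "Allow: /",
    "User-agent: *",
    "Disallow: /",
])
# The specific group takes precedence over the wildcard.
print(rp2.can_fetch("ia_archiver", "http://example.com/page"))  # True
```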
Poster: CogDogBlog | Date: Jun 28, 2016 11:55am | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
Fourteen years' worth of my early web work in education (1993-2006) has vanished from the archive, reportedly because of robots.txt. However, it's not an inclusion or exclusion problem: some IT person mangled a DNS forwarding entry, and the domain for the archive no longer resolves to anything.
So if robots.txt is not found at all, the IA wipes it out? Hardly archival, to my simple mind. The full story:
http://cogdogblog.com/2016/06/dont-archive/
Poster: Nemo_bis | Date: Mar 4, 2015 1:33am | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
Your interpretation that "*" does not (or should not) imply "ia_archiver" for the purposes of the Oakland Archive Policy is an interesting one, but let me say it's a bit adventurous. It might be a way out legally speaking, but it's not self-evident.
Just think of all the emails and support requests that might come from webmasters confused by the (non-)interpretation of "*": increasing the workload like that would defeat the purpose. I can understand why the IA prefers a conservative (customary?) interpretation for now, and I trust them to switch to a less defensive interpretation whenever that's more sustainable than the opposite.
Poster: Menelmacar | Date: Apr 2, 2015 4:02pm | Forum: faqs | Subject: Re: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
> Just think of all the emails and support requests that might come from
> webmasters confused by the (non-)interpretation of "*": increasing the
> workload like that would defeat the purpose. I can understand why the IA
> prefers a conservative (customary?) interpretation for now, and I trust
> them to switch to a less defensive interpretation whenever that's more
> sustainable than the opposite.
That's the thing: there's nothing customary about it. The robots.txt standard was invented to affect the *current* behavior of crawlers. Stopping or limiting current crawling was all it was ever drafted to do. As far as I've seen, it was never proposed that compliant robots would be expected to perform actions elsewhere, such as modifying existing databases.
See:
http://www.robotstxt.org/orig.html
http://www.robotstxt.org/norobots-rfc.txt
http://en.wikipedia.org/wiki/Robots.txt
The "Oakland Archive Policy" that the IA defers to (http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html) tries to use robots.txt for a purpose it was never designed for. It's a Band-Aid for the fact that there never was (and likely never will be, given the legal tangles involved) a dedicated mechanism for sites to declare whether it's OK for archiving sites to retain permanent copies.
For its part, robots.txt was never even approved as a standard by a major standards body. It's only a de facto one, which one would think (note: IANAL) might make its use in a legal context even more problematic.
It's unfortunate that there hasn't (to my knowledge) been enshrined into law a protection similar to what exists for temporary caching (http://en.wikipedia.org/wiki/Online_Copyright_Infringement_Liability_Limitation_Act#Other_safe_harbor_provisions) for cases where Internet archiving is provided to the public in an essentially unmodified form for no profit. Given the immense value of a resource like the IA to society, ideally something would be worked out to put a site like the IA on safer footing.
I think the long and the short of the problem is that the IA doesn't have the legal staff, legislated liability protection, or access to standardized authorization protocols that would put it on safer legal ground, nor enough staff to handle enormous volumes of takedown requests, so they feel they have to go to enormous lengths to be cautious.
I do wish they could at least correlate requests against WHOIS records, though. My heart sinks any time this happens. It'll definitely become a worse and worse problem as time goes on.
*Sigh* One more reason to loathe %*&^$*ing domain squatting. (Sorry, "domain parking". Ugh.)
Poster: Nemo_bis | Date: Apr 2, 2015 11:32pm | Forum: faqs | Subject: Re: Customary syntax and liability
As for customary, I *only* meant the usage of "*" as a wildcard.
As for legal protection, you're quite right. I wonder if https://www.manilaprinciples.org/ would help.