Working with News Publishers

Wednesday, July 15, 2009 | 2:18 PM

Last week, a group of newspaper and magazine publishers signed a declaration stating that "Universal access to websites does not necessarily mean access at no cost," and that they "no longer wish to be forced to give away property without having granted permission."

We agree, and that's how things stand today. The truth is that news publishers, like all other content owners, are in complete control not only of what content they make available on the web, but also of who can access it and at what price. This is the very backbone of the web -- there are many confidential company web sites, university databases, and private files of individuals that cannot be accessed through search engines. If they could, the web would be much less useful.

For more than a decade, search engines have routinely checked for permissions before fetching pages from a web site. Millions of webmasters around the world, including news publishers, use a technical standard known as the Robots Exclusion Protocol (REP) to tell search engines whether or not their sites, or even just a particular web page, can be crawled. Webmasters who do not wish their sites to be indexed can and do use the following two lines to deny permission:

User-agent: *
Disallow: /
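
As a rough illustration of how a crawler might interpret those two lines, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `may_crawl`, the crawler names, and the example.com URL are assumptions made for illustration; this is not how any particular search engine implements the check.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted above, held in memory instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical crawler names and URL, used only for illustration.
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, may_crawl(ROBOTS_TXT, agent, "http://example.com/news/story.html"))
```

Because the rule applies to every user agent (`*`) and every path (`/`), the check comes back `False` for both crawlers.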


If a webmaster wants to stop us from indexing a specific page, he or she can do so by adding '<meta name="googlebot" content="noindex">' to the page. In short, if you don't want to show up in Google search results, it doesn't require more than one or two lines of code. And REP isn't specific to Google; all major search engines honor its commands. We're continuing to talk with the news industry -- and other web publishers -- to develop even more granular ways for them to instruct us on how to use their content. For example, publishers whose material goes into a paid archive after a set period of time can add a simple unavailable_after specification on a page, telling search engines to remove that page from their indexes after a certain date.
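
To make the page-level tag concrete, the following is a small sketch of how a program could detect a `noindex` directive in a page's HTML, using Python's standard-library `html.parser`. The class and function names are hypothetical, and the sketch only looks for `noindex`; it does not attempt to interpret `unavailable_after` dates or reflect how a real crawler processes pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self, crawler_name: str = "googlebot"):
        super().__init__()
        # Honor both the generic "robots" name and the crawler-specific one.
        self.names = {"robots", crawler_name.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in self.names:
            # The content attribute may hold several comma-separated directives.
            for part in (attr.get("content") or "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def is_noindex(html: str) -> bool:
    """Return True if the page asks not to be indexed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

For example, `is_noindex('<html><head><meta name="googlebot" content="noindex"></head></html>')` evaluates to `True`, while a page with no such tag evaluates to `False`.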

Today, more than 25,000 news organizations across the globe make their content available in Google News and other web search engines. They do so because they want their work to be found and read -- Google delivers more than a billion consumer visits to newspaper web sites each month. These visits offer the publishers a business opportunity: the chance to hook a reader with compelling content, to make money with advertisements, or to offer online subscriptions. If at any point a web publisher feels that we're not delivering value to them and wants us to stop indexing their content, they can make us stop quickly and effectively.

Some proposals we've seen from news publishers are well-intentioned, but would fundamentally change -- for the worse -- the way the web works. Our guiding principle is that whatever technical standards we introduce must work for the whole web (big publishers and small), not just for one subset or field. There's a simple reason behind this. The Internet has opened up enormous possibilities for education, learning, and commerce, so it's important that search engines make it easy for those who want to share their content to do so -- while also providing robust controls for those who want to limit access.

Image: 'Robots wallpaper,' Jelene (Creative Commons Attribution)

Update on 7/20/2009: The word "crawling" in the fourth paragraph has been replaced with "indexing."

Posted by Josh Cohen, Senior Business Product Manager

32 comments:

The Asnah's Journey said...

Totally agree! Publishers should wake up and reconsider how to stay with the rest of the people who are always 'online', in a humble and sustainable way...

stop ego-ing and they will survive from the down side of printed media era... pray for them!

Brendan said...

Very well and politely phrased reaction. I'm interested if anyone from the new industry will respond.

gconte said...

I believe the problem is not Google, search engines, or crawlers; it is the content business model that has to be reviewed. Fragmented advertising alone does not pay for quality content (created by someone who makes a living doing so). Google will shortly become the largest content seller once publishers start to offer subscriptions or micropayments for content; that's my end game.

Andy Beard said...

You might want to be a little more accurate with your descriptions.

A page blocked with robots.txt can still appear in the SERPs, but Google won't have crawled it.
The snippet would use either link anchor text or DMOZ title for the title, and DMOZ description if available.

Noindex, Google will still crawl the page, and links can confer PageRank to other pages.
The page won't appear in the SERPs.

For Noindex to work, Google has to have access to a page, thus if you mix both the robots.txt disallow directive with meta noindex, Google will obey the robots.txt, and thus can't read the noindex.
The page can still appear in the SERPs, with title from anchor text or DMOZ.
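
As a rough illustration of the interaction this comment describes, the sketch below models the three cases as a small decision function. The function name and the outcome strings are hypothetical labels chosen for this example; it simplifies the behavior to the rules stated in the comment and is not a description of Google's actual pipeline.

```python
def search_outcome(blocked_by_robots_txt: bool, has_meta_noindex: bool) -> str:
    """Model what happens to a page, following the rules described in the comment."""
    if blocked_by_robots_txt:
        # The crawler never fetches the page, so any noindex tag goes unread.
        # The URL can still be listed, with a title taken from link anchor text.
        return "not crawled; URL may still be listed"
    if has_meta_noindex:
        # The page is fetched, the tag is read, and the page is kept out of results.
        return "crawled; not listed"
    return "crawled; may be listed"


if __name__ == "__main__":
    for blocked in (False, True):
        for noindex in (False, True):
            print(blocked, noindex, "->", search_outcome(blocked, noindex))
```

The case worth noticing is the one where both mechanisms are combined: the robots.txt rule takes effect first, so the `noindex` tag has no effect.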

basberkenbosch said...

A reaction in style. I'm eagerly waiting, as the tension builds, to see what the news industry has to say in reply.

stephenlarson said...

The problem is the Robots Exclusion Protocol is far too binary. A publisher must either allow all uses or none.

Those, like the author of this article, that want to think that the solution is so simple are not looking at the problem from all sides.

Denis Gorodetskiy said...

kick'em off, Google!

John Juliano said...

I'm surprised there was no mention of ACAP, the standard for robots.txt files that the World Association of Newspapers has been working on with, as I understood it, Google and others.

M said...

Big news media can't have it both ways.

If they want to play in the social media sandbox, they will have to play by social media rules as they evolve.

They don't want to be indexed?

Perfect.

More eyeballs for me.

Maurice Cardinal
Editor: www.OlyBLOG.com

Murilo de Souza Lopes said...

Well Google invades the sites, many people do not know how to block indexing.

We have seen cases on the Internet of sites being hacked even before launch, because Google indexed their pages without permission.

I believe that the way Google works is badly flawed in some respects, including its exploitation of other people's content. We know that small businesses cannot grow when much of their budget goes to advertising within the search engines (where the content is used so wrongly).

I cannot be unfair and say that Google is a bad company, but I believe that the internet goes beyond online advertising. The business model used by Google is no different from that of old media, and I believe it will become obsolete just as old media has.

Besides that if you have good content you are competing with low-quality sites that are only there because they are paying part of the links, even with an entire process of verification within the Adwords quality is still very imperfect, because in the world various searches and some have little content in search engines, where you can see a clear difference in quality of service quality, even if free could have a better way to find the company paying the bills.

bernd said...

Ah, you guys are cool tempered. I would have just removed Burda Media from the index after they complained.

EZ said...

I've been in the newspaper business since 1989. I was on the internet before most newspapers were, before there was even a World Wide Web. I always thought that as soon as publishers started to realize that they were losing money because they had content available for free they would just pull it back.

Some have and some haven't. The NYT has gone back and forth. The Wall Street Journal has always had a pay wall. Neither are doing particularly well.

That's because it's not about the price charged for the content; it's about selling the audience.

Newspapers are in trouble because they forgot how to sell audiences to advertisers.

When I worked as a newspaper circulation executive my goal was to make enough revenue to cover variable costs of printing and distribution, in essence making it free to distribute. Advertising sales had to cover all of the overhead - salaries, benefits, building, maintenance, presses, trucks, news gathering, etc.

This is not unlike TV, radio or even the internet. Radio and TV always gave away content because once the studio was built, the contract with the talent was signed and the transmitter was in place, there was very little cost to deliver the content.

However that content was sponsored so all of those other fixed costs could be paid with enough left over for profit, pension plans and a decent Christmas party.

This is what newspapers need to relearn. They can block Google or any other search engine from indexing their content so they can charge for it, but that's just not going to generate enough money to run the rest of the operation. Newspapers need to engage audiences and sell those audiences to people that want to reach them. Until they start doing that again they will never be profitable. People are just not going to pay enough for content to make up for all the losses in advertising dollars newspapers have seen.

Hiding from Google isn't going to help. Finding new ways to delight and amaze audiences, and proving to advertisers that your audience is delighted, will.

Thomas said...

Looks like a fair solution to me. Media have a choice.

Michael said...

Well, I'm in the news industry (and let me tell you, it's an old industry) and I also agree completely with the above.

That said, it's worth acknowledging that the newspaper/magazine publishers do have a valid argument: Google is more valuable because of their content, and maybe there ought to be a half-measure between "block all bots" and "search engines crawl everything for free."

Maybe Google thinks there's no profit for them between those two extremes. And they're probably right. But let's not pretend traditional publishers are the only ones making a choice here.

codeispoetry said...

@Brendan:
What should the news industry respond to that polite and *right* article? "Sorry, we didn't do our homework and didn't read the f....g manual?"

Two lines of code. As easy as this. Nothing else more. End of discussion.

I think the news industry has to learn that it can earn money even on the web, but under new and different conditions than in the last decades. But they have to change, and they have to have the willingness to change. Otherwise, their companies will die.

Yours,
Thomas, Munich, Germany

elmar.thiel said...

There is a fundamental difference between a) indexing web pages and then directing traffic to them and b) using (parts of) the content on your own sites.

Unfortunately, Google's reply completely disregards this aspect, which IMHO is the true core of the problem.

Elmar Thiel, Hamburg, Germany

analogue said...

Way to go Google !

Gerard said...

"Well Google invades the sites, many people do not know how to block indexing."

So, you're saying that people who know how to build a web site don't know how to block indexing? Sorry but that doesn't make any sense at all.

One of the stalking horses here is typical of Europe. Wanting someone else to pay.

What would make these news sites happy would be if they could force Google to pay them for content they won't keep off the web. A share of google ad money for content.

That's what's up here. The quest for easy money.

Patrick A. Goff said...

But what if you are a publisher of original news and feature stories and Google refuses to carry your content whilst carrying that of your competitor publications? They have an anti-European, pro-American bias. They should either publish all news stories in a subject area or none, and be impartial. Currently Google is myopic and very US oriented. They should be barred from all European news media until they operate fairly.

chartus said...

let them die in peace and rest in history

ilog2000 said...

A VERY elegant answer! Google's search results page is also an aggregator. If anyone doesn't want to be aggregated, he can easily avoid that.

Igor Loginov

Brent D. Payne said...

Hmmm . . . but what happens when Google ignores a Robots.txt protocol?

This . . .
http://healthkey.com/robots.txt

Versus this . . .
http://www.google.com/search?source=ig&hl=en&rlz=1G1GGLQ_ENUS263&=&q=site%3Ahealthkey.com&aq=f&oq=&aqi=

I see it ALL the time, btw. ;-)

Brent D. Payne
SEO Director
Tribune Company

Julius Beezer said...

I heartily concur with your article. If publishers don't wish their content to be online, there is a simple solution: take it off.

Underlying this, I wish Google would NOT serve links to material that is hidden behind toll access barriers, or at least make a browsing option that hides this from view.

When I use the internet I want FREE information. I know that there is more information out there on a topic published commercially, but if I wanted to buy that I would be looking in a library or bookshop.

It seems to me publishers want it both ways: they want Google to generate demand for their product by placing it in searches, then hide it behind toll access so they can charge for it. This isn't how the internet works: the internet is primarily about free access to information. If you don't want to play in this game, that's fine: please do go and consign yourself to irrelevance.

L ' Individu said...

Totally agree.
No publisher is forced to have a website, so get off the internet if you don't like it, but don't try to force the others to follow your dead rules. The rules of the internet have been clear since Sir Berners-Lee created the WWW a long time ago, and no businessman has the right to change them.

Lhasa-Apso said...

There is a lot you can do with robots.txt, such as specifying which directories are indexed and which are not....

Volker said...

It's not about Google; the name is just used as a synonym for the internet by people who cannot adapt to new distribution channels.

They claim that their "quality journalism" has to be protected as their source of income. The internet, and especially the search engines, help us consumers to find the source of most news, the big agencies like Reuters! Most of the "quality journalism" is copied word for word from the agencies' tickers.
In the days of dead tree publishing we rarely found out.

A lot of the rest of the "quality journalism" is copied from blogs and social websites, often without credit to the author and no payment thrown in as good measure.

I have to subscribe to the printed edition of the local paper so I can subscribe to the online edition at extra cost, this just doesn't make any economic sense! Especially when said paper has not much more to offer than the agency news.

Oh, and they have a sports journalist who used one of my pictures without my permission!

adrin said...

Great one.. Thanks for posting..

Work from home

Vijay said...

Dear Josh Cohen,

I have a site that I did "protect" from indexing, from the very beginning, with a robots.txt file, i.e.:
User-agent: *
Disallow: /
This robots.txt file is still there. However, my site content is showing up on Google.

Whilst I agree that publishers should protect their content if they don't want Google to index it, it seems to me, at this point, that Google has ignored the robots.txt file. Can you help me figure out how THAT happened and what I can do to have it fixed.

Thank you

rnbresearch said...

Hello I just entered before I have to leave to the airport, it's been very nice to read your post, it is very interesting and very informative. I liked it!!!!!!

chang said...

hi! This template is simply super.... website development