We have a sitemap-news.xml; Google Webmaster Tools seems to like it and has recognized it as a news-format sitemap. It's updated frequently with stories from the last four days (re-checked every minute, I think). The URLs are fine: each contains a number of 3+ digits that doesn't look like a year. Only one domain is involved. There is HTML navigation to the relevant pages. The stories are in HTML format (not JavaScript-embedded or anything), though admittedly we need to tweak the teaser format a bit (working on it). We have at least 25 stories a day, often many more.
No frames, English language, each article has a link to it, the main pages never change their own URLs, and the site requires no registration or subscription. Google web search seems to find many, many pages just fine. The robots.txt file has nothing exotic in it and does not specify any user-agent by name (so presumably what Google web search sees, Google News search sees too). URLs are canonicalized to imarketnews.com URLs.
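A minimal sketch of how the URL-number and robots.txt checks described above could be automated. The year range (1900-2099), the example paths, and the `Googlebot-News` user-agent string used in the usage note are assumptions for illustration, not taken from our actual site or from Google's documentation.

```python
import re
from urllib.robotparser import RobotFileParser


def has_article_number(url: str) -> bool:
    """Return True if the URL contains a run of 3+ digits that is not a plausible year.

    Assumption: a 4-digit run between 1900 and 2099 "looks like a year".
    """
    for run in re.findall(r"\d{3,}", url):
        looks_like_year = len(run) == 4 and 1900 <= int(run) <= 2099
        if not looks_like_year:
            return True
    return False


def allowed_for(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With a robots.txt that only has a `User-agent: *` section, `allowed_for` gives the same answer for any crawler name, which is the reasoning behind "what Google web search sees, Google News search sees too". For example, `allowed_for("User-agent: *\nDisallow: /private/", "Googlebot-News", "http://imarketnews.com/node/12345")` is allowed, while the same call with a `/private/` path is not.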
However, Google News seems to have picked up a grand total of one of our articles and no others; a site:imarketnews.com search shows as much. Google Webmaster Tools doesn't report any crawl errors, shows a healthy crawl total, and isn't showing anything of consequence in the diagnostics.
We brought this to the news team's attention and got quoted back the standard boilerplate (a unique URL for each article's full text, a URL with a unique number, a fixed main page URL, and HTML links). I understand that; they're busy and can't help everyone. But it seems to us that the site complies with all those requirements, none of which are particularly out of the ordinary anyway. It's just not posting to the Google News site. We've reviewed the technical requirements at http://www.google.com/support/news_pub/bin/topic.py?hl=en&topic=11665 and, as far as we can tell, complied with all of them too.
Your article content appears to consist only of isolated sentences
not grouped into paragraphs, therefore, we won't be able to crawl it. Try
formatting your articles into text paragraphs of a few sentences each.
I will have this looked into. I think I will have to have this done programmatically on the site, but I appreciate the hint and will let you know how it goes.
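As a rough idea of what "done programmatically" might look like, here is a small sketch that groups isolated sentences into paragraphs of a few sentences each. The function name `group_into_paragraphs` and the default of three sentences per paragraph are hypothetical choices; it assumes the site already has each article as a list of sentence strings.

```python
import html
import re


def group_into_paragraphs(sentences, per_paragraph=3):
    """Join isolated sentences into <p> blocks of a few sentences each."""
    # Collapse internal whitespace and drop empty entries.
    cleaned = [re.sub(r"\s+", " ", s).strip() for s in sentences]
    cleaned = [s for s in cleaned if s]

    paragraphs = []
    for start in range(0, len(cleaned), per_paragraph):
        chunk = " ".join(cleaned[start:start + per_paragraph])
        # Escape &, <, > so feed text cannot break the markup.
        paragraphs.append("<p>" + html.escape(chunk) + "</p>")
    return paragraphs
```

Fixed-size grouping is the simplest option; grouping by topic or by the feed's own paragraph markers would read better if that information is available.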
We're working out how to get more stories news-crawled. I see in the documentation that under Diagnostics in Google Webmaster Tools there can be a 'news crawl' errors section for spotting news-specific problems. I'm told we've been approved for inclusion in Google News, but I do not have the news crawl option under my diagnostics. Is 'news crawl' or 'news crawl errors' under Diagnostics in Webmaster Tools no longer available? If it is available, how can I get it enabled on my Webmaster Tools account so that I can see why specific stories aren't getting news-crawled? It would make it much easier to figure out these problems for ourselves, if it's available to us.
Okay, thank you, there it is, much appreciated. I sort of suspected that menu had been subsumed; it didn't make much sense for each type of crawl error to have its own second-order menu. My eyes still aren't finding the right places in the menus. No news-specific errors, anyway, so the team's still working away at it. :)
Okay, I'm seeing 'no sentences' errors for a lot of pages. The content comes from a feed of text that is usually 60-70 characters wide, and it often wants that formatting preserved (that is, the line breaks should appear where the source puts them) rather than following what we'd acknowledge is standard practice for HTML: letting the browsers, which are much smarter than all of us, do the wrapping themselves.
Sooo.. my suspicion is that our attempt to control the line breaks is what makes the news crawler quite unhappy. Is there a more news-crawl-friendly way to present those breaks without upsetting it? Or do we pretty much have to consider living without source-controlled line breaks?
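One option we could try, sketched below under some assumptions: treat blank lines in the feed as paragraph boundaries, wrap each block in a single `<p>`, and either join the hard-wrapped lines with spaces or keep the newlines inside the paragraph so that a stylesheet rule such as `white-space: pre-line` can still show the source's breaks. The function name `reflow` and the `keep_breaks` flag are hypothetical, and whether the news crawler is happier with either form is untested.

```python
import html
import re


def reflow(feed_text: str, keep_breaks: bool = False) -> str:
    """Turn hard-wrapped feed text into <p> paragraphs.

    Blank lines separate paragraphs. Within a paragraph, lines are joined
    with a space, or with a newline when keep_breaks is True.
    """
    joiner = "\n" if keep_breaks else " "
    paragraphs = []
    for block in re.split(r"\n\s*\n", feed_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append("<p>" + html.escape(joiner.join(lines)) + "</p>")
    return "\n".join(paragraphs)
```

Either way the markup no longer carries a break tag per source line, so each story reaches the crawler as a handful of multi-sentence paragraphs; the visual line breaks, if we keep them, become a presentation concern only.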