Investors in United Airlines saw their stock value drop sharply when Google News published a Chicago Tribune / Sun Sentinel story announcing the company was filing for bankruptcy. The story was displayed by Google News as a new story even though it was actually published in 2002.
The Tribune is frantically trying to shift the blame onto Google and has issued a press release, which has been picked up by several news websites such as the LA Times and PC World. Google has published a blog post saying that it did nothing wrong:
On Saturday, September 6th at 10:36 PM Pacific Daylight Time (or Sunday, September 7th at 1:36 AM Eastern Daylight Time), the Google crawler detected a new link on the Florida Sun-Sentinel’s website in a section of the most viewed stories labeled “Popular Stories: Business.” The link had newly appeared in that section since the last time Google News’ Googlebot webcrawler had visited the page (nineteen minutes earlier), so the crawler followed the link and found an article titled “UAL Files for Bankruptcy.” The article failed to include a standard newspaper article dateline, but the Sun-Sentinel page had a fresh date above the article on the top of the page of “September 7, 2008” (Eastern).
Because the Sun-Sentinel included a link to the story in its “Popular Stories” section, and provided a date on the article page of September 7, 2008, the Google News algorithm indexed it as a new story. We removed this story as soon as we were notified that it was posted in error.
The Tribune has this to say:
Our records show that due to traffic volume, sometime between 1:00:34 a.m. EDT, Sunday, September 7 (10:00:34 p.m. PDT, Saturday, September 6) and 1:36:03 a.m. EDT, Sunday, September 7 (10:36:03 p.m. PDT, Saturday, September 6), a link to the old article appeared in a dynamic portion of the Sun Sentinel’s business section, grouped with other stories under a tab entitled “Popular Stories Business: Most Viewed.” No new story was published and the old story was not re-published — a link to the old story was merely provided.
Importantly, the URL for the old story did not change when the link appeared on the website’s business section.
At 1:36:57 a.m. EDT, September 7, (10:36:57 p.m. PDT, September 6), our records show that the Google search agent — known as “Googlebot” — crawled the story on Sun Sentinel’s website. Our records also show that the Google search agent had previously crawled this same story numerous times, including as recently as last week. Shortly after Googlebot crawled the Sun Sentinel site this time, however, a link to the story appeared on Google News, with a date of Sept. 6, 2008, provided by Google. At 1:39:59 a.m. EDT, September 7 (10:39:59 p.m. PDT, September 6), our records show the story on the Sun Sentinel website received its first referral from Google News.
What actually happened
The Tribune has stated that the URL of the story didn’t change, but Google says the URL was brand new and had not been crawled before. So who is right? Looking at the search results below, I could find the exact same article listed at 8 separate URLs.
The link posted to the “Popular Stories” section pointed to this article (now removed), which had a different URL from all the other copies indexed by Google, so Google thought it was a brand new article. In Google’s world, if something has a brand new URL then it’s a brand new page. Perhaps the URL had existed for 6 years without being found by Google. If Google had already indexed the story at a number of very similar URLs, it may well have decided not to crawl the extra one. When that URL was suddenly linked from thousands of other pages on the site, Google probably decided it was important enough to crawl and index.
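To make the “new URL means new page” point concrete, here is a minimal sketch of the difference between deciding newness by URL and deciding it by content. This is not how Google News works internally (the source doesn’t say); the ToyIndex class, the fingerprint helper and the example.com URLs are all hypothetical names made up for illustration.

```python
import hashlib


def fingerprint(body: str) -> str:
    """Hash the article text after lowercasing and collapsing whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ToyIndex:
    """Hypothetical index that can judge 'newness' by URL or by content."""

    def __init__(self) -> None:
        self.seen_urls: set[str] = set()
        self.seen_fingerprints: set[str] = set()

    def is_new_by_url(self, url: str) -> bool:
        # A URL never seen before counts as a new page, whatever it contains.
        is_new = url not in self.seen_urls
        self.seen_urls.add(url)
        return is_new

    def is_new_by_content(self, body: str) -> bool:
        # Identical text counts as a duplicate, whatever URL it lives at.
        fp = fingerprint(body)
        is_new = fp not in self.seen_fingerprints
        self.seen_fingerprints.add(fp)
        return is_new


if __name__ == "__main__":
    index = ToyIndex()
    body = "UAL Files for Bankruptcy (placeholder article text)"
    # Hypothetical URLs standing in for two copies of the same story.
    for url in ("http://example.com/archive/ual-story",
                "http://example.com/business/ual-story"):
        print(url, index.is_new_by_url(url), index.is_new_by_content(body))
```

In this toy version the second URL is “new” by URL but a duplicate by content, which is roughly the gap the old story appears to have slipped through.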
Why was this story listed in the popular stories section?
The big mystery that nobody has picked up on is how a 6 year old story came to be listed in the “Popular Stories” section in the first place. Assuming it wasn’t a software glitch, my guess is that either there was a sudden increase in search volume for a particular keyword or the story somehow started to gain traffic from social media.
If I were to heavily promote an old story (one that didn’t display its publication date prominently) using Digg and StumbleUpon, it could quite easily attract enough traffic to push it into the “Popular Stories” section of a major website.
According to the WSJ, just one visitor was enough to push the story into the Popular Stories section:
In its latest explanation, Tribune said a single visit during a low-traffic period early Sunday morning pushed the undated story onto the list of most popular business news of its South Florida Sun-Sentinel newspaper’s Web site.
About 30 minutes after that visit, a user viewing a story about airline-cancellation policies during a storm-ravaged weekend clicked on the link for the old story.
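How can a single visit be enough? The source doesn’t describe how the Sun Sentinel’s list is actually computed, so the following is only a sketch under my own assumptions: a “most viewed” list built from a short time window with no minimum view count. The most_viewed function, its parameters and the example paths are hypothetical.

```python
from collections import Counter


def most_viewed(visits, now, window_seconds=3600, top_n=5, min_views=1):
    """Rank URLs by views inside the recent window.

    visits is a list of (timestamp_seconds, url) pairs. With min_views=1,
    any page with a single recent view can appear when traffic is low.
    """
    counts = Counter(
        url for ts, url in visits if now - window_seconds < ts <= now
    )
    return [url for url, n in counts.most_common(top_n) if n >= min_views]


if __name__ == "__main__":
    # A quiet early-morning hour: three pages, one view each.
    quiet_hour = [(100, "/business/a"), (200, "/business/b"),
                  (300, "/business/old-2002-story")]
    print(most_viewed(quiet_hour, now=400))               # old story is listed
    print(most_viewed(quiet_hour, now=400, min_views=2))  # nothing qualifies
```

Under these assumptions, a minimum view threshold would have kept a one-off visit to an old article off the list during a quiet period.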
Did the Tribune ask Google to stop crawling them?
According to newspaper reports, the Tribune claims that it asked Google to stop crawling its sites. Google denies this, so who is right?
Tribune said it asked Google “months ago” to stop using Googlebot to crawl its Web sites after it identified problems with the program. But Google denied such a request was ever made.
“The claim that the Tribune Company asked Google to stop crawling its newspaper Web sites is untrue,” it said.
The robots.txt file is the method every other website uses to block or control Google’s spider. Looking at the Tribune’s robots.txt, we have to conclude that the Tribune didn’t ask Google to stop crawling its sites. Perhaps someone sent an email or made a phone call, but clearly whatever they did wasn’t effective.
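For anyone who wants to run the same kind of check, here is a small sketch using Python’s standard urllib.robotparser module. The two robots.txt bodies and the example.com URLs are made-up examples, not the Tribune’s actual file, and googlebot_allowed is a helper name I’ve invented.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets Googlebot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


# Hypothetical file that really does ask Googlebot to stay away.
BLOCKING = """\
User-agent: Googlebot
Disallow: /
"""

# Hypothetical file that only fences off one directory for all crawlers.
PERMISSIVE = """\
User-agent: *
Disallow: /search/
"""

if __name__ == "__main__":
    story = "http://example.com/business/story.html"
    print(googlebot_allowed(BLOCKING, story))
    print(googlebot_allowed(PERMISSIVE, story))
```

A site that genuinely wanted Googlebot gone would be expected to serve something like the first file; a file like the second leaves ordinary article pages open to crawling.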
Could the Tribune face legal action?
If my analysis is correct, and the Tribune did publish the story at multiple URLs without a correct datestamp, then my belief is that it has been negligent. In the era of social media, where 100,000 people can be directed to an article within hours, publishing without a valid time and date at the top of the article is highly irresponsible.
The fact that the CMS has been designed without considering the potential implications of multiple URLs, and how Google News might handle them, is also highly irresponsible for a major media organisation.
Unless news sites realise the importance of these issues we will see this type of incident happen again and again.