Learning from Google Webmaster Tools Caffeine data
Some of you may have noticed that your link counts in Google Webmaster Tools have increased significantly recently; we are seeing numbers up to 1,000 times higher than previous figures.
The reason for this is very interesting: following the roll-out of the new Caffeine infrastructure, Google is able to spider sites far deeper than before, and it now reports this increased activity in Webmaster Tools. You can see this by looking at your internal links – if you have 50,000 internal links to your homepage, and a typical page in your template links to the homepage around five times (logo, navigation, footer and so on), then it’s a fair assumption you have roughly 10,000 pages on your site. This number is probably a lot higher than it was last month.
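The back-of-the-envelope estimate above can be sketched in a few lines. The links-per-page figure is an assumption about your template, not something Google reports, so adjust it to match your own site:

```python
# Rough sketch of the internal-link heuristic: estimate page count from
# the number of internal links pointing at the homepage. The value of
# links_per_page is a hypothetical assumption about the site template.
links_to_homepage = 50_000
links_per_page = 5  # assumed homepage links per page (logo, nav, footer)

estimated_pages = links_to_homepage // links_per_page
print(estimated_pages)  # → 10000
```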
When you take into account the increase in internal links being reported, it’s quite clear why Google is now reporting many more external links too, especially once you consider sitewide links.
According to Google, these numbers are here to stay:
Yes, we revamped the data behind the backlinks feature in Webmaster Tools — it has started using more data from “Caffeine” for some sites and is planned to continue with a bit more data in the next week or so. The goal is to have more fresher & up-to-date data there :-).
Another very interesting piece of data we are seeing is that although Google is reporting a lot more pages within Webmaster Tools, the number of pages returned by the site: query on Google has dropped for quite a few sites we monitor. This drop appears to be the result of Google being better able to determine which pages are worth displaying in the index.
Our observations can be summed up as follows:
- Google is spidering big sites a lot more than before – both in terms of volume of pages and frequency
- Google is much better at deciding which pages are worth showing in the index
- A lot of pages that are spidered are not being indexed if they are low value (see the Mayday update)
- Pages that previously were indexed but not ranking are now dropped from the index (but are still being spidered)
One of the main impacts this has had on SEO is that the traditional way of auditing websites is now largely invalid. In the past, people would run a site: query on Google to look through all the indexed pages and find errors and problems from there. This is no longer reliable, because Google is filtering many more low-value pages out of the index before they ever appear in those results.
We downloaded all our clients’ internal links (Google lets you export up to around 100MB) and found, using Excel, quite a few rogue pages that were being spidered and counted as internal links but not indexed – these pages were wasting PageRank and diluting the impact of the good pages. We would never have spotted these pages by looking at indexed pages alone.
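The same comparison can be done outside Excel. A minimal sketch, assuming you have the internal-link export as a list of target URLs and a separate list of URLs you know are indexed (the file contents and example.com URLs below are purely illustrative):

```python
# Hypothetical sketch: diff the URLs receiving internal links (from a
# Webmaster Tools export) against URLs known to be indexed, to surface
# "rogue" pages that are spidered and linked but never indexed.

def find_rogue_pages(internal_link_targets, indexed_urls):
    """Return URLs that receive internal links but are not indexed."""
    return sorted(set(internal_link_targets) - set(indexed_urls))

# Illustrative data standing in for the real exports.
internal_link_targets = [
    "https://example.com/",
    "https://example.com/products",
    "https://example.com/products?sort=price",   # duplicate-content URL
    "https://example.com/search?q=widgets",      # internal search result
]
indexed_urls = [
    "https://example.com/",
    "https://example.com/products",
]

for url in find_rogue_pages(internal_link_targets, indexed_urls):
    print(url)
```

The rogue URLs this turns up (parameterised duplicates, internal search results and the like) are exactly the pages a site:-query audit would miss, since they never make it into the index in the first place.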