Analysing the Google Panda / Content Farmer update

March 5, 2011

Patrick Altoft, Director of Strategy

Last week Google announced a major update which affects around 11.8% of all search queries on Google.com. The update was intended to reduce rankings for low-quality sites commonly known as content farms.

The update has yet to hit the UK; however, as most of the larger content farm type sites are US based, it's probably not going to have as big an impact over here.

Whilst this update is affecting 11.8% of all queries, it's not actually affecting very many sites. The algorithm is being applied at a domain level rather than a specific page level, and the sites affected are so large that even a few hundred sites suffering could cause 11.8% of all queries to be altered.

I will come to our conclusions about how this algorithm works later on, but first we need to understand why Google would need an extra algorithm to combat these types of sites. Traditionally Google would use a combination of link data and on-site optimisation to judge rankings, but for content farms this is impossible for a number of reasons.

Firstly, these sites have been around for years: think of article directories like ezinearticles.com and content sites such as eHow.com, which have been around for perhaps 10 years or so. Secondly, they have huge volumes of links: ehow.com has 83,000 linking domains, ezinearticles.com has 206,000 and articlesbase.com has 120,000. Compare this to the BBC with around 500,000 linking domains and the Daily Mail with 128,000, and you can see that Google would really struggle not to trust these sites based on link data.

Finally, these sites are super-optimised for the long tail of search. They have thousands of users uploading keyword-rich content every day and were previously doing a very good job of attracting large amounts of traffic. Ezinearticles admitted they could lose half of their 57 million monthly unique visitors with this update, and articlesbase.com was recently featured on TechCrunch as a site with 20 million monthly visitors.
[Image: Panda update]

How the algorithm works

There has been a lot of talk about who has been affected and how the algorithm works, but from what I've read there are a lot of misconceptions and theories that just don't stack up for me.

In an interview with Wired, Matt Cutts and Amit Singhal go into more detail about the algorithm and make some key comments:

Wired.com: How do you recognize a shallow-content site? Do you have to wind up defining low quality content?

Singhal: That’s a very, very hard problem that we haven’t solved, and it’s an ongoing evolution how to solve that problem. We wanted to keep it strictly scientific, so we used our standard evaluation system that we’ve developed, where we basically sent out documents to outside testers. Then we asked the raters questions like: “Would you be comfortable giving this site your credit card? Would you be comfortable giving medicine prescribed by this site to your kids?”

Cutts: There was an engineer who came up with a rigorous set of questions, everything from, "Do you consider this site to be authoritative? Would it be okay if this was in a magazine? Does this site have excessive ads?" Questions along those lines.

Singhal: And based on that, we basically formed some definition of what could be considered low quality. In addition, we launched the Chrome Site Blocker [allowing users to specify sites they wanted blocked from their search results] earlier, and we didn't use that data in this change. However, we compared and it was 84 percent overlap [between sites blocked by the Chrome blocker and sites downgraded by the update]. So that said that we were in the right direction.

Wired.com: But how do you implement that algorithmically?

Cutts: I think you look for signals that recreate that same intuition, that same experience that you have as an engineer and that users have. Whenever we look at the most blocked sites, it did match our intuition and experience, but the key is, you also have your experience of the sorts of sites that are going to be adding value for users versus not adding value for users. And we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons …

The theory put forward by Tom Critchlow is that by making a site look nicer, moving ads below the fold and generally making it appear more trusted, content farms can optimise their way out of this algorithm. This seems a bit too easy to me: Google is full of clever people, and the idea behind this algorithm is to stop low-quality content from attracting traffic in the large volumes it currently does. If this algorithm is to work, then the only way the sites can get traffic back is to rewrite large volumes of content.

Google has stated that they are using quality raters to classify low-quality sites, so it's easy to jump to the conclusion that it's the human reviewers who are deciding the results.

My understanding of this algorithm is that the human data is being used in just the same way as the recent Chrome website blocker data: purely as a tool for machine learning and testing of the algorithm. The breakthrough that Singhal talks about from a Googler named Panda is undoubtedly related to the paper he published recently detailing how they could judge quality based on click-through rate (CTR) and bounce rate on ads.

My theory is that Google has used human reviewers (and now the Chrome data) to build some kind of machine learning algorithm which looks at the characteristics of all the websites labelled as low quality, figures out commonalities, and creates an algorithm based around factors such as CTR, bounce rate and perhaps CTR on ads as well.
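To make that theory concrete, here is a minimal sketch, assuming Python and scikit-learn, of the kind of pipeline I'm describing: the human ratings (and Chrome block data) serve purely as training labels, and a simple classifier learns to separate sites on engagement-style features. The feature values, labels and model choice here are entirely my own illustrative assumptions, not anything Google has confirmed.

```python
# Hypothetical illustration only: human ratings / Chrome-block data act as
# labels, engagement-style signals (CTR, bounce rate, ad CTR) act as features.
from sklearn.linear_model import LogisticRegression

# Each row represents a domain: [search CTR, bounce rate, CTR on ads] -- assumed features.
features = [
    [0.32, 0.35, 0.01],   # trusted reference site (e.g. a news site)
    [0.28, 0.40, 0.02],   # another site raters were happy with
    [0.12, 0.85, 0.20],   # site flagged as low quality by human raters
    [0.10, 0.90, 0.25],   # site blocked by many Chrome users
]
labels = [0, 0, 1, 1]     # 0 = acceptable, 1 = low quality (from the human data)

# Train a simple classifier on the human-labelled examples...
classifier = LogisticRegression()
classifier.fit(features, labels)

# ...then apply it to any domain in the index, with no human looking at that site.
new_site = [[0.11, 0.88, 0.22]]
print(classifier.predict(new_site))        # -> [1], i.e. treated as low quality
print(classifier.predict_proba(new_site))  # confidence scores for each class
```

The point is that once the classifier is trained, the human reviews only ever teach the machine what the commonalities are; the scoring itself is fully algorithmic.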

Another factor Google is likely to be looking at is bounce rate, compared not only with other sites ranking for the same query but also with how good a match the page was for the query, alongside time on page. If you land on ezinearticles and the page covers exactly what you were looking for, then if the article is good you will probably read it all; if not, you will bounce straight back to Google or click an ad. An article that is a really good match for the query should have good engagement metrics; if it doesn't, that's a sign of low quality.
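As a rough sketch of what such a query-relative signal might look like (the function, field names, weighting and example numbers below are my guesses, purely for illustration):

```python
# Hypothetical sketch of a query-relative engagement signal.
# A page that matches the query well but still bounces badly looks worse than
# a page that bounces because it was only a loose match in the first place.

def engagement_score(bounce_rate, avg_bounce_for_query, time_on_page, relevance):
    """Compare a page's bounce rate with the average for that query, scaled by
    how well the page matched the query (relevance in [0, 1], assumed to come
    from a separate relevance score)."""
    relative_bounce = bounce_rate - avg_bounce_for_query
    # A well-matched page is expected to hold the visitor, so its excess
    # bounce rate is penalised more heavily than a loosely matched page's.
    penalty = relative_bounce * relevance
    reward = (time_on_page / 60.0) * relevance  # minutes on page, scaled by match
    return reward - penalty

# Well-matched article that people actually read: healthy score.
print(engagement_score(0.40, 0.55, 180, relevance=0.9))
# Well-matched article that sends people straight back to Google: poor score.
print(engagement_score(0.90, 0.55, 10, relevance=0.9))
```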

This algorithm can then be tested time and time again in conjunction with the data gained from human reviews and the Chrome plugin.

I would guess that sites with low-quality content are likely to have the same CTR, read-rate and bounce-rate characteristics whether they have good design or not, so it's going to be very hard for sites to design their way out of this.

Readability

My theory on why ehow.com has not been hit is simple: their content is a lot more readable. Look at the readability of this article compared to this one and this one.

Factors such as whitespace, narrower column widths, images and sub-headings all contribute to increasing the number of people who read through to the end of an article. I always read to the end of an eHow article no matter how good it is, but I've never read to the end of something on ezinearticles.
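If you wanted a rough proxy for those on-page factors, a crude sketch might look like the following (the choice of features, tags and sample HTML are my own assumptions; in my theory these would only ever be proxies for the read-through behaviour that the engagement data actually measures):

```python
# Rough, hypothetical proxies for the readability factors mentioned above:
# sub-headings, images and short paragraphs broken up by whitespace.
from bs4 import BeautifulSoup

def readability_features(html):
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text() for p in soup.find_all("p")]
    words_per_paragraph = (
        sum(len(p.split()) for p in paragraphs) / len(paragraphs) if paragraphs else 0
    )
    return {
        "subheadings": len(soup.find_all(["h2", "h3"])),
        "images": len(soup.find_all("img")),
        "avg_words_per_paragraph": words_per_paragraph,
    }

article = """
<h2>Step 1</h2><p>Short, scannable paragraph.</p><img src="step1.jpg">
<h2>Step 2</h2><p>Another short paragraph with an image below.</p><img src="step2.jpg">
"""
print(readability_features(article))
# {'subheadings': 2, 'images': 2, 'avg_words_per_paragraph': 5.0}
```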

In summary, there is a chance that if you can significantly improve the number of people reading to the end of an article (across a large percentage of your pages) then you could lessen the effects of this algorithm. This sounds similar to Tom's theory, but I think his point was that making the site more trusted would make it appeal more to human reviewers, while I think the reviewers are just used for building and testing the machine learning part of the new "readability" algorithm.
