Google’s duplicate content filter is broken

  • February 25, 2008
Patrick Altoft

Director of Strategy

Over the last few weeks Blogstorm has been fighting a battle with scraper sites and the Google duplicate content filter. For some queries, the battle has been well and truly lost.

Try this query to see an example for one of my most popular posts.

Most articles continue to rank highly on Google but a number of them have been filtered from the search results because Google thinks they are duplicates of other pages. Most of the other pages are things like scraper sites or Digg/Sphinn stories – all of which link back to the original post. The articles that have been filtered used to rank very well and they only went missing in the last few weeks.

The duplicate content filter seems to have become a lot more reliant on trust recently and perhaps I am seeing a side effect of this. I would hope that Blogstorm, with over 100,000 natural links and good on-page optimisation, might be a trusted site too but it clearly isn’t there yet.

One theory I have is that the duplicate content filter struggles to identify the original source of content whose URLs have changed. I changed the URL structure on Blogstorm in January, and since then Google has decided that a lot of articles are suddenly duplicate content. One example is the “What do people take a picture of first” post, which used to rank 4th for the term “jpg” and is now classed as duplicate content and filtered from the results when you search for the exact terms in the title tag.


I invented this title, so for my article not to be on the front page is frustrating to say the least.

If you are wondering how I know that the duplicate content filter is to blame, take a look at the screenshot below. The original article has been filtered from a search for “Top 10 worst websites you’ll wish you hadn’t seen”, but other pages from Blogstorm that reference the article are still showing up, which proves the domain has the authority and relevance to rank. The same thing is happening across a load of queries.


I only found these issues because I was testing some new software that does bulk rank checking based on page titles, and I ran it on Blogstorm.

Expecting that most articles would rank first for their own unique titles, I was pretty surprised to see that quite a lot didn’t even rank on the first page! I suspect this blog isn’t unique and that millions of pages are being filtered incorrectly without the site owners ever realising.
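If you want to run a similar check on your own site, the logic is simple enough to sketch. This is not the software mentioned above, just a minimal Python illustration of the idea. It assumes you already have, for each post title, an ordered list of the result URLs returned for that title; how you collect those results is up to you and is not shown. The function name `find_filtered_posts` and both input dictionaries are hypothetical.

```python
from urllib.parse import urlparse


def find_filtered_posts(posts, results_by_title, top_n=10):
    """Return posts whose own URL is missing from the top_n results for their title.

    posts: dict mapping post title -> that post's URL.
    results_by_title: dict mapping post title -> ordered list of result URLs
        (hypothetical input; gathering the rankings is outside this sketch).
    """
    report = []
    for title, post_url in posts.items():
        top = results_by_title.get(title, [])[:top_n]
        if post_url in top:
            continue  # the post ranks for its own title, nothing to flag

        domain = urlparse(post_url).netloc
        # Other pages from the same domain that still rank hint that the
        # domain is fine and only this one URL has dropped out.
        same_domain_hits = [u for u in top if urlparse(u).netloc == domain]
        report.append(
            {"title": title, "url": post_url, "same_domain_hits": same_domain_hits}
        )
    return report
```

A post that comes back with a non-empty `same_domain_hits` list matches the pattern described above: the original is missing while other pages from the same site still show up for its title. Note that the comparison is an exact string match on URLs, so trailing slashes or a changed URL structure need normalising first.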

If you have an example of this for your site, post a link in the comments and hopefully Google will take a closer look at how the duplicate content algorithm is working. Certainly a site like Digg shouldn’t be outranking the story it links to when the original story is on a trusted blog.
