Google’s duplicate content filter is broken

  • February 25, 2008

Over the last few weeks, Blogstorm has been fighting a battle with scraper sites and the Google duplicate content filter. For some queries, the battle has been well and truly lost.

Try this query to see an example for one of my most popular posts.

Most articles continue to rank highly on Google but a number of them have been filtered from the search results because Google thinks they are duplicates of other pages. Most of the other pages are things like scraper sites or Digg/Sphinn stories – all of which link back to the original post. The articles that have been filtered used to rank very well and they only went missing in the last few weeks.

The duplicate content filter seems to have become a lot more reliant on trust recently, and perhaps I am seeing a side effect of this. I would hope that Blogstorm, with over 100,000 natural links and good on-page optimisation, might be a trusted site too, but it clearly isn’t there yet.

One theory I have is that the duplicate content filter struggles to identify the original source of content whose URL has changed. I changed the URL structure on Blogstorm in January, and since then Google has decided that a lot of articles are suddenly duplicate content. One example is the “What do people take a picture of first” post, which used to rank 4th for the term “jpg” and is now classed as duplicate content and filtered from the results when you search for the exact terms in the title tag.


I invented this title, so for my article not to be on the front page is frustrating to say the least.

Thanks to Tim at web development leeds for tipping me off about this.

If you are wondering how I know that the duplicate content filter is to blame, take a look at the screenshot below. The original article has been filtered from a search for “Top 10 worst websites you’ll wish you hadn’t seen”, but other pages from Blogstorm that reference the article are still showing up, which proves the domain has the authority and relevance to rank. This is happening across a load of queries.


I only found these issues because I was testing some new software that lets you do bulk rank checking based on page titles, and I ran it on Blogstorm.

Expecting that most articles would rank first for their own unique titles, I was pretty surprised to see that quite a lot didn’t even rank on the first page! I suspect this blog isn’t unique and that millions of pages are being filtered incorrectly without the site owners ever realising.
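
The check described above can be sketched in a few lines of Python. This is only an illustration, not the software mentioned in the post: the function name, the example.com URLs and the result list are all hypothetical, and it assumes you already have the ordered result URLs for a title search from whatever rank checker you use. A title search where the post itself is missing but another page from the same domain appears is flagged as a possible duplicate content filtering case, since the domain clearly can rank for the query.

```python
from urllib.parse import urlparse


def _norm(url):
    # Treat URLs that differ only by a trailing slash as the same page.
    return url.rstrip("/")


def classify_title_search(post_url, result_urls, first_page=10):
    """Classify the results of searching for a post's exact title.

    Returns one of:
      "ranked"            - the post itself is on the first page
      "possibly_filtered" - the post is missing, but another page from
                            the same domain is on the first page
      "not_ranking"       - nothing from the domain is on the first page
    """
    top = [_norm(u) for u in result_urls[:first_page]]
    if _norm(post_url) in top:
        return "ranked"
    domain = urlparse(post_url).netloc
    if any(urlparse(u).netloc == domain for u in top):
        return "possibly_filtered"
    return "not_ranking"


# Hypothetical data: the original post is absent, but a social news story,
# another page on the same blog, and a scraper copy are all present.
example_results = [
    "https://digg.example/story/1",
    "https://example.com/roundup/",
    "https://scraper.example/copy",
]
status = classify_title_search("https://example.com/worst-websites/", example_results)
# status == "possibly_filtered"
```

A "possibly_filtered" result is only a hint. It cannot distinguish a duplicate content filter from any other reason a page might drop out, so each flagged title still needs checking by hand.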

If you have an example of this for your site then post a link in the comments and hopefully Google might take a closer look at how the duplicate content algorithm is working. Certainly a site like Digg shouldn’t be outranking the story it links to when the original story is on a trusted blog.

Patrick Altoft

About Patrick Altoft

Patrick is the Director of Strategy at Branded3 and has spent the last 11 years working on the SEO strategies of some of the UK's largest brands. Patrick’s SEO knowledge and experience is highly regarded by many, and he’s regularly invited to speak at the world’s biggest search conferences and events.

  • Syam

    I am no expert, but I guess there are enough crawlers that keep linking to popular articles from Reddit, Digg and Sphinn, effectively making those pages rank best. I believe Google should figure out for themselves that the hub should not be considered the destination.

  • Chetan

    Hope you don’t go into a phase like John Chow is doing. Content scraping was one of the reasons he got penalised in Google.
    I actually just hate the content scraping sites, the ones that copy every piece of content from my blog and just link back to my original post.
    In my mind, these kinds of sites hurt our quality by carrying all the copied matter, and by linking back they make it look like we have low-quality backlinks from unnatural pages.

  • Sucker

    I did a lot of URL structure changes on one of my sites late in 2007 and had this same problem. A few pages were fixed with 301 redirects but Google is disregarding others (or devaluing them a little.)

  • master_rooter

    I get frustrated too because a well-known scraper site is ranking better than I do (I produce the content, but it is copied by someone else). What will Google (which is a quality content driven search engine) do? Nothing.

  • Patrick Altoft

    Some of the queries appear to be returning better results today, the duplicate content filter seems to jump around a bit.

  • Sean

    Hi Patrick,
    Let me know what you think of this:
    I have a website that ranked for very good keywords, and then one day I messed up the robots.txt file: I forgot to leave an empty line after the last line (a Disallow: line). I read somewhere that this is equivalent to telling search engines that they cannot crawl any of my pages. Around 3 weeks after making that mistake, rankings dropped and traffic fell by 80%! After some checking I fixed the robots.txt file, but the rankings never returned. I did further checking and found that pages on my site had been classified as duplicates. I wrote those articles and now I have been slapped with the filter.

    How do I know I have been filtered? I took a few long sentences from my page and searched for them on Google in quotes. My page doesn’t show up; I have to scroll to the last page and click on “repeat the search with the omitted results included” for it to appear. That’s how I know I have been filtered. I had to go through the pages on my site to find those that had been filtered and rewrite them. After that, the rankings came back!

    It’s really worrying to see that an authority site can get slapped and mistaken for the copycat. I hope G only made that mistake because I messed up something on my site, like the robots.txt file. In your case, you changed the URL structure and I guess that confused G.

    Conclusion is, don’t mess up something important on your site. If you do, those copy cats will get your site penalized!

    A lot of people say that Duplicate Content Penalty doesn’t exist but doesn’t this look like a penalty?

    • Patrick Altoft

      I’m not sure Sean. There are too many different possibilities to make a firm diagnosis.

  • Nick

    Hi Patrick
    You can use the canonical tag to tell Google the original source of your post. Your posts were de-ranked because you changed your URL structure and had no link juice flowing to your new URLs; that may have been the reason. If you had used redirection or a canonical tag when changing your URL structure, you would not have faced de-ranking.
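
For readers who want to try the redirection and canonical tag suggestions made in the comments, here is a minimal Python sketch. It is an assumption-laden illustration and not how Blogstorm was actually migrated: the paths are invented, and `build_redirects` and `canonical_tag` are hypothetical helper names. The first function flattens an old-to-new URL map so that every old URL can be sent to its final destination with a single 301 redirect, and the second renders a `rel="canonical"` link element for a page's preferred URL.

```python
import html


def build_redirects(old_to_new):
    """Return {old_path: final_path}, following chains so that each old
    URL reaches its final destination in a single hop.

    Raises ValueError if the map contains a redirect loop.
    """
    resolved = {}
    for old, target in old_to_new.items():
        seen = {old}
        # Keep following the map while the target is itself redirected.
        while target in old_to_new:
            if target in seen:
                raise ValueError(f"redirect loop at {target}")
            seen.add(target)
            target = old_to_new[target]
        resolved[old] = target
    return resolved


def canonical_tag(url):
    """Render a rel="canonical" link element for the preferred URL."""
    return f'<link rel="canonical" href="{html.escape(url, quote=True)}">'


# Hypothetical paths from two successive URL structure changes.
redirects = build_redirects({
    "/2007/12/worst-websites.html": "/blog/worst-websites/",
    "/blog/worst-websites/": "/worst-websites/",
})
# Both old paths now map straight to "/worst-websites/".
```

How the resulting map is turned into real 301 responses depends on the web server or blog platform, which is outside the scope of this sketch.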
