The rights and wrongs of Canonical URLs

Wikipedia: “URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs may be equivalent.”

Canonical URLs have had a lot of press during the last year since Google started ‘supporting’ them in order to combat duplicate content issues – amongst other things. I’m going to look at how canonical URLs should be used, and what the consequences can be if they are used incorrectly.

It’s amazing how often canonical URLs are still misused, or left unused where they could give a real benefit. Matt Cutts recently wrote a blog article covering the issues around canonical URLs. It’s well worth taking the time to read through the article and watch the YouTube videos, as they give an insight into how Google handles them. I’ve also included an overview of how the top 3 search engines handle the canonical URL issue at the bottom of this page.

The point:

Canonical URLs tell search engines where the definitive/master/correct version of the content on this page exists.

What do they look like, and where do I put them?

Canonical URLs are for search engines, not for users. Users should not be hindered when accessing a page through various URLs; offering several routes to the same page is a core marketing tactic, letting you funnel users in whichever way best helps them end up at your product.

A canonical URL is declared with a <link> tag whose rel attribute is set to "canonical" and whose href attribute holds the preferred URL.

This <link> tag sits inside the <head> tag of your document.
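
If you want to check which canonical URL a page actually declares, something along these lines should do as a minimal sketch. It uses only Python's standard html.parser module; the class name CanonicalFinder, the helper find_canonical and the sample markup are my own illustrations, not part of any search engine's tooling.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head and self.canonical is None:
            attr = dict(attrs)
            # rel may hold several space-separated tokens, so split before comparing
            rel_tokens = (attr.get("rel") or "").lower().split()
            if "canonical" in rel_tokens:
                self.canonical = attr.get("href")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_canonical(html):
    """Return the canonical URL declared in the markup, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page markup, for illustration only.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Example product</title>
    <link rel="canonical" href="http://www.example.com/product.php?item=coca-cola" />
  </head>
  <body><p>Product details</p></body>
</html>
"""

print(find_canonical(SAMPLE_PAGE))
```

A <link> placed in the body is deliberately ignored here, since the tag is only meant to appear in the head.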

Duplicate content issues:

Websites can have duplicate content issues for a variety of reasons. Take, for example, any standard WordPress installation and, in fact, this very site and this very page.

There are many ways to find this page, but there are also many URLs for it. http://www.branded3.com/b3labs/the-rights-and-wrongs-of-canonical-urls/
and http://www.branded3.com/b3labs/?p=3249 both resolve to this page. Notice also that WordPress doesn’t 301 redirect you to the main URL.

So, if you split all the links to this page in half, and half went to one URL, and half went to the other, what exactly is Google supposed to do?

How do search engines cope?

Short answer – they don’t. How is a search engine supposed to identify which page is the ‘correct’ or ‘master’ page? Without being told what to do, it can’t. Instead, it sees two pages. The issue here is that those two pages also have exactly the same content. This means that you might get penalised by the search engine for having multiple pages with the same content.
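
To spot this problem on your own site before a search engine does, you could group your URLs by the content they return. The following is only a rough sketch: it assumes you have already fetched each page's body text into a dictionary, the function name group_duplicates is hypothetical, and it says nothing about how a search engine actually detects duplicates.

```python
import hashlib
from collections import defaultdict


def group_duplicates(pages):
    """pages maps URL -> body text. Returns lists of URLs that share identical content."""
    by_digest = defaultdict(list)
    for url, body in pages.items():
        # Collapse whitespace so trivial formatting differences do not hide a duplicate
        normalised = " ".join(body.split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    # Only groups with more than one URL are duplicate-content candidates
    return [urls for urls in by_digest.values() if len(urls) > 1]


# Placeholder body text, for illustration only.
pages = {
    "http://www.branded3.com/b3labs/the-rights-and-wrongs-of-canonical-urls/": "Same article text",
    "http://www.branded3.com/b3labs/?p=3249": "Same  article text\n",
    "http://www.branded3.com/b3labs/": "A different page",
}

for group in group_duplicates(pages):
    print(group)
```

Each group it prints is a set of URLs that would benefit from agreeing on a single canonical URL.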

Another example often given is that of an online shop. When viewing a product page, you might get to it directly, via a category, a sub category or a search. From these various routes, you might pick up query strings in the URL.

Eg. http://www.example.com/product.php?item=coca-cola
or http://www.example.com/product.php?item=coca-cola&category=drinks
or http://www.example.com/product.php?item=coca-cola&category=drinks&subcategory=fizzy.

As well as being fairly poor URL architecture in general, this further highlights the issue of duplicate content.

How to fix this with canonical tags:

As the opening paragraph states, this is a way of standardising the URL. We need to decide which URL to treat as the ‘master’ and whenever the content is displayed, display this URL.

Taking our first example of a WordPress site, adding a canonical URL of http://www.branded3.com/b3labs/canonical-urls-rights-and-wrongs to both pages tells a search engine that the definitive version of the page content is at http://www.branded3.com/b3labs/canonical-urls-rights-and-wrongs. Therefore, no penalty.
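
For the online shop example earlier, the same idea might be sketched as follows. I'm assuming here that only the item parameter decides which product is shown, and that category and subcategory merely record how the visitor arrived; the function names are hypothetical.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only "item" changes the content; other parameters are navigation residue.
CONTENT_PARAMS = {"item"}


def canonical_product_url(url):
    """Strip query parameters that do not change the content of the page."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in CONTENT_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Build the <link> tag to output in the <head> of every variant of the page."""
    href = escape(canonical_product_url(url), quote=True)
    return '<link rel="canonical" href="%s" />' % href


variants = [
    "http://www.example.com/product.php?item=coca-cola",
    "http://www.example.com/product.php?item=coca-cola&category=drinks",
    "http://www.example.com/product.php?item=coca-cola&category=drinks&subcategory=fizzy",
]

for variant in variants:
    print(canonical_link_tag(variant))
```

All three variants produce the same tag, which is the whole point: however the visitor arrived, the page names one master URL.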

Other implications of a canonical tag:

Other services, such as Facebook, also look at canonical tags. When using their ‘like’ button, you can pass a URL across to Facebook, and that URL will be liked. However, Facebook is pretty clever.

Its sharing system goes back to the site in order to pull in Open Graph data (if it’s there) or generally scrape the page to bring a good overview of the site back to Facebook to share. If Facebook’s sharing system notices a canonical URL, it will scrape that page instead.

Using this incorrectly means you can essentially trick ‘like’ buttons into liking a different URL (or even domain).

With great power, comes great responsibility…

Using Canonical URLs incorrectly can lead to some interesting issues – with both good and bad consequences.

Sneaky SEO Implications:

Importantly, Google states that when canonical URLs are used: “Additional URL properties, like PageRank and related signals, are transferred as well.” This means there could be some sneaky SEO benefits to using them.

A lot of ad-driven websites chunk their content into pages. This improves impression rates (as well as annoying readers!). A good example of this can be found over at wired.com. You’ll notice that the article I’ve linked to is split over multiple pages.

You’ll also notice that the inner pages have a canonical URL pointing back to http://www.wired.com/gamelife/2011/05/3ds-virtual-console/. But why should they? In theory, they are all separate pages with separate content. They are therefore pages in their own right and don’t ‘share’ content at all. This could relate to the ‘view all’ function; however, pressing it on any page simply reloads that page with a ‘viewall’ attribute added to the URL.

In my eyes, this is NOT how canonical URLs should be used. They must be doing it so that, when the inner pages are crawled by search engines, any ranking factors are passed back to the front page. This means that any links inbound to the sub pages (and there are probably a lot of links to a site of this stature) will pass their weight to the front page.

Cross-domain rel="canonical" allows users to reference another domain for the content on a page. Used correctly, this is for when you move domains, or if you have multiple domains all referencing the same content.

Used incorrectly, you can pass weight to a page/site at an external domain by pointing to it with a cross-domain canonical URL. Say, for example, I get 1000 links to a page; Google will weight this page accordingly. If it has a cross-domain canonical URL of a different site, the weight will pass to that other page.

How to kill your site if you do it wrong:

There are quite a few examples of how canonical URLs can have an adverse effect on your site. If you use them incorrectly, Google may drop the listing for your page. If you implement them through any form of CMS, a single mistake is repeated on every page, which can quickly lead to your site being de-indexed before you even know that there is a problem!

See the following for a few examples of how canonical URLs can have a detrimental effect on your site:

http://www.malcolmcoles.co.uk/blog/rel-canonical-infinite-express/
http://www.socialpatterns.com/search-engine-marketing/msn-hit-by-canonical-url-problems/
http://www.dailymail.co.uk/sciencetech/article-1378504/The-Independent-embarrassed-Kate-Middleton-jelly-bean-URL-twitter-fiasco.html – sorry for linking to the Daily Mail…

What do the search engines say about canonical URLs?

Google:

Is rel="canonical" a hint or a directive?
It’s a hint that we honor strongly. We’ll take your preference into account, in conjunction with other signals, when calculating the most relevant page to display in search results.

Can I use a relative path to specify the canonical?
Yes, relative paths are recognized as expected with the <link> tag. Also, if you include a <base> link in your document, relative paths will resolve according to the base URL.
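
As a rough illustration of that answer, a relative canonical can be resolved the same way a browser resolves any relative link. The sketch below uses urljoin from Python's standard library; the function name resolve_canonical and the example URLs are my own assumptions, not anything Google has published.

```python
from urllib.parse import urljoin


def resolve_canonical(page_url, href, base_href=None):
    """Turn a possibly relative canonical href into an absolute URL."""
    # A <base href> in the document, if present, replaces the page URL as the base.
    # It may itself be relative, so resolve it against the page URL first.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)


page = "http://www.example.com/shop/product.php?item=coca-cola&category=drinks"

# Without a <base> tag, the href resolves against the page's own URL
print(resolve_canonical(page, "product.php?item=coca-cola"))

# With a hypothetical <base href="http://www.example.com/catalogue/">
print(resolve_canonical(page, "product.php?item=coca-cola", "http://www.example.com/catalogue/"))
```

The second call shows why a stray <base> tag matters: the same relative href now points at a completely different absolute URL.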

Is it okay if the canonical is not an exact duplicate of the content?
We allow slight differences, e.g., in the sort order of a table of products. We also recognize that we may crawl the canonical and the duplicate pages at different points in time, so we may occasionally see different versions of your content. All of that is okay with us.

What if the rel="canonical" returns a 404?
We’ll continue to index your content and use a heuristic to find a canonical, but we recommend that you specify existent URLs as canonicals.

What if the rel="canonical" hasn’t yet been indexed?
Like all public content on the web, we strive to discover and crawl a designated canonical URL quickly. As soon as we index it, we’ll immediately reconsider the rel="canonical" hint.

Can rel="canonical" be a redirect?
Yes, you can specify a URL that redirects as a canonical URL. Google will then process the redirect as usual and try to index it.

What if I have contradictory rel="canonical" designations?
Our algorithm is lenient: We can follow canonical chains, but we strongly recommend that you update links to point to a single canonical page to ensure optimal canonicalization results.

Yahoo!:

The URL paths in the tag can be absolute or relative, though we recommend using absolute paths to avoid any chance of errors.

A tag can only point to a canonical URL form within the same domain and not across domains. For example, a tag on http://test.example.com can point to a URL on http://www.example.com but not on http://yahoo.com or any other domain.

The tag will be treated similarly to a 301 redirect, in terms of transferring link references and other effects to the canonical form of the page.

We will use the tag information as provided, but we’ll also use algorithmic mechanisms to avoid situations where we think the tag was not used as intended. For example, if the canonical form is non-existent, returns an error or a 404, or if the content on the source and target was substantially distinct and unique, the canonical link may be considered erroneous and deferred.

The tag is transitive. That is, if URL A marks B as canonical, and B marks C as canonical, we’ll treat C as canonical for both A and B, though we will break infinite chains and other issues.
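
To make the transitive behaviour concrete, here is a small sketch of following a chain of canonical designations while guarding against loops. It assumes the designations have already been collected into a dictionary; the function name resolve_chain is hypothetical, and stopping at the last URL reached before a loop closes is my own choice, not documented search engine behaviour.

```python
def resolve_chain(url, canonical_of):
    """Follow canonical designations from url until they stop or start to loop."""
    seen = {url}
    current = url
    while current in canonical_of:
        target = canonical_of[current]
        if target in seen:
            # Loop detected (this also covers a page naming itself); stop here
            break
        seen.add(target)
        current = target
    return current


# A marks B as canonical, and B marks C as canonical
chain = {"A": "B", "B": "C"}
print(resolve_chain("A", chain))
print(resolve_chain("B", chain))

# X and Y point at each other, so the walk has to be cut short
loop = {"X": "Y", "Y": "X"}
print(resolve_chain("X", loop))
```

Running the same check over your own pages is a cheap way to find chains and loops before a search engine has to guess its way through them.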

Bing:

This tag will be interpreted as a hint by Live Search, not as a command. We’ll evaluate this in the context of all the other information we know about the website and try and make the best determination of the canonical URL. This will help us handle any potential implementation errors or abuse of this tag.

You can use relative or absolute URLs in the “href” attribute of the link tag.

The page and the URL in the “href” attribute must be on the same domain. For example, if the page is found on “http://mysite.com/default.aspx”, and the “href” attribute in the link tag points to “http://mysite2.com”, the tag will be invalid and ignored.

However, the “href” attribute can point to a different subdomain. For example, if the page is found on “http://mysite.com/default.aspx” and the “href” attribute in the link tag points to “http://www.mysite.com”, the tag will be considered valid.
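
The same-domain rule that Yahoo! and Bing describe can be approximated with a quick check like the one below. This is a deliberately naive sketch: it treats the last two labels of the hostname as the domain, which is wrong for suffixes such as .co.uk, and the function names are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def site_domain(url):
    """Naive guess at the domain: the last two labels of the hostname."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: two labels are enough. This breaks for suffixes like .co.uk.
    return ".".join(host.split(".")[-2:])


def same_site(page_url, canonical_href):
    """True if the canonical stays on the page's domain (subdomains allowed)."""
    # Resolve relative hrefs first so they are compared on the page's own host
    canonical_url = urljoin(page_url, canonical_href)
    return site_domain(page_url) == site_domain(canonical_url)


page = "http://mysite.com/default.aspx"
print(same_site(page, "http://www.mysite.com"))
print(same_site(page, "http://mysite2.com"))
print(same_site("http://test.example.com", "http://www.example.com"))
```

A real implementation would need a proper list of public suffixes rather than counting labels, but the sketch is enough to flag an accidental cross-domain canonical.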

By Douglas Radburn at 9:21AM on Tuesday, 17 May 2011

Doug has been our Senior Open Source Web Developer since bringing his knowledge and skills to Branded3 in 2009. A founding developer of our Twitition and Competwition platforms, Doug has also led our Open Source Projects on Magento ecommerce solutions and WordPress CMS platforms.
