Is Google’s reasonable surfer model no longer a model at all?
The original PageRank algorithm used a "random surfer" model and assumed that people moved randomly throughout the web. Crucially, every link on a page had an equal chance of being clicked as any other.
> PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank.

That description comes from Brin and Page's original paper, *The Anatomy of a Large-Scale Hypertextual Web Search Engine*.
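The random-surfer idea can be sketched as a simple power iteration: on each step, every page splits its rank equally across its outgoing links, with a damping factor modelling the surfer getting bored and jumping to a random page. This is a minimal illustration, not Google's implementation; the graph, damping factor, and iteration count are arbitrary choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small "random jump" share...
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank over all pages.
                for p in pages:
                    new[p] += damping * rank[page] / n
            else:
                # ...and splits the rest EQUALLY across its links --
                # this equal split is exactly what the reasonable
                # surfer model later replaces.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Here page `c` ends up with more rank than `b` because two pages link to it, even though each individual link is treated identically.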
The problem with this is that all links are not equal: some are much more important than others. In 2004 Google filed a patent for the Reasonable Surfer Model, which attempts to algorithmically determine which links are more important than others.
> Systems and methods consistent with the principles of the invention may provide a reasonable surfer model that indicates that when a surfer accesses a document with a set of links, the surfer will follow some of the links with higher probability than others.
>
> This reasonable surfer model reflects the fact that not all of the links associated with a document are equally likely to be followed. Examples of unlikely followed links may include "Terms of Service" links, banner advertisements, and links unrelated to the document.
The patent for this model was only recently granted, which is why it has been getting a lot of attention in the SEO industry.
This patent is interesting because it shows how, even in 2004, Google was trying to figure out which links were more trustworthy than others. The factors it considered included a lot of common-sense signals, such as those listed below, as well as user data collected by toolbars.
- Font size of anchor text associated with the link;
- The position of the link (measured, for example, in an HTML list, in running text, above or below the first screenful viewed on an 800 x 600 browser display, side (top, bottom, left, right) of document, in a footer, in a sidebar, etc.);
- If the link is in a list, the position of the link in the list;
- Font color and/or other attributes of the link (e.g., italics, gray, same color as background, etc.);
- Number of words in anchor text of a link;
- Actual words in the anchor text of a link;
- How commercial the anchor text associated with a link might be;
- Type of link (e.g., text link, image link);
- If the link is an image link, what the aspect ratio of the image might be;
- The context of a few words before and/or after the link;
- A topical cluster with which the anchor text of the link is associated;
- Whether the link leads somewhere on the same host or domain;
- If the link leads to somewhere on the same domain:
  - whether the link URL is shorter than the referring URL; and/or
  - whether the link URL embeds another URL (e.g., for server-side redirection).
What is Google doing now?
This model dates back to 2004, and many of the factors Google wanted to examine are hard to compute reliably. Parsing CSS to work out which links are visually prominent on a page, for example, is difficult to do accurately at scale.
The most interesting part of the model is the bit about user data. Google has a lot of it, thanks to Analytics, toolbars, and the links clicked in services like Google Reader. Many people are noticing a boost for links that get lots of tweets; others are noticing that links from certain high-traffic sites seem to pass more weight than similar links on low-traffic sites, despite the pages having similar PR/mozRank.
If Google really wanted to make its link algorithm trustworthy, then looking at how many real users actually click on a link would be a very accurate way to do it.