How user data, links & document scoring may be used in the “brand” algorithm

  • August 3, 2009
Patrick Altoft

Director of Strategy

Ever since the brand update hit the UK last month, thousands of people have been trying to analyse exactly what signals Google is using to give certain sites a boost in authority & rankings. A lot of people have been looking at this data since the changes were first noticed in January, but nobody really seems to have found the answer.

There are many SEO companies claiming credit for their large brand clients jumping from obscurity to a top 5 ranking for a big generic keyword, saying that they predicted the change and have been actively optimising towards it for months. Ask them exactly what they have been doing and you won’t get a straightforward answer.

We can guess at a lot of the signals, but a lot of the sites ranking for major terms seem to be doing so for absolutely no reason at all. Some of the pages don’t even have any of the target keywords in the title tag and others have zero inbound links. Google is using some kind of unknown algorithm to give pages a boost and nobody in the SEO industry knows what it is. That’s a pretty worrying state of affairs.

My initial thoughts are that it’s a combination of the following factors:

  • Search volume for the brand
  • Direct traffic to the brand site
  • Mentions of the brand name online in trusted sources
  • Homepage links from highly trusted sites (BBC, CNN etc)
  • A long-standing Wikipedia page for the site (just an idea, but why not?)
  • Number of times brand has been mentioned in the news recently
  • Trusted links to the target page
  • Trusted links to the entire domain

The problem with some of these factors is that they are totally uncontrollable by an average SEO company working on link building and on-page optimisation. Companies spend years building up brand awareness with high street stores, PR campaigns, TV advertising, celebrity endorsements and sponsorship. These are things that can’t be implemented by an SEO company on a 12 month contract.

At the moment the fact nobody in the industry seems to understand the new algorithm isn’t really a major problem because it applies to so few keywords (although the keywords are high volume ones). The big problem comes when Google rolls out this brand algorithm to every single generic and mid-tail keyword and kills the rankings for thousands of non-brand sites in the process. That’s when the non-brand sites will start to ask the question – what on earth are we going to do?

One theory is that non-brands need to invest more in content & social media which is quite true although I’m not sure whether that will be enough to compete with a brand site.

The more useful content you can provide on your site, the more use it is to visitors, and therefore the more it will be talked about and referenced elsewhere, enabling it to become a known brand. Lesser known brands need to concentrate on brand awareness, putting emphasis on using social media to push themselves as recognised specialists in their sector, and understanding their position and standing in that sector’s network. Such brands should create good content and promote themselves with creative and interesting marketing ideas to create uplift in brand awareness. Get people talking about your site; post-Vince, there is even more chance that this will feed your rankings.

Another analysis of the top 30 results for “car insurance” shows that the external mozRank is quite well correlated with rankings which is good news because mozRank is directly controllable by an SEO company.
[Figure: mozRank for sites in brand update]

From this graph Richard concludes:

Inbound anchor text (100% pure or containing other terms) remains a strong ranking factor with emphasis on the quality of the inbound link passing value via that anchor text

All valid points but I’m sure there must be something more to it than content & links.

What’s really happening

A very powerful theory is that Google is using increasingly complex methods of evaluating link & query data together and that this is giving brands much higher trust. The Document Scoring Based on Link-based Criteria patent filed by Google in 2006 lists Matt Cutts as an inventor and describes how

A system may determine time-varying behavior of links pointing to a document, generate a score for the document based, at least in part, on the time-varying behavior of the links pointing to the document, and rank the document with regard to at least one other document based, at least in part, on the score.

The associated thread at WMW should be required reading for any SEO but I will summarise the patent below as well.

[0043] Consider the example of a document with an inception date of yesterday that is referenced by 10 back links. This document may be scored higher by search engine 125 than a document with an inception date of 10 years ago that is referenced by 100 back links because the rate of link growth for the former is relatively higher than the latter. While a spiky rate of growth in the number of back links may be a factor used by search engine 125 to score documents, it may also signal an attempt to spam search engine 125. Accordingly, in this situation, search engine 125 may actually lower the score of a document(s) to reduce the effect of spamming.

In plain English: New content can sometimes be given a boost but it depends on the query. Spiky link growth can be good but only if associated with a spike in search volume too (see my QDF post).

[0045] In one implementation, search engine 125 may modify the link-based score of a document as follows: H=L/log (F+2), where H may refer to the history-adjusted link score, L may refer to the link score given to the document, which can be derived using any known link scoring technique (e.g., the scoring technique described in U.S. Pat. No. 6,285,999) that assigns a score to a document based on links to/from the document, and F may refer to elapsed time measured from the inception date associated with the document (or a window within this period).

[0046] For some queries, older documents may be more favorable than newer ones. As a result, it may be beneficial to adjust the score of a document based on the difference (in age) from the average age of the result set. In other words, search engine 125 may determine the age of each of the documents in a result set (e.g., using their inception dates), determine the average age of the documents, and modify the scores of the documents (either positively or negatively) based on a difference between the documents’ age and the average age.

In plain English: Google knows that newer documents will have fewer links than older ones and takes this into account in the algorithm. Older documents that regularly attract new links are going to rank better than older documents that don’t attract new links.
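To make the formula in [0045] concrete, here is a minimal Python sketch. The patent excerpt doesn’t say which logarithm is used or what units F is measured in, so the natural log and days are assumptions on my part, and the `weight` parameter in the age adjustment from [0046] is purely hypothetical.

```python
import math


def history_adjusted_link_score(link_score: float, elapsed: float) -> float:
    """H = L / log(F + 2), as written in paragraph [0045].

    Assumptions (not stated in the patent excerpt): natural logarithm,
    and `elapsed` (F) measured in days since the inception date.
    """
    if elapsed < 0:
        raise ValueError("elapsed time cannot be negative")
    return link_score / math.log(elapsed + 2)


def age_adjusted_score(score: float, age: float, result_ages: list[float],
                       weight: float = 0.0) -> float:
    """Nudge a score by the document's distance from the average age
    of the result set, per [0046].

    `weight` is a hypothetical per-query knob: positive favours older
    documents, negative favours newer ones, zero leaves the score alone.
    """
    average_age = sum(result_ages) / len(result_ages)
    return score + weight * (age - average_age)


# The same link score counts for less as the document gets older.
fresh = history_adjusted_link_score(10.0, elapsed=1)
old = history_adjusted_link_score(10.0, elapsed=3650)
print(fresh > old)
```

The point of the sketch is only the shape of the curve: the divisor grows slowly with age, so an old document needs to keep earning link score to hold its position.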

[0049] In one implementation, search engine 125 may generate a content update score (U) as follows: U=f(UF,UA), where f may refer to a function, such as a sum or weighted sum, UF may refer to an update frequency score that represents how often a document (or page) is updated, and UA may refer to an update amount score that represents how much the document (or page) has changed over time. UF may be determined in a number of ways, including as an average time between updates, the number of updates in a given time period, etc

[0051] According to one exemplary implementation, UA may be determined as a function of differently weighted portions of document content. For instance, content deemed to be unimportant if updated/changed, such as Javascript, comments, advertisements, navigational elements, boilerplate material, or date/time tags, may be given relatively little weight or even ignored altogether when determining UA. On the other hand, content deemed to be important if updated/changed (e.g., more often, more recently, more extensively, etc.), such as the title or anchor text associated with the forward links, could be given more weight than changes to other content when determining UA.

In plain English: Google looks at whether the important body content of documents is being updated and if a certain query requires fresh results the algorithm will favour documents that are updated on a regular basis.
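As a rough illustration of U=f(UF,UA), the sketch below uses a weighted sum for f (one of the options the patent names) and weights changes by page section as described in [0051]. The section names, the weights and the 0-to-1 scaling are all made up for the example; the patent gives no actual values.

```python
def update_amount(changed_fraction: dict[str, float],
                  section_weights: dict[str, float]) -> float:
    """UA: how much a page changed, weighted by which part changed.

    `changed_fraction` maps a section name to the share of it that
    changed (0.0 to 1.0). Sections with no weight are ignored, which
    mirrors the patent's treatment of ads, navigation and boilerplate.
    """
    return sum(section_weights.get(section, 0.0) * fraction
               for section, fraction in changed_fraction.items())


def content_update_score(update_frequency: float, update_amount_score: float,
                         w_freq: float = 0.5, w_amount: float = 0.5) -> float:
    """U = f(UF, UA), with f assumed to be a weighted sum."""
    return w_freq * update_frequency + w_amount * update_amount_score


# Hypothetical weights: the title matters a lot, adverts not at all.
weights = {"title": 0.8, "body": 0.2, "ads": 0.0}
ua = update_amount({"title": 1.0, "ads": 1.0}, weights)
u = content_update_score(update_frequency=0.2, update_amount_score=ua)
```

Here a page that rewrote its title and swapped all of its adverts gets credit only for the title change.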

[0053] In some situations, data storage resources may be insufficient to store the documents when monitoring the documents for content changes. In this case, search engine 125 may store representations of the documents and monitor these representations for changes. For example, search engine 125 may store “signatures” of documents instead of the (entire) documents themselves to detect changes to document content. In this case, search engine 125 may store a term vector for a document (or page) and monitor it for relatively large changes. According to another implementation, search engine 125 may store and monitor a relatively small portion (e.g., a few terms) of the documents that are determined to be important or the most frequently occurring (excluding “stop words”).

[0054] According to yet another implementation, search engine 125 may store a summary or other representation of a document and monitor this information for changes. According to a further implementation, search engine 125 may generate a similarity hash (which may be used to detect near-duplication of a document) for the document and monitor it for changes. A change in a similarity hash may be considered to indicate a relatively large change in its associated document. In other implementations, yet other techniques may be used to monitor documents for changes. In situations where adequate data storage resources exist, the full documents may be stored and used to determine changes rather than some representation of the documents.

In plain English: Google stores every iteration of a page as it changes over time but to save on storage space they only store a representation or signature of the document rather than the entire document.
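One way to picture the “signature” idea in [0053] is a small term vector compared between crawls. This is only a sketch: the stop word list, the `top_n` cut-off, the cosine comparison and the 0.8 threshold are my assumptions, not anything the patent specifies.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop word list, not a real one.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"})


def term_vector(text: str, top_n: int = 50) -> dict[str, int]:
    """Keep only the most frequent non-stop-word terms as the signature."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOP_WORDS]
    return dict(Counter(words).most_common(top_n))


def cosine_similarity(a: dict[str, int], b: dict[str, int]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def changed_significantly(old_sig: dict[str, int], new_sig: dict[str, int],
                          threshold: float = 0.8) -> bool:
    """Flag a 'relatively large change' when similarity drops below threshold."""
    return cosine_similarity(old_sig, new_sig) < threshold


before = term_vector("Riverview swimming schedule for the summer term")
after = term_vector("Cheap car insurance quotes online")
print(changed_significantly(before, before), changed_significantly(before, after))
```

Storing a few dozen term counts per crawl is far cheaper than storing the whole page, which is the trade-off the patent describes.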

[0055] For some queries, documents with content that has not recently changed may be more favorable than documents with content that has recently changed.

In plain English: Sometimes Google lowers the rankings of pages that are changed on a regular basis – probably on a query basis.

[0057] According to an implementation consistent with the principles of the invention, one or more query-based factors may be used to generate (or alter) a score associated with a document. For example, one query-based factor may relate to the extent to which a document is selected over time when the document is included in a set of search results. In this case, search engine 125 might score documents selected relatively more often/increasingly by users higher than other documents.

In plain English: If a search result achieves a higher than average click through rate then it may be given a higher ranking.

[0058] Another query-based factor may relate to the occurrence of certain search terms appearing in queries over time. A particular set of search terms may increasingly appear in queries over a period of time. For example, terms relating to a “hot” topic that is gaining/has gained popularity or a breaking news event would conceivably appear frequently over a period of time. In this case, search engine 125 may score documents associated with these search terms (or queries) higher than documents not associated with these terms.

In plain English: A hot topic triggers the Query Deserves Freshness algorithm and gives a boost to topical documents.

[0063] Yet another query-based factor may relate to the extent to which a document appears in results for different queries. In other words, the entropy of queries for one or more documents may be monitored and used as a basis for scoring. For example, if a particular document appears as a hit for a discordant set of queries, this may (though not necessarily) be considered a signal that the document is spam, in which case search engine 125 may score the document relatively lower.

In plain English: If a page starts ranking for more queries than Google thinks it deserves then the algorithm will lower the rankings.
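The “entropy of queries” in [0063] can be sketched with Shannon entropy over the topics of the queries a page shows up for. The patent doesn’t say how entropy is computed, so this is just one plausible reading, and the topic labels are hypothetical (it assumes each query has already been classified somehow).

```python
import math
from collections import Counter


def query_topic_entropy(query_topics: list[str]) -> float:
    """Shannon entropy (in bits) of the topics a document ranks for.

    Low entropy: the page appears for a coherent set of queries.
    High entropy: the page appears for a discordant set, which the
    patent says may (though not necessarily) signal spam.
    """
    if not query_topics:
        return 0.0
    total = len(query_topics)
    counts = Counter(query_topics)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


focused = query_topic_entropy(["insurance"] * 4)
scattered = query_topic_entropy(["insurance", "poker", "pharmacy", "ringtones"])
print(focused < scattered)
```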

[0068] By analyzing the change in the number or rate of increase/decrease of back links to a document (or page) over time, search engine 125 may derive a valuable signal of how fresh the document is. For example, if such analysis is reflected by a curve that is dropping off, this may signal that the document may be stale (e.g., no longer updated, diminished in importance, superceded by another document, etc.).

In plain English: If you build a lot of links to a document and then slow down that link building the document could appear stale and rankings will drop off.

[0070] For the purpose of illustration, consider y =10 and two documents (web sites in this example) that were both first found 100 days ago. For the first site, 10% of the links were found less than 10 days ago, while for the second site 0% of the links were found less than 10 days ago (in other words, they were all found earlier). In this case, the metric results in 0.1 for site A and 0 for site B. The metric may be scaled appropriately. In another exemplary implementation, the metric may be modified by performing a relatively more detailed analysis of the distribution of link dates. For example, models may be built that predict if a particular distribution signifies a particular type of site (e.g., a site that is no longer updated, increasing or decreasing in popularity, superceded, etc.).

In plain English: Google looks at the rate of link growth over time to indicate whether a document is fresh and can then increase or reduce rankings depending on the query.
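The worked example in [0070] can be reproduced in a few lines. I’m assuming the metric is simply the share of a site’s links first seen less than y days ago, which matches the 0.1 and 0 figures quoted; the function name and the way link ages are passed in are my own.

```python
def recent_link_fraction(link_ages_days: list[float], window_days: float = 10) -> float:
    """Share of back links first found less than `window_days` ago."""
    if not link_ages_days:
        return 0.0
    recent = sum(1 for age in link_ages_days if age < window_days)
    return recent / len(link_ages_days)


# Site A: 10 of 100 links found in the last 10 days. Site B: none.
site_a = [5] * 10 + [50] * 90
site_b = [50] * 100
print(recent_link_fraction(site_a), recent_link_fraction(site_b))
```

The patent goes on to say the metric may be scaled, or replaced by a fuller model of the distribution of link dates, so treat this as the simplest possible version.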

[0075] The dates that links appear can also be used to detect “spam,” where owners of documents or their colleagues create links to their own document for the purpose of boosting the score assigned by a search engine. A typical, “legitimate” document attracts back links slowly. A large spike in the quantity of back links may signal a topical phenomenon (e.g., the CDC web site may develop many links quickly after an outbreak, such as SARS), or signal attempts to spam a search engine (to obtain a higher ranking and, thus, better placement in search results) by exchanging links, purchasing links, or gaining links from documents without editorial discretion on making links. Examples of documents that give links without editorial discretion include guest books, referrer logs, and “free for all” pages that let anyone add a link to a document.

In plain English: If a document gets a spike in link growth without being involved in some kind of newsworthy event then the links will be devalued as possible spam.

[0076] According to a further implementation, the analysis may depend on the date that links disappear. The disappearance of many links can mean that the document to which these links point is stale (e.g., no longer being updated or has been superseded by another document). For example, search engine 125 may monitor the date at which one or more links to a document disappear, the number of links that disappear in a given window of time, or some other time-varying decrease in the number of links (or links/updates to the documents containing such links) to a document to identify documents that may be considered stale. Once a document has been determined to be stale, the links contained in that document may be discounted or ignored by search engine 125 when determining scores for documents pointed to by the links.

In plain English: If a page starts to lose links, for example because rented links are not renewed, then that page will appear stale and may lose rankings. A link from a document that is judged to be stale can also be devalued.

[0080] Alternatively, if the content of a document changes such that it differs significantly from the anchor text associated with its back links, then the domain associated with the document may have changed significantly (completely) from a previous incarnation. This may occur when a domain expires and a different party purchases the domain. Because anchor text is often considered to be part of the document to which its associated link points, the domain may show up in search results for queries that are no longer on topic. This is an undesirable result.

In plain English: Google looks at a change in anchor text in the same way as a change in a document or website. Be careful if you buy an old site and start altering anchor text in internal or external links.

[0084] According to an implementation consistent with the principles of the invention, information relating to traffic associated with a document over time may be used to generate (or alter) a score associated with the document. For example, search engine 125 may monitor the time-varying characteristics of traffic to, or other “use” of, a document by one or more users. A large reduction in traffic may indicate that a document may be stale (e.g., no longer be updated or may be superseded by another document).

In plain English: A document that gets a lot of traffic is given a boost; a document that loses traffic is regarded as stale and has its rankings lowered.

[0086] Additionally, or alternatively, search engine 125 may monitor time-varying characteristics relating to “advertising traffic” for a particular document. For example, search engine 125 may monitor one or a combination of the following factors: (1) the extent to and rate at which advertisements are presented or updated by a given document over time; (2) the quality of the advertisers (e.g., a document whose advertisements refer/link to documents known to search engine 125 over time to have relatively high traffic and trust, such as, may be given relatively more weight than those documents whose advertisements refer to low traffic/untrustworthy documents, such as a pornographic site); and (3) the extent to which the advertisements generate user traffic to the documents to which they relate (e.g., their click-through rate). Search engine 125 may use these time-varying characteristics relating to advertising traffic to score the document.

In plain English: Google may reduce or increase rankings depending on how trusted the sites advertising on a particular site are and how much traffic you are sending to them. If you display adverts for Amazon and send them a lot of traffic then you are given a boost. Seems far fetched.

[0089] If a document is returned for a certain query and over time, or within a given time window, users spend either more or less time on average on the document given the same or similar query, then this may be used as an indication that the document is fresh or stale, respectively. For example, assume that the query “Riverview swimming schedule” returns a document with the title “Riverview Swimming Schedule.” Assume further that users used to spend 30 seconds accessing it, but now every user that selects the document only spends a few seconds accessing it. Search engine 125 may use this information to determine that the document is stale (i.e., contains an outdated swimming schedule) and score the document accordingly.

In plain English: Google looks at bounce rates and how long users spend looking at pages to determine whether the page is a good result or not.

[0093] Certain signals may be used to distinguish between illegitimate and legitimate domains. For example, domains can be renewed up to a period of 10 years. Valuable (legitimate) domains are often paid for several years in advance, while doorway (illegitimate) domains rarely are used for more than a year. Therefore, the date when a domain expires in the future can be used as a factor in predicting the legitimacy of a domain and, thus, the documents associated therewith.

[0094] Also, or alternatively, the domain name server (DNS) record for a domain may be monitored to predict whether a domain is legitimate. The DNS record contains details of who registered the domain, administrative and technical addresses, and the addresses of name servers (i.e., servers that resolve the domain name into an IP address). By analyzing this data over time for a domain, illegitimate domains may be identified. For instance, search engine 125 may monitor whether physically correct address information exists over a period of time, whether contact information for the domain changes relatively often, whether there is a relatively high number of changes between different name servers and hosting companies, etc. In one implementation, a list of known-bad contact information, name servers, and/or IP addresses may be identified, stored, and used in predicting the legitimacy of a domain and, thus, the documents associated therewith.

In plain English: Google looks for signals of trust in your domain registration data. Most legitimate companies will be OK here.

[0097] According to an implementation consistent with the principles of the invention, information relating to prior rankings of a document may be used to generate (or alter) a score associated with the document. For example, search engine 125 may monitor the time-varying ranking of a document in response to search queries provided to search engine 125. Search engine 125 may determine that a document that jumps in rankings across many queries might be a topical document or it could signal an attempt to spam search engine 125.

In plain English: If Google notices a site gaining rankings too fast then it will be held back unless the document is considered topical. Probably why building rankings for ecommerce pages takes a long time.

[0099] A query set (e.g., of commercial queries) can be repeated, and documents that gained more than M % in the rankings may be flagged or the percentage growth in ranking may be used as a signal in determining scores for the documents. For example, search engine 125 may determine that a query is likely commercial if the average (median) score of the top results is relatively high and there is a significant amount of change in the top results from month to month. Search engine 125 may also monitor churn as an indication of a commercial query. For commercial queries, the likelihood of spam is higher, so search engine 125 may treat documents associated therewith accordingly.

In plain English: Google is more careful with commercial queries and won’t let sites gain rankings too fast.

[0102] In addition, or alternatively, search engine 125 may monitor the ranks of documents over time to detect sudden spikes in the ranks of the documents. A spike may indicate either a topical phenomenon (e.g., a hot topic) or an attempt to spam search engine 125 by, for example, trading or purchasing links. Search engine 125 may take measures to prevent spam attempts by, for example, employing hysteresis to allow a rank to grow at a certain rate. In another implementation, the rank for a given document may be allowed a certain maximum threshold of growth over a predefined window of time. As a further measure to differentiate a document related to a topical phenomenon from a spam document, search engine 125 may consider mentions of the document in news articles, discussion groups, etc. on the theory that spam documents will not be mentioned, for example, in the news. Any or a combination of these techniques may be used to curtail spamming attempts.

[0103] It may be possible for search engine 125 to make exceptions for documents that are determined to be authoritative in some respect, such as government documents, web directories (e.g., Yahoo), and documents that have shown a relatively steady and high rank over time. For example, if an unusual spike in the number or rate of increase of links to an authoritative document occurs, then search engine 125 may consider such a document not to be spam and, thus, allow a relatively high or even no threshold for (growth of) its rank (over time).

In plain English: It’s almost impossible to improve rankings faster than Google wants you to. The only way around this is to do link building as part of a news/buzz campaign so that the links appear natural.
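The “maximum threshold of growth over a predefined window” in [0102], with the exception for authoritative documents in [0103], might look something like this. The 20% cap and the boolean `authoritative` flag are placeholders I’ve invented; the patent gives no numbers.

```python
def capped_score(previous: float, proposed: float,
                 max_growth: float = 0.2, authoritative: bool = False) -> float:
    """Limit how far a score may rise within one time window.

    `max_growth` is a hypothetical cap (0.2 = at most 20% per window).
    Authoritative documents are exempt, as [0103] suggests they may be.
    Falls in score are never capped.
    """
    if authoritative:
        return proposed
    ceiling = previous * (1 + max_growth)
    return min(proposed, ceiling)


# A page whose raw score doubles overnight is only allowed part of the gain.
print(capped_score(100.0, 200.0), capped_score(100.0, 200.0, authoritative=True))
```

Run window after window, a cap like this is what makes rankings for a suddenly well-linked page creep up rather than jump.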

[0106] According to an implementation consistent with the principles of the invention, user maintained or generated data may be used to generate (or alter) a score associated with a document. For example, search engine 125 may monitor data maintained or generated by a user, such as “bookmarks,” “favorites,” or other types of data that may provide some indication of documents favored by, or of interest to, the user. Search engine 125 may obtain this data either directly (e.g., via a browser assistant) or indirectly (e.g., via a browser). Search engine 125 may then analyze over time a number of bookmarks/favorites to which a document is associated to determine the importance of the document.

In plain English: Google may look at how many people add a certain site/page to their browser favourites / bookmarks and use this as a signal of trust. The rate of addition & deletion of the favourites is also looked at.

[0108] In an alternative implementation, other types of user data that may indicate an increase or decrease in user interest in a particular document over time may be used by search engine 125 to score the document. For example, the “temp” or cache files associated with users could be monitored by search engine 125 to identify whether there is an increase or decrease in a document being added over time. Similarly, cookies associated with a particular document might be monitored by search engine 125 to determine whether there is an upward or downward trend in interest in the document.

In plain English: Google might be looking at your history & cache to see which sites you visit and giving those sites a boost.

[0110] According to an implementation consistent with the principles of the invention, information regarding unique words, bigrams, and phrases in anchor text may be used to generate (or alter) a score associated with a document. For example, search engine 125 may monitor web (or link) graphs and their behavior over time and use this information for scoring, spam detection, or other purposes. Naturally developed web graphs typically involve independent decisions. Synthetically generated web graphs, which are usually indicative of an intent to spam, are based on coordinated decisions, causing the profile of growth in anchor words/bigrams/phrases to likely be relatively spiky.

[0111] One reason for such spikiness may be the addition of a large number of identical anchors from many documents. Another possibility may be the addition of deliberately different anchors from a lot of documents. Search engine 125 may monitor the anchors and factor them into scoring a document to which their associated links point. For example, search engine 125 may cap the impact of suspect anchors on the score of the associated document. Alternatively, search engine 125 may use a continuous scale for the likelihood of synthetic generation and derive a multiplicative factor to scale the score for the document.

In plain English: Unnatural anchor text growth is a signal of spam and Google will reduce the value of the links accordingly. Natural links with a range of anchor texts are better.
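Here is a small sketch of the “multiplicative factor” idea from [0111], applied to a batch of newly discovered links. It only covers the identical-anchors case, and the 50% `max_share` threshold and the scaling rule are my own assumptions rather than anything from the patent.

```python
from collections import Counter


def anchor_spam_factor(new_anchors: list[str], max_share: float = 0.5) -> float:
    """Return a multiplier in (0, 1] for the value of a batch of new links.

    If one anchor text makes up more than `max_share` of the batch, the
    batch is scaled down in proportion to how dominant that anchor is.
    """
    if not new_anchors:
        return 1.0
    counts = Counter(anchor.strip().lower() for anchor in new_anchors)
    top_share = counts.most_common(1)[0][1] / len(new_anchors)
    if top_share <= max_share:
        return 1.0
    return max_share / top_share


natural = ["Acme", "acme.com", "car insurance", "click here"]
synthetic = ["car insurance"] * 10
print(anchor_spam_factor(natural), anchor_spam_factor(synthetic))
```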

[0114] A sudden growth in the number of apparently independent peers, incoming and/or outgoing, with a large number of links to individual documents may indicate a potentially synthetic web graph, which is an indicator of an attempt to spam. This indication may be strengthened if the growth corresponds to anchor text that is unusually coherent or discordant. This information can be used to demote the impact of such links, when used with a link-based scoring technique, either as a binary decision item (e.g., demote the score by a fixed amount) or a multiplicative factor.

In plain English: A site gaining lots of links from unrelated sites with the same anchor text has the value of those links reduced.

[0117] A significant change over time in the set of topics associated with a document may indicate that the document has changed owners and previous document indicators, such as score, anchor text, etc., are no longer reliable. Similarly, a spike in the number of topics could indicate spam. For example, if a particular document is associated with a set of one or more topics over what may be considered a “stable” period of time and then a (sudden) spike occurs in the number of topics associated with the document, this may be an indication that the document has been taken over as a “doorway” document. Another indication may include the disappearance of the original topics associated with the document. If one or more of these situations are detected, then search engine 125 may reduce the relative score of such documents and/or the links, anchor text, or other data associated the document.

In plain English: If a site starts producing a lot of off-topic content Google may reduce the trust of that site.

What this all means

Google is using hundreds of signals and not all of them are within our control. We need to be smart and use huge amounts of intelligence with link-building to avoid documents being labelled as stale. Don’t go out and build loads of links for new documents because you can’t keep the same rate of linkbuilding going – start slow and build up naturally.

If you want to build links then you have to build them as part of a buzz/PR campaign otherwise they are going to be devalued. Any increase in links or attempt to improve rankings needs to be done for a reason – you can’t do it in isolation.

I will leave you with a quote from Tedster at WMW who sums it up nicely:

Here’s my current idea. I believe that Google’s staff contains more statisticians than any other specialty. The algo is, more and more, driven by statistics and probability. These statisticians watch query data as well as backlink data. That’s what jumped out at me while re-reading this patent: backlinks PLUS queries.

Google’s statisticians know what queries currently show bursts of interest from the general public. They know what companies are getting navigational queries – and they know when any online business is truly growing in brand recognition. For example, queries like [company keyword] will start increasing if there is a real growing interest. We puzzle over “Update Vince”? How about defining “brand” by folding data on navigational queries into the ranking algo.

When backlink numbers start growing, then that new “interest” at the webmaster level should be supported by the general population’s query data. In other words, if the backlink growth is relatively “natural”, then it should show a certain statistical footprint.

If the spike in backlink growth is too far outside that statistical footprint, then Google will take steps to limit the effect of that apparent SERP manipulation.

The statistically normal expectations are, by this time, quite granular and gaining in sophistication. The patent I mentioned in the opening post lists many possible measures that Google can take to determine when patterns are outside the natural range. And they’re probably making many others we haven’t even guessed at.

This is my current brainstorming area, and it’s why I recommend the idea of ATTRACTING backlinks more than “building” them. Backlinks alone cannot create a statistically correct footprint for a growing, thriving website. Even though such a “dummied-up” impression has been a working tool for improved ranking in the past, it’s a tool whose future is getting more and more cloudy.