Google Patent reveals plan to fill gaps left by content farms & how quality is judged
The system behind Demand Media is an automated topic suggestion tool: it finds keywords and topics, then scores them against metrics such as search volume, competitiveness, and the likely profitability of running ads on the resulting pages. Topics that clear a profitability threshold are queued in the system, and writers pick from the queue and publish content accordingly.
Given Google's dim view of content farms, it might surprise you to hear that Google has applied for a patent covering much the same thing. Google has been working on killing content farms for over 12 months, and it's worth revisiting a patent published in Feb 2010 now that we have seen the initial effects of the Panda algorithm.
The patent basically covers a system for identifying search queries that return low-quality content and then asking either publishers or the people searching for that topic to create something better themselves. The system takes search volume into account when judging content quality, so for bigger keywords the existing content would need to be proportionally better before Google stops needing to suggest that somebody write something new.
There are loads of potential ways Google could implement this sort of system:
- Sell story ideas to publishers
- Work with highly trusted partners to get them to write content that Google knows will be good
- Give the data away in their keyword research tool
- Create an aggregation system similar to how reviews are pulled into Google Places to show links to related content
- Add wiki-style user contribution sections to search results
Rather than explain all this myself, below are some key quotes and diagrams from the patent; I've highlighted the important bits.
The statistics collection and analysis engine 110 can also determine the relevance (e.g., the IR score) of documents associated with a search query (e.g., a topic corpus) to identify a quality associated with the results of the query. In some examples, the statistics collection and analysis engine 110 can combine the relevance with a reputation (e.g., node ranking) to determine quality associated with the topic corpus associated with the query. In some examples, the statistics collection and analysis engine 110 can include a comparator, for example, that can compare the popularity of a search to the quality of the topic corpus. Such a comparison can be used, for example, to determine whether the topic is adequately served by the topic corpus.
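To make the comparator idea concrete, here is a minimal sketch in Python. The signal names (`ir_score`, `node_rank`, `popularity`) and the simple multiply-and-average scoring are my own illustrative stand-ins for the patent's relevance, reputation, and search-volume inputs, not anything Google has disclosed:

```python
# Hypothetical sketch: combine relevance (IR score) with reputation
# (node ranking) per document, average over the topic corpus, then
# compare the popularity of the search to that quality figure.

def topic_quality(results):
    """Average of relevance * reputation across the topic corpus."""
    if not results:
        return 0.0
    return sum(r["ir_score"] * r["node_rank"] for r in results) / len(results)

def is_underserved(results, popularity, threshold=1.0):
    """True when the popularity of the search outweighs the quality
    of the topic corpus, i.e. the topic is not adequately served."""
    quality = topic_quality(results)
    if quality == 0:
        return True
    return popularity / quality > threshold

corpus = [
    {"ir_score": 0.4, "node_rank": 0.3},  # weak, low-reputation page
    {"ir_score": 0.5, "node_rank": 0.2},  # barely relevant page
]
print(is_underserved(corpus, popularity=0.9))  # high demand, thin corpus
```

The key design point from the quote is that neither quality nor popularity is judged in isolation: it is the ratio between them that flags a topic as inadequately served.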
In addition to underserved queries entered in a query system, other related data clarifying queries’ meanings or assisting in providing information about the underserved topics can be collected and analyzed by the statistics collection and analysis engine 110. Characteristics directly related to queries such as language distribution, geographic distribution, demographic distribution, and time distribution can also be collected by the statistics collection and analysis engine 110. Queries that are associated with a time distribution can be an indication, for example, that a query is popular around particular holidays, days of the week, and times of the day. In some implementations, query frequency can also be collected, and a source associated with queries can be annotated, for example, when the queries come from multiple sources. Thus, the statistics collection and analysis engine 110 can be configured to collect a variety of information which can be used to analyze content quality and/or popularity.
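The paragraph above is essentially describing a per-query statistics record. A sketch of what such a record might hold, with field names invented for illustration:

```python
# Hypothetical per-query statistics record covering the distributions
# the patent mentions: language, geography, time, frequency, and
# annotated query sources.
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class QueryStats:
    query: str
    frequency: int = 0
    languages: Counter = field(default_factory=Counter)  # language distribution
    regions: Counter = field(default_factory=Counter)    # geographic distribution
    hours: Counter = field(default_factory=Counter)      # time-of-day distribution
    sources: Counter = field(default_factory=Counter)    # annotated query sources

    def record(self, language, region, hour, source):
        self.frequency += 1
        self.languages[language] += 1
        self.regions[region] += 1
        self.hours[hour] += 1
        self.sources[source] += 1

stats = QueryStats("mothers day gifts")
stats.record("en", "US", 9, "web")
stats.record("en", "UK", 20, "mobile")
print(stats.frequency)  # 2
```

A time-of-day or day-of-week counter like `hours` is what would let the engine spot the holiday and weekday seasonality the patent describes.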
In some examples, a search engine 120 can notify a searcher that quality associated with the topic corpus is low, for example, in comparison to topics which are similarly popular based on search logs. The search engine 120 can infer from the searcher’s query that the searcher is interested in the subject, and thus that the searcher might have more information that can be included in the topic corpus. The search engine 120 can use the notification to invite the searcher to provide additional content for inclusion in the topic corpus. Notification provided to a searcher with the search results, for example, can help to ensure that the topic is suggested at a rate similar to the demand for the topic.
FIG. 2 is a block diagram of a network environment 200 including a topic distribution engine 210 to suggest topics to content generators 220, 230, 240. A statistics collection and analysis engine 110 can identify areas (e.g., topics) having inadequate content and communicate these areas to a topic distribution engine 210. The topic distribution engine 210 can provide topics comprising the identified areas to content generators 220, 230, 240. In one implementation, topic suggestions can be provided to content generators 220, 230, 240 that have knowledge about the suggested topic. For example, an underserved sports topic can be suggested to a sports-related publisher.
The content generators 220, 230, 240 can include a variety of different mechanisms for creating additional content for the topic corpus. For example, the content generators 220, 230, 240 can include web publishers 220. Web publishers 220 can be, for example, a business enterprise operating to create content for consumers. The web publishers 220, for example, can operate based on an ad sales model 222. In this model, the web publishers can create content available on an associated website for free. The web publishers can then collect visitor statistics and sell advertising space on the associated website to advertisers based on the number of visitors viewing the website.
Alternatively, web publishers 220 can operate on a subscription based model 224. For example, web publishers 220 can sell subscriptions to users in exchange for online access to the content created by the web publisher. Such web publishers can include, for example, newspaper websites, encyclopedia websites, dictionary/thesaurus websites, etc.
Although there is an incentive for web publishers 220 to create web pages, the web publishers 220 frequently are not aware of demand for particular types of information, and thus do not know which information to make available. A search provider (e.g., search engine 120 of FIG. 1) has access to a wide variety of information requests and can also measure the availability of corresponding results. A statistics collection and analysis system 110 can compile instances where few quality search results can be found for a term, and a topics suggestion engine 210 can suggest to the searcher that more information is needed. The search provider has an incentive to provide statistics collection engines 110 and topic distribution engine 210 because the search provider’s goal is to provide high quality information in order to maintain user satisfaction and loyalty to the search provider. When there is no high quality content, the user might become dissatisfied with the search provider.
If the search provider includes a publisher-incentive system, the search provider has an additional incentive to encourage additional content. For example, if the searcher has expressed an interest in the topic (which can be inferred from the entry of the query), the search provider can request that the searcher publish web pages on the topic(s), perhaps by researching the subject (offline and/or online) and creating content based on the research.
If the search provider includes a publisher-incentive system, the searcher may receive additional benefits. Publisher incentive systems can also operate to encourage high quality information by comparing the additional content to queries on the topic, and/or node ranking(s) associated with the document after publication. For example, publisher incentive systems can set the incentives according to the demand for the underserved topics, with a higher reward for underserved topics in high demand and a lower reward for topics with less demand, thereby providing a progressive publisher incentive system.
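The "progressive publisher incentive" can be sketched as a reward function that scales with demand. The base rate and demand tiers below are invented for illustration; the patent only specifies the principle that higher-demand underserved topics earn a higher reward:

```python
# Hypothetical progressive incentive: reward grows with demand for the
# underserved topic. Tiers and base rate are illustrative, not from
# the patent.

def publisher_reward(monthly_searches, base_rate=0.01):
    """Higher reward for underserved topics in high demand,
    lower reward for topics with less demand."""
    if monthly_searches >= 100_000:
        multiplier = 3.0   # high demand
    elif monthly_searches >= 10_000:
        multiplier = 1.5   # moderate demand
    else:
        multiplier = 1.0   # low demand
    return monthly_searches * base_rate * multiplier

print(publisher_reward(200_000))  # 6000.0
print(publisher_reward(500))      # 5.0
```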
The content generators 220, 230, 240 can also include user contribution sites 230. User contribution sites such as, for example, a wiki site, enable a broad range of users to create and publish content. User contribution sites 230 can create, for example, stub articles 235 based upon suggestions from users. Stub articles 235 can operate, for example, to invite additional contribution from other users that might have knowledge about a subject outside of the knowledge possessed by those creating the stub articles 235. In some implementations, the topic distribution engine 210 can provide article suggestions to user contribution sites 230, based upon the statistics collection and analysis engine 110. The user contribution sites 230 can then generate stub articles 235 based on the article suggestions for inclusion in the user contribution sites 230. Inclusion of a stub article 235 can also operate to notify searchers of inadequate content at a similar frequency to the frequency with which a topic is searched.
The content generators 220, 230, 240 can also include automated content generators 240. Automated content generators, for example, can operate to provide an aggregation 245 of content from multiple sites into a single page. An automated content generator 240, for example, can copy content from multiple sites and generate a single document that includes the copied content. In one implementation, the automated content generator 240 can be configured to copy content only from specified sites. This can enable the automated content generator 240 to copy content only from sites/users with whom the automated content generator 240 has a license. The automated content generator 240 can also provide, for example, an aggregation 245 of links to content related to a particular topic. The automated content generator 240 can be combined with web publishers 220 or user contribution sites 230 to provide stub information for the creation of new content.
In some implementations, the statistics collection and analysis engine 110 can determine content quality across multiple languages. In such implementations, the topic distribution engine 210 can provide the quality associated with results from various languages. The quality results can indicate to content generators 220, 230, 240 that quality of the topic corpus is poor in a particular language, while being adequate in other languages. Such information can be used by content generators 220, 230, 240 to generate additional content in the particular language where the topic corpus is determined to be of poor quality.
The topic distribution engine 210 can also provide topic suggestions through a variety of interfaces. For example, the topic distribution engine 210 can provide a list of topics warranting additional documents using a web interface to information providers such as Wikipedia.
At stage 710, quality associated with a topic is derived. The quality associated with a topic can be derived, for example, by an analysis engine (e.g., statistics collection and analysis engine 110 of FIG. 1). The quality can be derived, for example, based on an aggregation of relevance and ranking associated with documents that satisfy search queries related to the topic. In other examples, quality can be derived by, for example, a click through rate on the search results versus a refinement rate. Other information to derive quality associated with a topic corpus can also be used.
At stage 720, the quality of a topic corpus associated with a topic is compared to a search volume associated with the topic. The comparison can be performed, for example, by an analysis engine (e.g., statistics collection and analysis engine 110 of FIG. 1). The comparison can be made based on corpus quality, for example, in relationship to other topics with similar search volumes (e.g., what corpus quality would be expected based on a given search volume). Alternatively, the comparison can be made based on search volumes, for example, in relationship to other topics with similar corpus quality (e.g., what search volume would be expected based on a given corpus quality).
At stage 730, a determination is made whether the search volume outweighs the topic corpus. This determination can be made, for example, by an analysis engine (e.g., statistics collection and analysis engine 110 of FIG. 1, using a comparator, for example). The search volume can be determined to outweigh the topic corpus, for example, based upon the comparison of the corpus quality (e.g., quality index) to the search volume (e.g., popularity). Other ways to determine whether the search volume outweighs the topic corpus can be used.
If the search volume outweighs the content, the topic is marked as underserved and is indexed at stage 740. The topic can be marked and indexed, for example, by an analysis engine (e.g., statistics collection and analysis engine 110 of FIG. 1). In some examples, the topic is marked as underserved, thereby providing for inclusion of the topic within an underserved topics search index. The topic can be indexed, for example, to note a degree to which the topic is underserved by the content (e.g., based on quality of the content associated with the topic). The topic can be alternatively or additionally identified to note a degree to which the topic is in demand (e.g., based on search volumes associated with the topic).
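Stages 710–740 read as a simple pipeline, which can be sketched end to end. Everything below is an assumption-laden illustration: quality is derived from click-through versus refinement rate (one of the options the patent names for stage 710), the "expected quality for this search volume" comparison of stage 720 is faked with a log curve, and underserved topics are indexed by the size of the shortfall (stage 740):

```python
# Hypothetical sketch of stages 710-740. Topic names, tuple layout
# (clicks, refinements, monthly volume) and the log-curve expectation
# are all invented for illustration.
import math

def derive_quality(clicks, refinements):
    """Stage 710: click-through rate versus refinement rate as a
    quality proxy for the topic corpus."""
    total = clicks + refinements
    return clicks / total if total else 0.0

def expected_quality(search_volume):
    """Stage 720: stand-in for 'what corpus quality would be expected
    based on a given search volume' among similar topics."""
    return min(1.0, 0.1 * math.log10(search_volume + 1))

def index_underserved(topics):
    """Stages 730-740: mark topics whose demand outweighs their content
    and index them by the degree to which they are underserved."""
    index = {}
    for name, (clicks, refinements, volume) in topics.items():
        shortfall = expected_quality(volume) - derive_quality(clicks, refinements)
        if shortfall > 0:  # search volume outweighs the topic corpus
            index[name] = shortfall
    # most underserved topics first
    return dict(sorted(index.items(), key=lambda kv: -kv[1]))

demand_heavy = {
    "rare disease symptoms": (50, 450, 90_000),  # many refinements, real demand
    "cat videos": (900, 100, 1_000_000),         # already well served
}
underserved = index_underserved(demand_heavy)
print(underserved)
```

Ranking the index by shortfall mirrors the patent's point that the topic is indexed "to note a degree to which the topic is underserved", not just flagged with a boolean.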