The NY Times has an article today discussing how Google and other search engines are trying to index the “deep web” – databases and other content that were previously invisible.
Last year Google started entering random keywords into millions of search forms on the web to try to expose the databases of results that lie behind them.
To extract meaningful data from the Deep Web, search engines have to analyze users’ search terms and figure out how to broker those queries to particular databases. For example, if a user types in “Rembrandt,” the search engine needs to know which databases are most likely to contain information about art (say, museum catalogs or auction houses), and what kinds of queries those databases will accept.
That approach may sound straightforward in theory, but in practice the vast variety of database structures and possible search terms poses a thorny computational challenge.
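To make the brokering idea concrete, here is a minimal sketch of routing a query to likely databases. Everything in it is hypothetical: the per-database term profiles, the database names, and the `broker` function are illustration only, and real systems would learn these profiles rather than hard-code them.

```python
# Hypothetical term profiles for a few databases. In practice a search
# engine would build these by probing the databases, not by hand.
DB_PROFILES = {
    "museum_catalog": {"rembrandt", "vermeer", "picasso", "oil", "canvas"},
    "auction_house": {"rembrandt", "lot", "bid", "estimate", "provenance"},
    "car_listings": {"sedan", "mileage", "vin", "hatchback"},
}

def broker(query, profiles=DB_PROFILES, top_n=2):
    """Rank databases by how many query terms overlap their profile."""
    terms = set(query.lower().split())
    scores = {db: len(terms & profile) for db, profile in profiles.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Keep only databases with at least one matching term.
    return [db for db, score in ranked[:top_n] if score > 0]

print(broker("Rembrandt self portrait"))
```

A query for “Rembrandt” gets routed to the art-related databases and skips the car listings entirely; the hard part, as the article notes, is doing this across millions of databases with wildly different vocabularies.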
“This is the most interesting data integration problem imaginable,” says Alon Halevy, a former computer science professor at the University of Washington who is now leading a team at Google that is trying to solve the Deep Web conundrum.
Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms – “Rembrandt,” “Picasso,” “Vermeer” and so on – until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains.
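The probing strategy described above can be sketched as a loop: guess likely terms, submit each one to the form, and summarize whatever comes back. This is a toy illustration, not Google’s actual crawler; `submit_form` is a stub standing in for a real HTTP form submission, and the canned results inside it are invented.

```python
from collections import Counter

# Seed guesses for a form that looks art-related.
SEED_TERMS = ["rembrandt", "picasso", "vermeer", "monet"]

def submit_form(term):
    """Stub: pretend to POST the form and return result snippets."""
    fake_db = {
        "rembrandt": ["The Night Watch, oil on canvas, 1642"],
        "vermeer": ["Girl with a Pearl Earring, oil on canvas, 1665"],
    }
    return fake_db.get(term, [])

def probe(seed_terms):
    """Try each seed term and build a crude model of the database:
    a word-frequency count over the result snippets."""
    model = Counter()
    for term in seed_terms:
        for snippet in submit_form(term):
            model.update(snippet.lower().replace(",", "").split())
    return model

model = probe(SEED_TERMS)
print(model.most_common(5))
```

Words like “oil” and “canvas” dominate the counts, which is the kind of signal that tells the crawler this form sits in front of a painting database rather than, say, a parts catalog.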