Update: the Google custom search team has retuned their code and fixed most of the problems I outlined here. Very impressive. Here’s a new post on Google’s tweaking of their engine and my gratitude to them for solving this problem.
Like several thousand other geeks out there, I’ve spent a good chunk of this week playing with the latest toy from Google, Co-op Search. The idea behind this new search tool is an excellent one: let users make their own specialized vertical search engines, showing either results only from a selected subset of sites, or prioritizing the results from those sites while searching the whole catalog. The service has all sorts of geeky bells and whistles – you can upload an OPML file to create a catalog, you can weight sites as being good or bad matches for certain terms, you can wrap the whole thing in AJAX and produce your own pretty, customized results.
My friend Nathan pointed me to the tool when I asked his advice on a question: how do I let users of Global Voices search the thousands of blogs we’ve pointed to in our 18-month existence?
Basically, there are two ways to approach this problem. One is to build your own search engine – decide what sites you want to spider, index them with a tool like KinoSearch and put a CGI interface on your site to let users search. (You can also buy search-in-a-box from companies like Google – the principle is the same: you’re building a custom index of sites you think are important.)
The other approach is to take the output of an existing search engine and filter it, looking only at the sites you’re interested in. Savvy Google users know how to do a search with the “site:” attribute – “ghana site:ethanzuckerman.com” gives you the 309 blog posts on my site that have mentioned Ghana. Yahoo!’s search API lets you restrict a search to up to thirty different domains, a very powerful feature which the folks behind Rollyo – a company that urges you to “roll your own search engine” – have used as the technical backbone of their company.
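For the curious, here’s roughly what a site-restricted query looks like when you build it by hand. This is a minimal sketch, not anyone’s real API: site_query and max_domains are names I’ve made up, the cap of 30 simply mirrors the Yahoo! figure above, and I’m assuming the engine accepts several “site:” terms joined with OR inside parentheses.

```python
def site_query(terms, domains, max_domains=30):
    """Build a query string restricted to the given domains.

    max_domains is a hypothetical cap, set to 30 only to mirror the
    limit mentioned in the text; it is not read from any real API.
    """
    domains = list(domains)
    if not domains:
        raise ValueError("need at least one domain")
    if len(domains) > max_domains:
        raise ValueError(
            f"{len(domains)} domains exceeds the limit of {max_domains}"
        )
    sites = " OR ".join(f"site:{d}" for d in domains)
    # A single domain needs no grouping; several are wrapped in parentheses.
    return f"{terms} ({sites})" if len(domains) > 1 else f"{terms} {sites}"


print(site_query("ghana", ["ethanzuckerman.com"]))
# ghana site:ethanzuckerman.com

print(site_query("ghana", ["ethanzuckerman.com", "example.org"]))
# ghana (site:ethanzuckerman.com OR site:example.org)
```

Hand it a list of 3000 blogs and it raises a ValueError, which is the whole problem in miniature.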
But these options don’t work well when you want to give your users the ability to search on thousands of blogs.
Enter Google Coop Search. You can design a search engine that searches across up to 5000 different domains, orders of magnitude more than Yahoo! allowed you to search. (Some good reviews of Coop Search, especially if you’re looking for a more positive review than this one…)
Fantastic – I fired one up immediately, dropped in OPML files from about half of the Global Voices regional editors and had, within half an hour, a search engine that searches almost 3000 global weblogs.
Unfortunately, it doesn’t search them very well. More specifically, the precision is high, but the recall sucks. (Information retrieval systems are usually measured in terms of how well they perform on these metrics. “Precision” means “how relevant to your query were the results you got?” “Recall” means “how complete were your results, out of all the available relevant documents?”)
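If you like to see the two metrics as arithmetic, here’s a minimal sketch using the standard definitions: precision is the share of returned results that are relevant, recall is the share of relevant documents that got returned. The function name and the numbers are mine, made up purely for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: ids of the documents the engine gave back
    relevant: ids of every document that actually matches the query
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up numbers: the engine returns 3 posts, all of them relevant,
# but 750 relevant posts exist on the chosen blogs.
returned = {"post-1", "post-2", "post-3"}
relevant = {f"post-{i}" for i in range(1, 751)}
p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.3f}")
# precision=1.00 recall=0.004
```

Every result is a good one, and you’re still missing nearly everything.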
Search for “ghana” on our little search engine – you get three results: one from Koranteng’s Toli, one from Timbuktu Chronicles and one from my blog. The results from Koranteng and Emeka are good matches for the search – the one from my blog is curiously bad. But what’s really weird is how few there are – as we saw above, a “site:ethanzuckerman.com” search for “ghana” gives you 309 results. You’ll get 234 on Koranteng’s site and 212 on Emeka’s TChron site – so why aren’t we getting 750+ results from our engine?
A little poking solves the mystery pretty quickly. Google Coop Search works by searching against the main Google search catalog, retrieving 1000 results and filtering them against the sites you’ve included in your catalog. This makes sense, computationally – these searches are fast, almost as fast as normal Google searches. Rather than conducting 3000 “site:” searches and collating and reranking the results, Google is sacrificing recall, getting 1000 results and discarding those not in your set of chosen sites, which requires one call to the index and a really big regular expression match.
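Here’s a toy simulation of that retrieve-then-filter behavior, as I understand it from the outside. It’s a sketch under assumptions: post_filter, the example domains and the rankings are all invented, and I’m using a set lookup on the hostname where Google presumably uses something closer to the big regular expression match.

```python
from urllib.parse import urlparse


def post_filter(ranked_urls, allowed_domains, cutoff=1000):
    """Keep results from allowed_domains that appear in the top `cutoff` hits."""
    allowed = {d.lower() for d in allowed_domains}
    kept = []
    for url in ranked_urls[:cutoff]:
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host in allowed:
            kept.append(url)
    return kept


# Invented ranking: 3000 results, where every 300th one comes from a
# small blog and the rest come from a big, high-ranking site.
ranked = [
    f"http://{'smallblog.example' if i % 300 == 0 else 'bigsite.example'}/post/{i}"
    for i in range(1, 3001)
]
mine = {"smallblog.example"}

print(len(post_filter(ranked, mine)))                      # 3
print(len(post_filter(ranked, mine, cutoff=len(ranked))))  # 10
```

The small blog has ten matching posts in this made-up index, but only the three that rank inside the first 1000 survive the filter.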
Search for “Ghana” on Google, preferably with the number of results per page set to 100. After 300 or so results, you’ll find the Koranteng post our little search engine calls up; at about result 600, you’ll find the Timbuktu Chronicles post on Wireless Ghana. (The result on my site is around number 900, which Google won’t let me see with an ordinary search.)
In other words, the little engine I’ve built is useful only if the sites I’ve chosen are relatively high ranking and authoritative sites on the topics I’m searching on. If I make a search engine of sumo commentary sites and search for “Asashoryu”, the results will be quite good, as those sites probably have several dozen pages that are top matches for the big man. Try it on our engine and you get four results (three from my site…) Alternatively, pick topics where our bloggers are relatively authoritative, and you’ll get better results – try “blogger block”, for instance, and you’ll get 35 sites, either on the blocking of Blogger.com in some countries, or the dreaded disease that seems to strike some bloggers (though not me, so far…)
This doesn’t mean that Coop Search is broken – just that it’s broken for my purposes. Folks will develop lots of interesting search engines, I suspect, using sets of sites that are consistently good matches for the terms they encourage people to search, like my sumo example. But Coop Search isn’t a good solution for authoritative searches on a large set of relatively unpopular blogs, unless one or more of those blogs happen to be very authoritative on the terms you choose. (I could also solve this problem almost immediately by telling Google not to search only my 3000 blogs, but just to prioritize them in the results. But that wasn’t the goal of my experiment.)
I’d originally thought that Google might be using Coop Search as a way to identify collections of URLs they might want to spider more deeply – for instance, if I identify 20 great sumo sites, Google might want to visit them more often, or increase their relevancy for searches on sumo. And perhaps they’ll figure out a way to do this without opening themselves up to a huge new vector for spammers to promote their sites. But I suspect the truth is that they saw a way to leap ahead of Yahoo! (destroying Rollyo in the process) and offer a tool that’s going to be great fun for 80% of the people who use it. Unfortunately, the 20% of us who are trying to use Coop Search so we don’t need to go buy our own Google Search Appliance are probably still out of luck.