I’ve been playing with PubSub for a few months now, ever since Jay Rosen told me that he was using it as a complement to Technorati and Blogpulse to find folks linking to his posts. He mentioned that PubSub was finding links to his posts that the other engines didn’t find… and he’s right – I’m seeing a lot of links through PubSub that I don’t get from the other two, especially from Livejournal, which PubSub seems to cover more closely than the other engines.
(In 1998, Steve Lawrence and Lee Giles wrote a cool paper – “Searching the World Wide Web” – which compared the results of various search engine catalogs to estimate their respective sizes as well as the size of the indexable web. It would be great fun to do the same with blog search engines… except you’d be indexing the pinging blogosphere, which I’m increasingly convinced is substantially smaller than the entire blogosphere. But that’s another post entirely.)
I got to meet Salim Ismail – the smart and charming CEO of PubSub – in Paris a few weeks back, and he joined Rebecca and me for a meeting at Berkman yesterday. (How can you not like a man whose blog is titled “You’ve Got Ismail”? He claims he’s not responsible for choosing the name…)
Salim’s pitch for PubSub to the larger Berkman group is that it’s about “prospective, not retrospective search”. It’s a good soundbite, but it’s taken me a while to unpack what it actually means. Most search engines build catalogs by spidering pages – they retrieve pages published on the web, follow links on those pages to find other pages, and build catalogs of these pages, which users then search. In the bad old days, search engines would sometimes take months to discover new pages – when I worked for Tripod, our acquirer, Lycos, took well over a month to build a new catalog. New homepages created in the interim had to wait weeks to get indexed. Engines like Google do partial catalog builds every day, and pages that get updated frequently (like blogs) can get spidered dozens of times a day. But fundamentally, existing search engines are about pulling down content, creating an indexed catalog and letting you search that index.
PubSub works on a different model. There’s no catalog to search. Instead, PubSub asks you to subscribe to a query and updates you on pieces of content added to the web that match your query. It does this by monitoring as many feeds as it can get access to, including blogs, EDGAR filings, Usenet newsgroups – basically, anything that has an RSS or Atom feed. According to the site, 1481 new items from 21 million sources pass through the system every minute (corrected 1/12/2006; I originally wrote “every second”) – the system then needs to match each of these new items against all the subscriptions that users have registered. (It’s unclear how many subscriptions there are at present, but even with a few hundred thousand, that’s a LOT of computation handled quite quickly.)
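To make the subscribe-then-match idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the Matcher and Subscription names, the all-terms-must-appear matching rule, and the whitespace tokenizing are made up, and none of it reflects how PubSub actually implements its engine. The only point is that the queries are what get stored, and each new item is checked against them as it arrives.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    """A stored query: every term must appear in an item for it to match."""
    subscriber: str
    terms: frozenset


@dataclass
class Matcher:
    """Hypothetical prospective matcher: it stores queries, not documents."""
    subscriptions: list = field(default_factory=list)

    def subscribe(self, subscriber, query):
        # Register a persistent query made of lowercase, whitespace-split terms.
        terms = frozenset(query.lower().split())
        self.subscriptions.append(Subscription(subscriber, terms))

    def publish(self, text):
        # Check one new feed item against every stored query and
        # return the subscribers whose terms all appear in the item.
        words = set(text.lower().split())
        return [s.subscriber for s in self.subscriptions if s.terms <= words]


matcher = Matcher()
matcher.subscribe("alice", "kibaki")
matcher.subscribe("bob", "kenya election")
print(matcher.publish("Kibaki addresses parliament in Nairobi"))  # ['alice']
```

This naive version loops over every subscription for every item, so the work grows with items times subscriptions. A real system at the volumes quoted above would presumably need something much cleverer than a linear scan.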
Talking with Salim yesterday, one interesting implication of this model became clear – it’s pretty easy to create specialized catalogs of weblogs that can be monitored or “prospectively” searched on. PubSub currently has a couple of “Community Lists”, which rank the comparative popularity of all legal, PR, librarian and fashion blogs. It’s not hard to imagine doing a list of all Kenyan blogs, for instance, and tracking comparative popularity. Or making it possible to search for “Kibaki” on Kenyan blogs and get posts from that blogosphere that mention this term. I’m not convinced that setting up a competitive dynamic in local blogospheres is the healthiest project to get involved with, but I’m simultaneously intrigued to see whether the most popular African blogger is more popular than the most popular librarian.
(I don’t mean to suggest that Blogpulse or Technorati couldn’t implement the ability to search against custom catalogs. I’ve suggested to both teams in the past that they offer the feature of searching against a catalog of “most trafficked” blogs… which would allow me to do all sorts of fun research comparing linking and attention structure between all blogs and “A-List” blogs. But these searches on a catalog-based engine probably involve doing a keyword search and then filtering the results to choose only blogs represented in the catalog – kinda like searching for a term on Technorati, then sorting results by the “authority” of the bloggers creating a post. On PubSub, searching against a custom catalog just means adding an AND to a stored request (“kibaki AND in the set of Kenya blogs”) – my guess is it’s somewhat easier to run these searches, which is why PubSub is featuring this functionality. But hey, this is a blog post and I suspect that if I’m wrong, someone will let me know. Probably rather loudly.)
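A rough sketch of that guess, with loud caveats: the function names, the item layout, and the blog hostnames below are all made up for illustration, and neither function is claimed to be what PubSub, Technorati or Blogpulse actually run. It only shows the two shapes side by side. In the stored-query version, the community list is one more condition checked as each item arrives. In the catalog version, a keyword search runs first and the hits are filtered afterwards.

```python
def words(text):
    """Lowercase, whitespace-split set of words in a post."""
    return set(text.lower().split())


def prospective_match(item, terms, community=None):
    # Stored-query style: the community restriction is just another
    # AND clause evaluated against each incoming item.
    if community is not None and item["source"] not in community:
        return False
    return terms <= words(item["text"])


def retrospective_search(catalog, terms, community):
    # Catalog style: keyword search over everything already stored,
    # then filter the hits down to blogs in the community list.
    hits = [item for item in catalog if terms <= words(item["text"])]
    return [item for item in hits if item["source"] in community]


# Hypothetical community list and posts, invented for this example.
KENYA_BLOGS = {"blog-a.example", "blog-b.example"}
items = [
    {"source": "blog-a.example", "text": "Kibaki names new cabinet"},
    {"source": "other.example", "text": "Kibaki visit makes headlines"},
]
query = {"kibaki"}

streamed = [i for i in items if prospective_match(i, query, KENYA_BLOGS)]
searched = retrospective_search(items, query, KENYA_BLOGS)
print([i["source"] for i in streamed])  # ['blog-a.example']
print(streamed == searched)  # True
```

Both routes return the same posts. The difference is only in when the community check happens, which is the part I'm guessing makes it cheaper on a matching engine.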
Tech aside, I wonder whether detailed, ranked catalogs of country blogs would do more to get people interested in Tunisian or Cambodian blogs than the open and chaotic index we currently maintain on the Global Voices wiki, or through the BlogAfrica aggregator. Is it a good idea to encourage the Jordanian blogosphere to compete to see who’s best linked? Does it help non-Jordanians decide which blogs to check out first from the region? Are other local blogospheres as competitive as the Indian blogosphere? And is this competition a good thing or a bad thing?
Anyway, I’m looking forward to learning more about PubSub to see whether it might be another tool we can use at Global Voices to help get people excited about different local blogospheres.
Ethan, thanks for the nice comments. We’ve heard quite a few people say they receive items from us that don’t appear elsewhere. This is always surprising, especially since we share our pings (via the Feedmesh) with many other aggregators. It merits some investigation to figure out why this is the case.
Small correction to your post: you mention 1481 items per second, but the stat is actually per minute. We see up to 30-50 updates a second coming over the transom. However, you are quite correct in saying that’s a “LOT of computation”. We’ve got over a million subscriptions (persistent queries) in our system, so with 2-3 million blog updates per day, we’re doing trillions of matches per day on our machines. Our core innovation is a matching engine that reconciles new information events against stored queries at a rate of 3 billion matches per second.
The best analogy we’ve found to describe the prospective/retrospective distinction is as follows: Retrospective search is like putting information into a bucket and letting users fish around in it. Prospective search is like putting a filter in a hose and catching information as it flies by. Hence we have a matching engine as opposed to a search engine.
Cheers,
Salim Ismail
co-founder
Do you think that an Afrodollar would help the continent’s economic situation? Could its creation give the continent greater leverage versus the rest of the world? Or is its creation an impossibility, considering all the problems on the continent?