February 12, 2005

The Incremental Web

by skrenta at 6:16 AM

We see a gigantic hole -- an opportunity -- in online search. Search has become the dominant navigational paradigm for goal-directed reference queries. But search is a poor way to stream new developments around a topic.

Reference Web vs. the Incremental Web

Google searches the reference Internet. Users come to Google with a specific query and search a vast corpus of largely static information. This is a very valuable and lucrative service to provide: it's the Yellow Pages.

Blogs may look like regular HTML pages, but the key difference is that they're organized chronologically. New posts appear at the top, so with a single browser reload you can say "Just show me what's new."

This seems like a trivial difference, but it drives an entirely different delivery, advertising and value chain. Rather than HTML, the delivery format for web pages, there is a desire for a new, feed-centric format: RSS. For searching chronologically ordered content, a relevance-based search like Google's, which destroys the chronology, is inappropriate. Instead you want Feedster, PubSub or Technorati. Feed content may be better to read in a different sort of client, such as Newsgator, rather than a web browser.
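To make "just show me what's new" concrete, here is a minimal sketch of what a feed client does with an RSS 2.0 document: parse the items, keep the ones published after the last visit, and list them newest first. The sample feed, the example.com links and the function name new_items are all made up for illustration; a real reader would fetch the feed over the network and cope with missing or malformed dates.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical feed content, inlined so the sketch runs without a network.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>Older post</title>
      <link>http://example.com/older</link>
      <pubDate>Thu, 10 Feb 2005 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Newest post</title>
      <link>http://example.com/newest</link>
      <pubDate>Sat, 12 Feb 2005 06:16:00 GMT</pubDate>
    </item>
    <item>
      <title>Middle post</title>
      <link>http://example.com/middle</link>
      <pubDate>Fri, 11 Feb 2005 12:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def new_items(rss_text, last_seen):
    """Return (published, title, link) for items newer than last_seen, newest first."""
    root = ET.fromstring(rss_text)
    fresh = []
    for item in root.iter("item"):
        # RSS 2.0 dates use the RFC 822 style that email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        if published > last_seen:
            fresh.append((published, item.findtext("title"), item.findtext("link")))
    # Chronology is the whole point: order by publication time, not by relevance.
    fresh.sort(key=lambda entry: entry[0], reverse=True)
    return fresh


if __name__ == "__main__":
    last_visit = datetime(2005, 2, 11, tzinfo=timezone.utc)
    for published, title, link in new_items(SAMPLE_RSS, last_visit):
        print(published.isoformat(), title, link)
```

Nothing here ranks by relevance. The only questions asked are "is it newer than what I've seen?" and "what order did it arrive in?", which is exactly what a relevance-ranked search result throws away.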

And finally, there is a different advertising opportunity. Rather than the sort of business ads you see in the Yellow Pages, the ad opportunity is more about reaching a particular demographic or subscriber group: the kind of ads that run in magazines. How do you keyword-target a breakfast cereal advertisement to fitness-conscious 21-to-25-year-olds? You can't. You need to find something those people are reading, and put your ad there.

Reference Web    Incremental Web, subject feeds
Amazon           NY Times
Google           Google News
StreetPrices     BuzzMachine
IMDB             Smoking Gun
Cars.com         Autoblog

While there's been considerable deployment of goal-directed services, there has been little technology development around automated aggregation of relevant topic streams. Until now, this hasn't been a problem. Most of the growth on the web over the past 10 years has been in reference services. But now we're seeing an explosion in the number of sources publishing new incremental content every day. Blogs, certainly, but other sources too: news organizations, companies, and our increasingly web-enabled governments are all pumping out gigabits of fresh news online. There is a vast proliferation of new incremental content underway.

It's not appropriate to try to stream this incremental info with keyword searches. It just doesn't work. Say you want a feed of interesting news about Google. A while back I posted something on this blog about Google which you'd probably want to see in such a feed. But the rest of the articles here are not about Google. So you don't want to subscribe to blog.topix if you just want news about Google. But a keyword search for "google" isn't going to deliver a useful experience either -- there are far too many stray mentions of "google" on the web every day. To get a relevant news feed about Google, you either have to have people read everything for you and edit away all the junk, or find an algorithmic technique to do the same.
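Here is a toy illustration of the gap between "mentions Google" and "is about Google". The posts, the scoring function and the 0.1 cutoff are all invented for this sketch; it is a crude stand-in for the idea of an algorithmic filter, not a description of how any real service decides what a story is about.

```python
import re

# Made-up posts: one is about Google, one mentions it in passing, one not at all.
POSTS = [
    {"title": "Google files new patent",
     "body": "Google filed a patent on ranking news. Google says the filing is routine."},
    {"title": "My weekend hike",
     "body": "I had to google the trail map before we left, then we walked for six hours."},
    {"title": "Cereal review",
     "body": "Crunchy, not too sweet."},
]


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def mentions(post, term):
    """Naive keyword search: any stray mention counts as a hit."""
    return term in tokens(post["title"] + " " + post["body"])


def aboutness(post, term, title_bonus=0.5):
    """Crude topical score: share of body words that are the term, plus a title bonus."""
    body = tokens(post["body"])
    density = body.count(term) / max(len(body), 1)
    bonus = title_bonus if term in tokens(post["title"]) else 0.0
    return density + bonus


if __name__ == "__main__":
    keyword_hits = [p["title"] for p in POSTS if mentions(p, "google")]
    topical_hits = [p["title"] for p in POSTS if aboutness(p, "google") > 0.1]
    print("keyword search:", keyword_hits)
    print("topical filter:", topical_hits)
```

The keyword search returns the hiking post alongside the patent story; the scored filter keeps only the story that is actually about the company. Real systems need far more than word density, but the shape of the problem is the same.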

Human-powered techniques work well when the collection to be scanned is small, or if you're trying to cover a handful of subjects. But in the near future, when there are 100X or 1000X as many posts per day in the blogosphere as there are now, humans won't be able to keep up. Interesting posts in out-of-the-way places like this weblog won't be found in a timely manner, or perhaps not at all. Navigational needs and discovery methods change when you add zeroes to the end of the number of things you're looking through.

This mirrors the evolution of navigation on the web itself.

proto web: bookmarks
small web: editorial directories, e.g. Yahoo
big web: algorithms, e.g. Google

For a small web (10-30M pages), editorial guides like Yahoo's original directory worked great. But when the web grew to 300M pages, the 50-200 editors couldn't keep up anymore. And when it grew to 10 billion pages, even thousands of editors at a directory like the ODP couldn't keep pace. At that point you need algorithms to scale; you need Google.

An analogous transition will occur for webfeed content.


Relevance of new information = freshness X personal context.

PageRank doesn't work for incremental data. News by definition is new, and links take time to accrue. So if you're waiting for the web to vote up a new piece of information before you'll see it, you'll lag behind other news services that can recognize important information the instant it's published. Relevance for a news item is about the importance of the event, the timeliness of notification, and relevance to a topic. This personal context is hard to derive by keyword.

Example: Company goes public -- interesting if you work there, own stock in it, follow the industry of that company, buy the product or live in the town where the company is located. Keywords will find the company name, but maybe not the town, or the industry.
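One way to read "freshness X personal context" as arithmetic is sketched below. Everything specific in it is an assumption made for illustration: the exponential decay with a six-hour half-life, the idea that a story arrives already tagged with its company, town and industry, the overlap measure, and the company name "Acme Homes". The point is only that the score is a product, so a stale story or an irrelevant one both fall toward zero.

```python
def freshness(age_hours, half_life_hours=6.0):
    """Assumed decay: a story loses half its freshness every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)


def personal_context(item_tags, reader_interests):
    """Assumed measure: fraction of the story's tags the reader cares about."""
    if not item_tags:
        return 0.0
    return len(item_tags & reader_interests) / len(item_tags)


def relevance(item, reader_interests):
    """Relevance of new information = freshness x personal context."""
    return freshness(item["age_hours"]) * personal_context(item["tags"], reader_interests)


# Hypothetical story: a company goes public. The tags are assumed to come from
# some upstream step; the article text itself may only name the company.
IPO_STORY = {
    "title": "Acme Homes goes public",
    "age_hours": 0.0,
    "tags": {"company:Acme Homes", "town:Minot, ND", "industry:mobile homes"},
}

if __name__ == "__main__":
    townsperson = {"town:Minot, ND"}             # lives where the company is based
    industry_watcher = {"industry:mobile homes", "company:Acme Homes"}
    bystander = {"town:Austin, TX"}
    for name, interests in [("townsperson", townsperson),
                            ("industry watcher", industry_watcher),
                            ("bystander", bystander)]:
        print(name, round(relevance(IPO_STORY, interests), 3))
```

A keyword match on the company name would reach the industry watcher but miss the townsperson entirely, since the town may never appear in the story. Matching on context reaches both, and the freshness factor ensures the same story fades once it is no longer news.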

Scaling to the Long Tail

This is what we do at Topix.net. A way to think of us is as a purveyor of 150,000 mailing lists, each focused on a location or a topic, all updated from the broadest variety of relevant sources on the net. We are also finding that an audience aggregated by topic in this way is very valuable.

Folks like Jason Calacanis and Nick Denton are doing this with human labor. Car news from Autoblog, or Jason's cancer blog, or Nick's cool gadget blog. These are great sites, and I am convinced that Jason and Nick have figured out the future of publishing and are both going to be hugely successful. A computer-generated product will never replace high-end editorial sites like these.

But they don't have to. Search may be a winner-take-all market, but news isn't. I don't get all of my news from a single source, and neither do you. For comprehensiveness, algorithmic techniques will have to come into play. People-powered systems just don't scale to the long tail. So we are leveraging computers to stream news, not for just 10s or 100s of topics, but for every subject. Mobile home manufacturing. Minot, ND. 5,000 sports teams. 6,000 public companies. Every disease. Every celebrity. And so on... 150,000 topics, updated every 30 minutes 24/7, from every publisher in our crawl.

There are 4-8 million active blogs now. At this size, you can still "know" the top bloggers, and find new posts worth reading by clicking around. But when the blogosphere grows 100X or 1000X, the current discovery model will break down. You'll need algorithmic techniques like those of Topix.net or Findory to channel the most relevant material from the constant flood of new content.