February 12, 2005
The Incremental Web
by at 6:16 AM
We see a gigantic hole -- an opportunity -- in online search. Search has become the dominant navigational paradigm for goal-directed reference queries. But search is a poor way to stream new developments around a topic.
Reference Web vs. the Incremental Web
Google searches the reference Internet. Users come to Google with a specific query and search a vast corpus of largely static information. This is a very valuable and lucrative service to provide: it's the Yellow Pages.
Blogs may look like regular HTML pages, but the key difference is that they're organized chronologically. New posts appear at the top, so with a single browser reload you can say "Just show me what's new."
This seems like a trivial difference, but it drives an entirely different delivery, advertising and value chain. Rather than HTML, the format in which web pages are delivered, there is a desire for a new, feed-centric format: RSS. For searching chronologically ordered content, a relevance-ranked engine such as Google is inappropriate, because its ranking destroys the chronology. Instead you want Feedster, PubSub or Technorati. Feed content may also be better read in a different sort of client, such as Newsgator, rather than in a web browser.
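To make "just show me what's new" concrete, here is a minimal sketch of reading a feed chronologically. It assumes an RSS 2.0 document with a `pubDate` on every item; the sample feed, its titles and links, and the function names `read_items` and `whats_new` are invented for illustration and are not taken from any particular feed reader.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 document; titles, links and dates are made up.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <item>
      <title>Older post</title>
      <link>http://example.com/older</link>
      <pubDate>Thu, 10 Feb 2005 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Newest post</title>
      <link>http://example.com/newest</link>
      <pubDate>Sat, 12 Feb 2005 06:16:00 GMT</pubDate>
    </item>
    <item>
      <title>Middle post</title>
      <link>http://example.com/middle</link>
      <pubDate>Fri, 11 Feb 2005 12:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return (published, title, link) tuples, newest first."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        # RSS dates use the RFC 822 style, which email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        items.append((published, item.findtext("title"), item.findtext("link")))
    return sorted(items, key=lambda entry: entry[0], reverse=True)


def whats_new(feed_xml, last_seen):
    """Keep only the items published after the reader's last visit."""
    return [entry for entry in read_items(feed_xml) if entry[0] > last_seen]


if __name__ == "__main__":
    last_visit = datetime(2005, 2, 11, 0, 0, tzinfo=timezone.utc)
    for published, title, link in whats_new(SAMPLE_FEED, last_visit):
        print(published.isoformat(), title, link)
```

The point of the sketch is that ordering and filtering key off the publication date alone. No relevance ranking is involved, which is exactly what a feed client needs and what a reference search engine does not provide.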
And finally, there is a different advertising opportunity. Rather than the sort of business ads you see in the Yellow Pages, the ad opportunity is more about reaching a particular demographic or subscriber group. The kind of ads that are in magazines. How do you keyword target a breakfast cereal advertisement to fitness-conscious 21-to-25-year-olds? You can't. You need to find something those people are reading, and put your ad there.
While there's been considerable deployment of goal-directed services, there has been little technology development around automated aggregation of relevant topic streams. Until now, this hasn't been a problem. Most of the growth on the web over the past 10 years has been in reference services. But now we're seeing an explosion in the number of sources publishing new incremental content every day. Blogs certainly, but other sources too: news organizations, companies, and our increasingly web-enabled governments are pumping out gigabits of fresh news online every day. There is a vast proliferation of new incremental content underway.
Reference Web, goal-directed | Incremental Web, subject feeds
Amazon | NY Times
Google | News
StreetPrices | BuzzMachine
IMDB | Smoking Gun
Cars.com | Autoblog
It's not appropriate to try to stream this incremental info with keyword searches. It just doesn't work. Say you want a feed of interesting news about Google. A while back I posted something on this blog about Google which you'd probably want to see in such a feed. But the rest of the articles here are not about Google. So you don't want to subscribe to blog.topix if you just want news about Google. But a keyword search for "google" isn't going to deliver a useful experience either -- there are far too many stray mentions of "google" on the web every day. To get a relevant news feed about Google, you either have to have people read everything for you and edit away all the junk, or find an algorithmic technique to do the same.
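The gap between "mentions Google" and "is about Google" can be sketched in a few lines. This is a toy heuristic, assuming that a title mention and a high density of the term in the body are reasonable signals of topicality; the two sample posts, the 0.5 title bonus and the 0.1 threshold are all invented for illustration and are not a description of how any real engine works.

```python
import re

# Two hypothetical posts: one about Google, one with a stray mention.
ABOUT_GOOGLE = {
    "title": "Google files for IPO",
    "body": "Google filed its S-1 today. The Google offering will use an auction.",
}
STRAY_MENTION = {
    "title": "Weekend hiking trip",
    "body": ("We got lost on the trail, so I had to google the route on my "
             "phone before we found the car again."),
}


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_match(post, term):
    """Naive filter: accept any post that mentions the term at all."""
    return term in words(post["title"] + " " + post["body"])


def aboutness(post, term):
    """Toy score: share of body words equal to the term, plus a title bonus."""
    body = words(post["body"])
    density = body.count(term) / len(body) if body else 0.0
    title_bonus = 0.5 if term in words(post["title"]) else 0.0
    return density + title_bonus


def topical_match(post, term, threshold=0.1):
    """Accept a post only when the toy score clears the (assumed) threshold."""
    return aboutness(post, term) >= threshold


if __name__ == "__main__":
    for post in (ABOUT_GOOGLE, STRAY_MENTION):
        print(post["title"], keyword_match(post, "google"), topical_match(post, "google"))
```

Both posts pass the keyword filter, but only the first clears the topical threshold. A real system would need far richer signals than term density; the sketch only shows why a bare keyword match lets the stray mentions through.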
Human-powered techniques work well when the collection to be scanned is small, or if you're trying to cover a handful of subjects. But in the near future, when there are 100X or 1000X as many posts per day in the blogosphere as there are now, humans won't be able to keep up. Interesting posts in out-of-the-way places like this weblog won't be found in a timely manner, or perhaps not at all. Navigational needs and discovery methods change when you add zeroes to the end of the number of things you're looking through.
This mirrors the evolution of navigation on the web itself.
For a small web (10-30M pages), editorial guides like Yahoo's original directory worked great. But when the web grew to 300M pages, the 50-200 editors couldn't keep up anymore. And when it grew to 10 billion pages, even thousands of editors at a directory like the ODP couldn't scale. At that point you need algorithms to scale: you need Google.
proto web: bookmarks
small web: editorial directories, e.g. Yahoo
big web: algorithms, e.g. Google
An analogous transition will occur for webfeed content.
Relevance
Relevance of new information = freshness X personal context.
PageRank doesn't work for incremental data. News by definition is new, and links take time to accrue. So if you're waiting for the web to vote up a new piece of information before you'll see it, you'll lag behind other news services that can recognize important information the instant it's published. Relevance for a news item is about the importance of the event, the timeliness of notification, and relevance to a topic. This personal context is hard to derive by keyword.
Example: Company goes public -- interesting if you work there, own stock in it, follow the industry of that company, buy the product or live in the town where the company is located. Keywords will find the company name, but maybe not the town, or the industry.
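One way to read "freshness X personal context" is as a literal product of two scores. The sketch below is only an illustration of that reading, under several assumptions that are not from this post: freshness is modeled as exponential decay with a six-hour half-life, stories carry structured tags (company, town, industry), and personal context is the share of a story's tags that overlap a reader's interests. The company name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def freshness(published, now, half_life_hours=6.0):
    """Assumed decay model: the score halves every half_life_hours."""
    age_hours = max((now - published).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)


def personal_context(story_tags, reader_interests):
    """Share of the story's tags that the reader cares about (0.0 to 1.0)."""
    if not story_tags:
        return 0.0
    return len(story_tags & reader_interests) / len(story_tags)


def relevance(story, reader_interests, now):
    """Relevance of new information = freshness x personal context."""
    return (freshness(story["published"], now)
            * personal_context(story["tags"], reader_interests))


if __name__ == "__main__":
    now = datetime(2005, 2, 12, 12, 0, tzinfo=timezone.utc)
    ipo_story = {
        "published": now - timedelta(hours=6),
        # "Example Corp" is a made-up company; the tags are illustrative.
        "tags": {"company:Example Corp", "town:Minot, ND",
                 "industry:mobile home manufacturing"},
    }
    resident = {"town:Minot, ND"}          # never typed the company name
    stranger = {"company:Some Other Inc"}  # no connection to the story
    print(relevance(ipo_story, resident, now))
    print(relevance(ipo_story, stranger, now))
```

The resident gets a nonzero score without ever naming the company, because the match happens on the town tag rather than on a keyword. Producing those tags reliably is the hard part, and the sketch simply assumes they exist.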
Scaling to the Long Tail
This is what we do at Topix.net. A way to think of us is as a purveyor of 150,000 mailing lists, each focused on a location or a topic, and all updated from the broadest variety of relevant sources on the net. We are also finding that an audience aggregated by topic in this way is very valuable.
Folks like Jason Calacanis and Nick Denton are doing this with human labor. Car news from Autoblog, or Jason's cancer blog, or Nick's cool gadget blog. These are great sites, and I am convinced that Jason and Nick have figured out the future of publishing and are both going to be hugely successful. A computer-generated product will never replace high-end editorial sites like these.
But they don't have to. Search may be a winner-take-all market, but news isn't. I don't get all of my news from a single source, and neither do you. For comprehensiveness, algorithmic techniques will have to come into play. People-powered systems just don't scale to the long tail. So we are leveraging computers to stream news, not just for tens or hundreds of topics, but for every subject. Mobile home manufacturing. Minot, ND. 5,000 sports teams. 6,000 public companies. Every disease. Every celebrity. And so on... 150,000 topics, updated every 30 minutes 24/7, from every publisher in our crawl.
There are 4-8 million active blogs now. At this size, you can still "know" the top bloggers, and find new posts worth reading by clicking around. But when the blogosphere grows 100X or 1000X, the current discovery model will break down. You'll need algorithmic techniques like those of Topix.net or Findory to channel the most relevant material from the constant flood of new content.
