August 1, 2004

Topix.net: The best algorithmic news editing in the business

by skrenta at 2:09 PM

We're launching a new version of Topix.net today, with a next-gen version of our NewsRank story technology. NewsRank powers the relevance, accuracy and magnitude of the stories categorized on Topix.net.

The new front page uses a complex set of semantic story filters to govern news selection. The fully algorithmic editing process takes into account the magnitude of the story, as well as what the story is about, as determined by our AI categorizer from a Knowledge Base of 150,000 topics.

Other improvements also go live onto the site today, including:

  • Full Coverage sections backing up major stories. This lets users drill down on big stories with multiple viewpoints.
  • Accurate article timestamps, based on when the article was actually published rather than how recently it appeared on the web and was fetched by our crawler (this addresses the phenomenon where a day-old story appears on a news aggregator with "8 minutes ago" as the timestamp).
  • Live Feed on the front page. These are raw headlines coming off of our news crawler. No categorization or ranking has been applied, other than profanity and automated QA filtering.
  • Press release coverage has been added to the business sections.
  • Email alerts are available for every Topix.net category.
  • RSS feeds are now available from our search results page, in addition to the 150k subject and location feeds.
  • Up to 7,000 sources in our news crawl.
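
The timestamp item in the list above can be illustrated with a small sketch. It assumes each crawled story carries two hypothetical fields, `published_at` (the time stated in the article, if one could be extracted) and `fetched_at` (when the crawler retrieved it). These names and the fallback rule are assumptions for illustration, not a description of the Topix.net crawler.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def display_time(published_at: Optional[datetime], fetched_at: datetime) -> datetime:
    """Prefer the article's own publication time; fall back to the crawl time.

    A publication time later than the fetch time is treated as unreliable.
    """
    if published_at is not None and published_at <= fetched_at:
        return published_at
    return fetched_at


def age_label(story_time: datetime, now: datetime) -> str:
    """Render a relative age such as '8 minutes ago' or '1 day ago'."""
    def plural(n: int, unit: str) -> str:
        return f"{n} {unit}{'' if n == 1 else 's'} ago"

    minutes = int((now - story_time).total_seconds() // 60)
    if minutes < 60:
        return plural(minutes, "minute")
    hours = minutes // 60
    if hours < 24:
        return plural(hours, "hour")
    return plural(hours // 24, "day")


now = datetime(2004, 8, 1, 14, 9, tzinfo=timezone.utc)
fetched = now - timedelta(minutes=8)      # crawler picked it up 8 minutes ago
published = now - timedelta(hours=26)     # but the article is over a day old

print(age_label(fetched, now))                            # label from crawl time
print(age_label(display_time(published, fetched), now))   # label from article time
```

Labeling by crawl time gives the misleading "8 minutes ago"; labeling by the article's own time gives "1 day ago".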

Our goal was to create a more compelling news experience than the other aggregators and online news sites. Rather than simply averaging together the top stories from major news outlets, our NewsRank engine applies a set of editorial rules to guide the story selection process.

We want to de-homogenize the news selection; instead of averaging down, we want Topix.net to find and bring back the most interesting, compelling (and sometimes the oddest) stories from the deep corners of the web. Stories that won't show up on other sites.


Categorized Aggregation is Hard

Topix.net has an aggregated feed for every ZIP code in the US (and every country in the world), as well as hundreds of thousands of other subjects -- health conditions, sports teams, industries, and so on. How do we do it?

Not with human editing, source tagging, or keyword scanning. The Topix.net NewsRank engine reads each story individually, determining locality and subject information based on the content of the article. NewsRank also condenses 17 dimensions of importance from every story into a single value.
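
This post does not say how the 17 dimensions are combined. One common way to collapse several signals into a single score is a weighted sum, and the sketch below shows only that general idea. The three dimension names and their weights are made up for the example and should not be read as NewsRank's actual inputs or formula.

```python
from typing import Dict

# Hypothetical dimensions and weights, invented for illustration only.
WEIGHTS: Dict[str, float] = {
    "source_count": 0.5,  # how many outlets carry the story
    "freshness": 0.3,     # newer stories score higher
    "prominence": 0.2,    # e.g. placement at the originating source
}


def combined_score(signals: Dict[str, float],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of signals, each expected in [0, 1].

    Signals missing from the input count as 0.
    """
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())


story = {"source_count": 0.8, "freshness": 0.5, "prominence": 0.1}
print(round(combined_score(story), 3))
```

A single number per story makes ranking a simple sort, at the cost of hiding which dimension drove the result.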

Categorizing sources in order to produce topic aggregations doesn't work. Susan Mernit writes a great blog about online media, but she also writes about food and other personal topics. Blindly adding her entries to a food or media industry aggregation would result in inappropriate posts showing up.

Source-based categorization doesn't work for local, either. The San Francisco Chronicle runs stories that aren't about San Francisco. Conversely, there are many stories about events in SF that show up in news sources based outside of San Francisco. These stories would be missed with source-based tagging.

Keyword-driven filters are also a poor solution. Pulling every story out of the news stream with "San Francisco" in it will not make a good SF rollup, but instead will yield a random jumble of posts, most of which merely mention "San Francisco", but overall have nothing to do with it:

... on a business trip to San Francisco, ...
... an unrestricted free agent from San Francisco, ...
... was bound from Alaska to San Francisco in the winter of 1860 ...
... moved, with her family to San Francisco in 1960, ...

The situation is even worse if the keyword is ambiguous ("Kerry", "Bush", "Springfield").
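
To make the keyword problem concrete, here is a minimal sketch of a naive keyword filter run over the four snippets quoted above. The function name `keyword_filter` is just an illustration.

```python
SNIPPETS = [
    "... on a business trip to San Francisco, ...",
    "... an unrestricted free agent from San Francisco, ...",
    "... was bound from Alaska to San Francisco in the winter of 1860 ...",
    "... moved, with her family to San Francisco in 1960, ...",
]


def keyword_filter(texts, keyword):
    """Return every text that contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [t for t in texts if needle in t.lower()]


matches = keyword_filter(SNIPPETS, "San Francisco")
print(len(matches), "of", len(SNIPPETS), "snippets matched")
```

All four snippets pass the filter, even though none of them is really a story about San Francisco. A substring match has no way to tell a passing mention from the subject of the article.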

Our solution is to disambiguate references to people, places and subjects, and match them against our Knowledge Base of 150,000 topics. The result lets our algorithmic story editing technology leverage a much finer-grained idea of what a story is about than simply using the big 7 news categories (US, World, Business, Sci/Tech, Sports, Entertainment, Health). We can bias up Olympics coverage while slighting movie reviews. Some pages on Topix.net are programmed to slightly favor sensational stories, others to de-emphasize the lurid.
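
As a rough illustration of what disambiguating against a knowledge base can mean, the toy sketch below resolves an ambiguous mention by counting how many context words from each candidate topic appear in the story text. This is a simple, well-known overlap heuristic chosen for the example; the tiny `KNOWLEDGE_BASE`, its topic ids, and its context words are all invented here and say nothing about how the Topix.net categorizer works.

```python
import re
from typing import Dict, Optional, Set

# Toy stand-in for a knowledge base (invented for this example):
# ambiguous mention -> candidate topic id -> words that tend to appear near it.
KNOWLEDGE_BASE: Dict[str, Dict[str, Set[str]]] = {
    "Springfield": {
        "springfield-il": {"illinois", "lincoln", "capitol"},
        "springfield-ma": {"massachusetts", "basketball", "armory"},
    },
}


def disambiguate(mention: str, text: str) -> Optional[str]:
    """Return the candidate topic whose context words overlap the text most.

    Returns None when the mention is unknown or no candidate has any overlap.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_topic, best_overlap = None, 0
    for topic, context in KNOWLEDGE_BASE.get(mention, {}).items():
        overlap = len(words & context)
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic


print(disambiguate("Springfield", "Lawmakers met at the Capitol in Springfield, Illinois."))
print(disambiguate("Springfield", "A quiet day in Springfield."))
```

The first call resolves to `springfield-il`; the second returns `None` because the text gives no supporting context, which is the case where a bare keyword match would have guessed anyway.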

Our complete news system -- article crawler and extractor, story clustering engine, NewsRank determination, topic and locality categorizer, the Topix.net Knowledge Base, and the algorithmic editing system (the "Robo-Editor") -- is the most sophisticated algorithmic news editing system on the net. It's by no means finished, though -- so please keep the feature suggestions and bug reports coming and we'll keep improving it. :-)

Update: More on Topix.net's new algorithmic editing can be found in this Cyberjournalist article.