August 27, 2004

The Daily Internet

by skrenta at 8:47 AM

Jeremy Zawodny expands on a review of feed search engines: Why Hasn't Anyone Figured Out How To Do Feed Searches?

The kind of searches I regularly do on Feedster and Technorati just aren't available on Google. No amount of fiddling with the advanced search options, rooting around on their labs site, or searching for obscure options will scan incremental new material from half an hour ago. Yahoo search seems to have surpassed Google with some advanced features, but they don't have an effective reverse chronological "sort by date" either.

Google has been experimenting with crawling RSS feeds for some time, but we haven't seen any blog or feed search from them yet. Google News and Freshbot suggest they clearly have the technology to roll out a best-of-breed version of feed search. And they certainly don't hold back when it comes to trying out live product experiments. Scott Rafer of Feedster disagrees, and argues that getting feed search right is harder than it looks. He may have a point; clicking "sort by date" in Google News often sucker-punches the relevance. But it still seems odd that, with the biggies supposedly in cutthroat competition for search, they've left the field to Feedster as the best resource for this class of searches. Why?

Feeds & Blogs: Fad or something big?

In the 90's, Usenet was the largest distributed message board system on the net, and I depended on it and DejaNews (which provided Usenet search) to learn about new technology, troubleshoot problems, and research buying decisions. Folks on Usenet had something to say about everything -- cars, restaurants, apartments, houses, TVs, brands of computers, etc. -- and it was an indispensable source of research for me.

But I was part of a small technical minority, and over time Usenet declined in importance. Fewer and fewer people read and posted to it, and it lost its utility. Google bought DejaNews and even made it a tab on Google.com, but Usenet never went mainstream.

Now the blogosphere seems like the second coming of Usenet to me. Instead of 80-character courier posts and > quoting, we have rich-text HTML and images. There are subtle model differences, however, which should help protect the blogosphere against the anarchy and spam that brought Usenet down. Instead of newsgroups with perpetual turf wars, posts are maintained per-author, and trackbacks allow threading, but with built-in spam resistance. But the question remains: is this going to be the domain of a small, self-selected technical (or perhaps literary) elite, or will it have broader mainstream significance?

Perhaps it's heresy to say this on a blog, but I run into a lot of blog-skeptics here in Silicon Valley (I even work with some :-). They have an instinctual resistance to the wild-eyed enthusiasm surrounding blog media, learned from years of watching supposed next-big-things skulk away into memory (sometimes after considerable investment), and dismiss blogs as a fad. Or at best, like Usenet, forever the domain of an insular group that won't ever cross over to a mainstream audience (evidence to the contrary notwithstanding). Others may be taking a pragmatic view, seeing blogs simply as a cost-effective way to reach critical influencers, and thus make use of the blogosphere as a useful PR channel, regardless of whether blogs eventually "make it" as a mainstream media source or not.

Given Google's previous experience with Usenet, their powerful incremental search platform, and a healthy skepticism about embracing the Next Big Thing, they're likely pursuing a "smart follower" wait-and-see approach. If blogs and feed search do turn out to be critical, they'll be able to show up with a best-of-breed solution in short order.

But is this just another search application, or something bigger?

Unlike Usenet, hidden away on a network that only a few had access to, behind character-cell newsreading interfaces, blogs are indistinguishable in appearance or use from other Internet media outlets. Users often can't tell the difference between a high-quality blog and a "mainstream media" news site. And there are lots and lots of blogs. This material is part of what a typical internet user will be exposed to and consume. Collectively, it will become the largest source of incrementally published content on the net.

Search Engines are Phone Books

Local media advertising is dominated by two heavyweights: the phone books and the newspapers. Search engines are phone books; they provide a table of contents to the web. The goal of a search engine is to be as objective as possible; if you enter "ibm" into Google and ibm.com isn't the first result, there's no subjective editorial judgment, it's just broken.

This is the most lucrative point to hit folks with advertising. Double-digit CTRs and high conversion rates are the norm. Searches for "saturn vue" or "sunnyvale dentist" are valuable; by typing those terms into a search engine, users turn themselves into leads, often worth several dollars for a single click.

But phone books are boring. Nobody reads the phonebook for fun, and you can't advertise some things in the phone book. Nobody googles for lunch ("94303 cheeseburger"?) Sales of things people don't know they're looking for (tire sales, two-for-one pizzas), and marketing to demographics (e.g. advertising cosmetics to teenage girls) don't fit well with keyword advertising. For this, ad copy must be paired with dynamic content.

The Daily Internet

The proliferation of incremental content sources, all pumping out new material on a regular basis, is what the mainstream Internet user will consume. It's the difference between doing research and reading a magazine. At Topix.net we believe that editorial automation is necessary to manage this massive, growing content stream. Other startups like Feedster and Technorati are also focused on improving access to the incremental Internet. This is the future of audience on the net, as well as the next online advertising frontier. Rumors indicate Yahoo has something big in the works to embrace this shift; it will be interesting to see how MSN and Google respond.

Roundup

by skrenta at 7:47 AM

Jason Calacanis announces a redesign for Weblogs, Inc., and the addition of Google AdSense for monetization. The redesign looks great.

John Battelle riffs on evolving ad models and sparks a lively discussion.

Topix.net and A9 have the same amount of traffic. They're cheating with all of those amazon.com links though. ;-)

We've added a new education hierarchy to topix, with a pilot Standardized Testing page. Planning to add pages for school vouchers & charter schools. Other educational topic suggestions are welcome.

Mark Fletcher of Bloglines analyzes a blog meme propagation experiment and concludes that he has 42% market share for his web-based RSS aggregator.

Blogs by Google employees (via Steve Rubel at MicroPersuasion).

Greg Linden: Microsoft personalized search by December?

NewsDesigner: An amusing not-for-publication mockup slips onto a newspaper's live site.

Two introspections from the grey lady: What to Do When News Grows Old Before Its Time, and What Belongs on the Front Page of The New York Times. We're constantly trying to improve the editorial/time/magnitude mix on topix's front page, so getting in the head of real journalists and editors to understand how they think about their publications helps us better understand the tradeoffs and goal-space. Of course, our "mix" on the news is a different sort of animal entirely from a print newspaper's front page, but it's fascinating to see how small tweaks to our algorithms noticeably change the character of the site.

August 22, 2004

Web 2.0 Conference - The Web as Platform

by skrenta at 3:10 PM

John Battelle has posted the lineup of speakers and events at the Web 2.0 conference, and it's amazing. Topix.net is proud to be a media sponsor for this event as it looks to be one of the best forward-looking technology conferences of the year. Besides the huge roster of industry heavyweights, a number of new search startups will be announcing at this event.

August 2, 2004

Topix.net at Search Engine Strategies

by skrenta at 8:08 PM

Topix.net will be exhibiting this week at the Search Engine Strategies show in San Jose. If you'll be at the show, please stop by our booth and say hi. It's always nice to meet advertisers, users and partners face-to-face.

We also have some show specials, so come by to drop your business card in our fishbowl for the daily drawing and take away an ad coupon. :-)

Update: The SES show was great for us; it totally exceeded my expectations for the event. The last time I'd been to SES was around 2000, when we were with Netscape/AOL representing DMOZ. The show's a lot bigger now, and we talked to a lot of interesting companies with good bizdev opportunities for Topix.net. Chris Tolles from Topix spoke on a panel on the last day.

One of the best parts of the show for me was having Jakob Nielsen stop by our booth and spend a generous amount of time going over our redesign and offering feedback. Wow!

We also got good coverage throughout the week for our redesign & new features launch.

August 1, 2004

Topix.net: The best algorithmic news editing in the business

by skrenta at 2:09 PM

We're launching a new version of Topix.net today, with a next-gen version of our NewsRank story technology. NewsRank powers the relevance, accuracy and magnitude of the stories categorized on Topix.net.

The new front page uses a complex set of semantic story filters to govern news selection. The fully algorithmic editing process takes into account the magnitude of the story, as well as what the story is about, as determined by our AI categorizer from a Knowledge Base of 150,000 topics.

Other improvements also go live onto the site today, including:

  • Full Coverage sections backing up major stories. This lets users drill down on big stories with multiple viewpoints.
  • Determining the accurate time of the article, as opposed to how recently the story appeared on the web and was fetched by our crawler (addressing the phenomenon where a day-old story appears on a news aggregator with "8 minutes ago" as the timestamp).
  • Live Feed on the front page. These are raw headlines coming off of our news crawler. No categorization or ranking has been applied, other than profanity and automated QA filtering.
  • Press release coverage has been added to the business sections.
  • Email alerts are available for every Topix.net category.
  • RSS feeds are now available from our search results page, in addition to the 150k subject and location feeds.
  • Up to 7,000 sources in our news crawl.
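
The timestamp item in the list above can be illustrated with a short sketch. This is not Topix.net's actual logic, which the post does not describe; the function name article_time and the fallback rule are assumptions made for illustration. The idea is simply to prefer the time the article itself reports over the time the crawler happened to fetch it.

```python
from datetime import datetime, timezone
from typing import Optional


def article_time(published: Optional[datetime], fetched: datetime) -> datetime:
    """Pick the timestamp to display for a story (hypothetical helper).

    Prefer the article's own publication time. Fall back to the crawl time
    when no publication time was found, or when the stated time lies after
    the fetch (which suggests a bad or misparsed date).
    """
    if published is not None and published <= fetched:
        return published
    return fetched


# A story written the previous morning but crawled this afternoon keeps
# its original time instead of looking freshly published.
fetched_at = datetime(2004, 8, 1, 14, 9, tzinfo=timezone.utc)
written_at = datetime(2004, 7, 31, 9, 0, tzinfo=timezone.utc)
shown = article_time(written_at, fetched_at)
```

The hard part in practice is finding a trustworthy publication time inside the article in the first place; this sketch assumes that step has already happened.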

Our goal was to create a more compelling news experience than the other aggregators and online news sites. Rather than simply averaging together the top stories from major news outlets, our NewsRank engine is applying a set of editorial rules to guide the story selection process.

We want to de-homogenize the news selection; instead of averaging down, we want Topix.net to find and bring back the most interesting, compelling (and sometimes the oddest) stories from the deep corners of the web. Stories that won't show up on other sites.


Categorized Aggregation is Hard

Topix.net has an aggregated feed for every ZIP code in the US (and every country in the world), as well as hundreds of thousands of other subjects -- health conditions, sports teams, industries, and so on. How do we do it?

Not with human editing, source tagging, or keyword scanning. The Topix.net NewsRank engine is reading each story individually, determining locality and subject information based on the content of the article. NewsRank also condenses 17 dimensions of importance from every story into a single value.
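
The post does not say how the 17 dimensions are combined, so the following is only a sketch of one common way to fold several per-story signals into a single value: a normalized weighted sum. The dimension names and weights are made up for illustration and are not NewsRank's.

```python
def combine_scores(dimensions: dict, weights: dict) -> float:
    """Collapse several importance signals into one number.

    Assumed method: a weighted average. Dimensions missing from a story
    count as 0.0. This is an illustration, not the NewsRank formula.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted = sum(w * dimensions.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


# Hypothetical signals, each already scaled to the range 0..1.
story = {"recency": 1.0, "source_count": 0.5}
weights = {"recency": 2.0, "source_count": 2.0}
score = combine_scores(story, weights)  # (2.0 * 1.0 + 2.0 * 0.5) / 4.0
```

A single value makes stories directly comparable when ordering a page, at the cost of hiding which signal drove the ranking.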

Categorizing sources in order to produce topic aggregations doesn't work. Susan Mernit writes a great blog about online media, but she also writes about food and other personal topics. Blindly adding her entries to a food or media industry aggregation would result in inappropriate posts showing up.

Source-based categorization doesn't work for local, either. The San Francisco Chronicle runs stories that aren't about San Francisco. Conversely, there are many stories about events in SF that show up in news sources based outside of San Francisco. These stories would be missed with source-based tagging.

Keyword-driven filters are also a poor solution. Pulling every story out of the news stream with "San Francisco" in it will not make a good SF rollup, but instead will yield a random jumble of posts, most of which merely mention "San Francisco", but overall have nothing to do with it:

... on a business trip to San Francisco, ...
... an unrestricted free agent from San Francisco, ...
... was bound from Alaska to San Francisco in the winter of 1860 ...
... moved, with her family to San Francisco in 1960, ...

The situation is even worse if the keyword is ambiguous ("Kerry", "Bush", "Springfield").
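
The problem is easy to reproduce. Below is a minimal sketch of a keyword filter (the function name keyword_filter is an assumption, not anything from Topix.net) run over the four excerpts above. Every one of them passes the filter, although none is really about the city.

```python
SNIPPETS = [
    "... on a business trip to San Francisco, ...",
    "... an unrestricted free agent from San Francisco, ...",
    "... was bound from Alaska to San Francisco in the winter of 1860 ...",
    "... moved, with her family to San Francisco in 1960, ...",
]


def keyword_filter(stories, keyword):
    """Return every story whose text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [story for story in stories if needle in story.lower()]


matches = keyword_filter(SNIPPETS, "San Francisco")
print(len(matches), "of", len(SNIPPETS), "snippets matched")
```

A substring test can only tell that the phrase occurs, not whether the story is about the place.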

Our solution is to disambiguate references to people, places and subjects, and match them against our Knowledge Base of 150,000 topics. The result lets our algorithmic story editing technology leverage a much finer-grained idea of what a story is about than simply using the big 7 news categories (US, World, Business, Sci/Tech, Sports, Entertainment, Health.) We can bias up Olympics coverage while slighting movie reviews. Some pages on Topix.net are programmed to slightly favor sensational stories, others to de-emphasize the lurid.
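
As a rough illustration of what disambiguating a reference can mean, here is a toy sketch. The two-entry knowledge base, its context words, and the overlap-counting rule are all assumptions for the example; the post does not describe how the real Knowledge Base or categorizer works.

```python
import re

# Hypothetical toy knowledge base: each candidate topic for an ambiguous
# name lists words that tend to appear alongside it.
KNOWLEDGE_BASE = {
    "Springfield": {
        "Springfield, IL": {"illinois", "lincoln", "sangamon"},
        "Springfield, MA": {"massachusetts", "basketball", "connecticut"},
    },
}


def disambiguate(mention, text):
    """Pick the candidate topic sharing the most context words with the text.

    Returns None when the mention is unknown or no candidate has any
    supporting context, rather than guessing.
    """
    candidates = KNOWLEDGE_BASE.get(mention)
    if not candidates:
        return None
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_topic, best_score = None, 0
    for topic, context in candidates.items():
        score = len(words & context)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


topic = disambiguate(
    "Springfield",
    "Lincoln's home in Springfield draws visitors to Illinois",
)
```

Here the surrounding words "Lincoln" and "Illinois" tip the choice toward the Illinois city; with no supporting context the function declines to choose.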

Our complete news system -- article crawler and extractor, story clustering engine, NewsRank determination, topic and locality categorizer, the Topix.net Knowledge Base, and the algorithmic editing system (the "Robo-Editor") -- makes up the most sophisticated algorithmic news editing system on the net. It's by no means finished, though -- so please keep the feature suggestions and bug reports coming and we'll keep improving it. :-)

Update: More on Topix.net's new editorial algorithms can be found in this Cyberjournalist article.