November 6, 2005

Topix Tags Blogs

by skrenta at 1:56 PM

Today we added 15,000 top weblogs to the crawling/tagging engine. Blog posts are being categorized into our 30,000 local feeds as well as our 300,000 subject feeds. Our search results now include blog results, and posts should show up on our site and search index within 1-3 minutes of being crawled.

News vs. Blogs

There's been a lot of talk about whether bloggers are journalists. At Topix, we can ask a slightly different question -- Are blog posts news?

Others are doing a great job of providing relevant keyword search against blogs. But our mission is to discover the news within the sea of blog posts, and report it by location and subject. What we're releasing today is our first step in connecting our readers to 15,000 more voices talking about the topics they care about.

While Memeorandum and Digg are approaching the same problem, we needed a solution that would scale to our 300,000 newsfeeds.

Coverage: MSM vs. Blogs

We were curious how the breakdown of posts by topic from blogs would differ from mainstream media, and were blown away by the contrasts:

Adding these 15,000 voices to the conversation is a big win.

A note about the number of real blogs out there... There've been reports that there are 20 million weblogs, and one report even said there were over 100 million. This is one of those cases where statistics can be very misleading. While the total number of unique feeds that have ever existed, or blogging accounts that have ever been signed up, can certainly be counted, what is far more relevant to us is the composition of the daily posting stream. What we're seeing is that 85-90% of the daily posts hitting ping services are spam (take a look for yourself). Of the well-ranked non-spam blogs that we've discovered, we've found about half haven't been updated in the past 60 days. Our filters sift through what's left, which, even after discarding 95%, is still a great deal of good material.
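One way to read the figures above is as a simple funnel. The sketch below assumes the spam share and the stale share can be multiplied together, which the post does not state (one figure is about posts, the other about blogs); the function name and parameters are hypothetical.

```python
def surviving_fraction(spam_share: float, stale_share: float) -> float:
    """Fraction of the stream left after dropping spam, then stale sources.

    Assumption (not from the post): the two filters are independent,
    so the surviving shares multiply.
    """
    return (1.0 - spam_share) * (1.0 - stale_share)


# Using the post's upper spam figure (90%) and "about half" stale:
kept = surviving_fraction(spam_share=0.90, stale_share=0.50)
discarded = 1.0 - kept
print(f"kept {kept:.1%}, discarded {discarded:.1%}")
```

Under that assumption, 90% spam and 50% stale leaves roughly 5% of the stream, which is consistent with the "discarding 95%" figure.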

Inside the Box

How did we judge which blogs to add? We started by crawling about 1M blogs, and then began automatically filtering and ranking these using our NewsRank algorithms -- which consider a variety of factors, such as blog posting frequency, writing style, type of reference, popularity, and so forth. We ended up adding the top 15,000 sources that passed these tests.
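As a rough illustration of "filter, rank, keep the top N", here is a minimal sketch. It is not the NewsRank algorithm, whose details the post does not give: the `Blog` fields, the weights, and the normalization caps are all hypothetical stand-ins for the factors named above (posting frequency, writing style, popularity).

```python
from dataclasses import dataclass


@dataclass
class Blog:
    url: str
    posts_per_week: float  # posting frequency
    style_score: float     # 0..1, hypothetical writing-style measure
    inbound_links: int     # hypothetical stand-in for popularity


def score(blog: Blog) -> float:
    """Combine the factors into one number (weights are made up)."""
    frequency = min(blog.posts_per_week / 7.0, 1.0)     # cap at daily posting
    popularity = min(blog.inbound_links / 1000.0, 1.0)  # cap at 1,000 links
    return 0.3 * frequency + 0.3 * blog.style_score + 0.4 * popularity


def top_sources(blogs: list[Blog], n: int, min_score: float = 0.0) -> list[Blog]:
    """Drop blogs below min_score, then keep the n highest-scoring ones."""
    passing = [b for b in blogs if score(b) >= min_score]
    return sorted(passing, key=score, reverse=True)[:n]
```

With a crawl of about 1M candidates, a call such as `top_sources(crawled, n=15000)` would correspond to the arbitrary top-15,000 cutoff; changing `n` is all it takes to widen the set later.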

The graphs above reference postings from these top 15,000 blog sources, and our 12,000 mainstream media sources. Taking them together, we think this is the first time anyone has ever summarized the subject matter for that conversation everyone keeps talking about.

Stopping at the top 15,000 was an arbitrary cutoff for this first release. Frankly, this started as an internal experiment; we had no idea how well our engine would work on a large volume of blog material, but the quality of the posts we saw was so great that we decided to just launch the blogs today over the objections of our marketing staff. :-) We will continually add more sources, and our goal is to push toward automated coverage of 1M sources.

Some topix channels where bloggers really add to the experience:

Blogs and news are now on equal footing on our site. We're visually highlighting blog posts on our pages for the moment so you can tell the new material from the mainstream media posts, but consider the current display a beta; in all likelihood this won't be the final UI. We'd love to know what you think...

If we're not crawling and indexing your blog, we'd love to know about it. Please use this form to submit your feed for inclusion in our index.

November 4, 2005

now appearing at a theater near you...

by tolles at 5:42 PM

If you're interested in working with us here at Topix, we're always up for getting together -- and there are a couple of places to catch up with us in person outside of our offices as well. We're jazzed that we've been invited to present at a few great events over the next quarter. Specifically, Mike Markson (VP of Business Development) will be at the Kelsey Group's Interactive Local Media Conference, and I will be at Search Engine Strategies and Syndicate.

All of these are great places if you're looking to talk with us about what we're up to on the business side, as well as being some of the best events out there to find out about the latest and greatest in the search, media and content areas.

The Kelsey Group's Interactive Local Media Conference
Newspapers 2.0
Reston, VA, November 30

Search Engine Strategies
Meet the News Search Engines
Chicago, December 6th

Knight Ridder Digital Case Studies:
Leveraging Content in New Ways
San Francisco, December 14th

Oh -- and for those of you looking to talk to Rich -- he's kind of busy right now helping build stuff here. More about that in a little while...

November 1, 2005

We're Going to Need a Bigger Boat

by tolles at 7:27 PM

Fred Wilson relates a series of quotes, eventually positing that we're facing an Attention Crisis.

It's a good point, especially if you're funding (or building, or using, or participating in) technology companies that are, essentially, monetizing your attention through advertising. The issues are slightly different for each stakeholder here -- but the issues Fred raises apply to all of us.

When we were running the Open Directory at Netscape, an employee of mine, Jim Rainey, phrased this issue very eloquently --

"Time is the one thing that doesn't scale"

Fred goes on to point out that we're all reaching (have reached) the saturation point with regard to information flow -- How many feeds can you subscribe to?

We've been here before.

When the web was a small collection of sites, you could use bookmarks to keep track of things... later Yahoo built a hand-edited collection of sites... and eventually, when the web reached the 1B+ mark, you needed automation to help you find what you were looking for...

Now, Fred's pointing out that the Incremental Web of new, up-to-the-minute information is increasing faster than he (or any other reasonable person with a life) can keep up -- and that something's got to give. As Greg Linden notes in the comments of Fred's post --

I think it's become clear that it's impossible to manage the flood of information manually.

And, I'll point out that as the number of blogs goes from the millions to the tens of millions, discovery and management of the information relevant to you is a problem whose difficulty is accelerating.

But all is not lost -- in that restatement of the problem above lies the answer. Whether you prefer Topix, Findory, or Google News, you're going to need help managing your information, and looking at where the problem is going to be two years out, that help is going to involve quite a bit of automation.

Newsreaders and reading lists are great -- just like bookmarks and web directories are great -- and all of these are part of the solution. But we have a point of view about this at Topix -- that there's a great opportunity to help people with this "looming attention crisis" through building products that can scale to the Incremental Web. I've gotten grief from people about the humans vs. robots thing, but it's just SO clear that you're going to need some freakin' COMPUTERS to address this issue.

Automated tagging by topic of every news story from over 12,000 sources is the measure of our current effort -- there's a lot more work to be done here, but it's great to see other folks here start to understand the problem ahead.

So, with regard to the Bigger Boat: it's going to need an engine, not oarlocks.