November 6, 2005

Topix Tags Blogs

by skrenta at 1:56 PM

Today we added 15,000 top weblogs to the crawling/tagging engine. Blog posts are being categorized into our 30,000 local feeds as well as our 300,000 subject feeds. Our search results now include blog results, and posts should show up on our site and search index within 1-3 minutes of being crawled.

News vs. Blogs

There's been a lot of talk about whether bloggers are journalists. At Topix, we can ask a slightly different question -- Are blog posts news?

Others are doing a great job of providing relevant keyword search against blogs. But our mission is to discover the news within the sea of blog posts and report it by location and subject. What we're releasing today is our first step in connecting our readers to 15,000 more voices talking about the topics they care about.

While Memeorandum and Digg are approaching the same problem, we needed a solution that would scale to our 300,000 newsfeeds.

Coverage: MSM vs. Blogs

We were curious how the breakdown of posts by topic from blogs would differ from mainstream media, and were blown away by the contrasts:

Adding these 15,000 voices to the conversation is a big win.

A note about the number of real blogs out there... There have been reports that there are 20 million weblogs, and one report even put the number at over 100 million. This is one of those cases where statistics can be very misleading. While the total number of unique feeds that have ever existed, or blogging accounts that have ever been signed up, can certainly be counted, what is far more relevant to us is the composition of the daily posting stream. What we're seeing is that 85-90% of the daily posts hitting ping services are spam (take a look for yourself). Of the well-ranked non-spam blogs that we've discovered, we've found about half haven't been updated in the past 60 days. Our filters sift through what's left, which, even after discarding 95%, is still a great deal of good material.

Inside the Box

How did we judge which blogs to add? We started by crawling about 1M blogs, and then began automatically filtering and ranking these using our NewsRank algorithms -- which consider a variety of factors, such as blog posting frequency, writing style, type of reference, popularity, and so forth. We ended up adding the top 15,000 sources that passed these tests.
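To make the filter-then-rank idea concrete, here is a minimal sketch in Python. It is not the NewsRank algorithm, whose details aren't described here; the field names, the thresholds (including reusing the 60-day idle figure as a cutoff), and the weights are all assumptions made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class BlogStats:
    """Per-blog signals. All fields are hypothetical stand-ins."""
    url: str
    posts_per_week: float
    days_since_last_post: int
    inbound_links: int
    spam_score: float  # 0.0 (clean) to 1.0 (spam), assumed to come from a separate classifier


def passes_filters(blog, max_idle_days=60, max_spam=0.5):
    # Drop blogs that look abandoned or spammy. Both thresholds are assumptions.
    return blog.days_since_last_post <= max_idle_days and blog.spam_score < max_spam


def score(blog):
    # Combine posting frequency and popularity into one number.
    # Frequency is capped at 14 posts/week and scaled to 0..1;
    # popularity is log-damped so a few huge blogs don't swamp the rest.
    # The 0.4 / 0.6 weights are arbitrary illustration values.
    frequency = min(blog.posts_per_week, 14.0) / 14.0
    popularity = math.log1p(blog.inbound_links)
    return 0.4 * frequency + 0.6 * popularity


def select_top(blogs, limit):
    # Filter first, then rank the survivors and keep the best `limit`.
    kept = [b for b in blogs if passes_filters(b)]
    return sorted(kept, key=score, reverse=True)[:limit]
```

Calling `select_top(crawled_blogs, 15000)` on a list of `BlogStats` would mirror the shape of the process described above: crawl widely, discard what fails the filters, and keep a fixed number of top-ranked sources. Signals like writing style and type of reference are left out because they can't be sketched without guessing at how they're measured.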

The graphs above reference postings from these top 15,000 blog sources, and our 12,000 mainstream media sources. Taking them together, we think this is the first time anyone has ever summarized the subject matter for that conversation everyone keeps talking about.

Stopping at the top 15,000 was an arbitrary cutoff for this first release. Frankly, this started as an internal experiment; we had no idea how well our engine would work on a large volume of blog material, but the quality of the posts we saw was so great that we decided to just launch the blogs today over the objections of our marketing staff. :-) We will continually add more sources, and our goal is to push toward automated coverage of 1M sources.

Some topix channels where bloggers really add to the experience:

Blogs and news are now on equal footing. We're visually highlighting blog posts on our pages for the moment so you can tell the new material from the mainstream media posts, but consider the current display a beta; in all likelihood this won't be the final UI. We'd love to know what you think...

If we're not crawling and indexing your blog, we'd love to know about it. Please use this form to submit your feed for inclusion in our index.