October 29, 2004
at 4:12 AM
We've been measuring on-site vs. off-site story clicks for a while,
and have noticed a steady increase in the number of users who are
consuming Topix.net newsfeeds via RSS. The last week's logs
put our RSS usage at 12%.
Recently we instrumented our feeds to track story clicks per client.
This is a better measure of actual use than simply counting RSS feed
fetches, since it measures user clicks on stories in our feeds rather
than robot activity.
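A minimal sketch of the kind of per-client counting described above, assuming a hypothetical log line format of timestamp, client name, and story URL (the actual Topix log format isn't shown here):

```python
from collections import Counter

# Hypothetical click-log lines: "timestamp client story-url".  The
# real Topix log format isn't published, so this layout is an assumption.
log_lines = [
    "2004-10-29T04:01 myyahoo /story/123",
    "2004-10-29T04:02 bloglines /story/123",
    "2004-10-29T04:03 myyahoo /story/456",
    "malformed line",
]

def clicks_per_client(lines):
    """Count story clicks per RSS client, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts
```

Counting clicks rather than feed fetches is what filters out the robot traffic: a crawler fetches the feed, but only a human follows a story link.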
Caveats: this is only measuring Topix.net feed usage, so extrapolation
to overall market share is probably not sound. Also, I'm puzzled as
to why Bloglines isn't showing up higher in our stats. We have a healthy
count of subscribers reported from Bloglines in our server logs, but for
some reason they don't seem to click on our feed items very often. My
Yahoo's numbers took a big jump when they took their new RSS module out
of beta, but have since leveled off somewhat. Technorati doesn't seem to
be crawling our feeds, so they're missing from this picture as well.
October 27, 2004
at 2:58 PM
John Battelle pointed out
the latest Digital World report from Mary Meeker, where she outlines the developments of RSS, blogs, and online advertising. This is a good read for anyone interested in the web-publishing business, whether you are a self-publisher or a mainstream web site.
It was especially interesting to us that one of the feeds she subscribes to through My Yahoo (you can see this in the examples in the report) is the Topix.net Boston Red Sox feed. As a former NY'er, I am obviously troubled that anyone who lives/works in the Big Apple is subscribing to news about the Red Sox - especially after the debacle otherwise known as the 2004 ALCS. But with that said, I'm still happy that she chose a Topix.net feed to get her BoSox news. :-)
October 6, 2004
at 1:46 AM
Bill Gross invented the ad model that is Google AdSense, which supports search and much of the rest of online media now. He was mocked for it at the time. Now he's back, unveiling his new search engine Snap at the Web 2.0 conference in a jaw-droppingly cool 15-minute whirlwind demo.
I remember Snap when CNET owned it and they had to change their red exclamation point to a darker period because Yahoo complained. Somehow Bill Gross has gotten ahold of the domain. He always manages to get the best domains.
Snap has taken a terabyte of user session data secretly recorded from ISP backbones and used the post-search user behavioral info to rank site experience along multiple vectors.
Snap looks like it returns an Excel spreadsheet for your results. You can click on the columns to re-rank based on various dimensions, which may be search term dependent. If you search on something Snap knows about, like a product category, it will give you two spreadsheets, the first being a slick DHTML spreadsheet of price/features info.
My high-level take on this is that he's inserted a price-comparison shopping layer above product searches (the most valuable category of searches that users enter), and has implemented something akin to Andrew Goodman's idea of letting users individually optimize search results themselves. The engine thereby learns from their input, and overall becomes far more spam resistant.
He's also smartly differentiating Snap from Google with total transparency. You can see all the stats on the site: how many searches they get, how much money they make, what people searched on, what advertisers are paying, and more.
Up until two months ago, search was "done" (again) and meant a text-entry box and two buttons on a white page. Then we got MyJeeves, A9, and My Yahoo Personal Search. These are the opposite of spartan interfaces, instead opting for filing-cabinet features to appeal to power users. And now Snap raises the bar even further.
Is the search market segmenting? Via Battelle, we know that search power users account for most searches. Maybe Google is the AOL of search, limited to trying to be all things to all people with a two-word entry format, but the high-end users (the ones who buy lots of expensive stuff online) will graduate to more sophisticated interfaces.
October 4, 2004
Industry canon seems to favor RDBMS storage for everything. Logs, centralized user reg, even whole freakin' web crawls. Maybe this is a legacy from Oracle's phenomenal marketing machine that rolled over the valley in the 90's, which eventually (through skillful FUD to VCs, I was told) required even little startups to pay enormous sums for database licenses. Now that great databases like MySQL are free, coders are enjoying the luxury of SQL without the heavy price tag.
at 11:59 PM
But I'm not much of a database guy. The lesson I took away from watching the horror of Netscape's UREG database being down for two weeks after a RAID enclosure failure was that even fancy databases, expensive hardware and knowledgeable staff weren't a substitute for fail-safe KISS architecture. Folks at a small startup I knew that was acquired by Excite were expecting to finally get some help debugging their sick, monster DB; they were horrified to find Excite's internal systems in even worse shape than their own. A shopping engine I once knew had a nightmarish flow of chained databases, with a slow-boat-to-China 24 hour dataflow through the system end-to-end. The mess was so big, complex and expensive it was impossible to replicate in QA, meaning that testing occurred on the production system (with predictable results).
Studying Unix internals at USL left me with a desire to always optimize apps down to the syscall level (how many bytes could safely be appended atomically to a file per write in SYSV, hmmm). I like flat files, with operations set up so access is fast & failsafe.
Live servers must never wait for disk to serve a page, should try to avoid talking to sockets, and the only safe storage operations are write-with-append and rename(). Never use NFS for anything, mmap is your friend (thank you Google for helping get the kernel bugs out), and design your system so that you can cycle power with zero corruption. Locking is a last resort; locks wreck performance, and are a waste if your app has last-writer-wins
semantics anyway or you can use append & rename().
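The append-and-rename() pattern above can be sketched like this (in Python rather than C, purely as an illustration; the fsync() before the rename is what makes cycling power safe):

```python
import os
import tempfile

def atomic_write(path, data):
    """Publish new file contents so readers see either the old version
    or the new one, never a torn half-write.  rename() within a single
    filesystem is atomic on POSIX, which is the property that lets you
    cycle power with zero corruption."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)  # temp file on the same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # bytes must hit disk before the swap
        os.rename(tmp, path)      # the atomic commit point
    except BaseException:
        os.unlink(tmp)
        raise
```

No locks are needed: a concurrent reader that already opened the old file keeps reading it, and a crash between the fsync and the rename simply leaves the previous version live.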
In other words, Hotmail's architecture rather than eBay's. You can bet that Matt Wells is using some creative data structures in Gigablast.
A side effect of this approach, aside from reliability, is that systems built this way tend to be vastly more scalable.
Woz talked at Gnomedex about how being cheap made his designs better. Cheap leads to less parts, which means higher reliability. We use serial ATA and IDE raid. We used to have 3Ware cards driving the RAID, but the 3Ware card turned out to be a point of failure, so we just use the straight Linux RAID software now. Very cheap, very high performance.
Even better is to get rid of the need for RAID at all. If you have to replicate CPUs for high availability anyway, toss out the RAID on each and figure out a live mirroring system. You'll lose redundant spindles, and make the whole system cheaper and more reliable.
Everything on Topix is served from big mmap files made up of compressed data chunks. This supports thousands of hits per second per machine, is infinitely scalable, and we can update all 150,000 site pages every few minutes simply by pushing new wad files to the front ends.
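A toy version of the wad idea, with an assumed layout (Topix's actual format isn't described here): a chunk count, an index of (offset, length) pairs, then the zlib-compressed chunks, read back through mmap:

```python
import mmap
import struct
import zlib

# Assumed "wad" layout -- the real Topix format isn't published:
# a 4-byte chunk count, then (offset, length) pairs, then the
# zlib-compressed chunks themselves.

def build_wad(path, pages):
    blobs = [zlib.compress(p.encode("utf-8")) for p in pages]
    offset = 4 + 8 * len(blobs)          # data starts after count + index
    index = []
    for b in blobs:
        index.append(struct.pack("<II", offset, len(b)))
        offset += len(b)
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(blobs)) + b"".join(index) + b"".join(blobs))

def read_page(path, i):
    # mmap lets the kernel cache hot chunks in RAM, so the server
    # never explicitly waits on disk to serve a page.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            off, length = struct.unpack_from("<II", m, 4 + 8 * i)
            return zlib.decompress(m[off:off + length]).decode("utf-8")
```

Pushing an updated wad to a front end then reduces to the atomic rename() trick: build the new file beside the old one and swap it in.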
Our search backend isn't quite as cool, involving some legacy code, but it's getting there.
The largest factor in architecture, however, isn't how many machines you need, it's how productive the coders can be extending the system. Fortunately dead-simple architecture tends to be highly productive to code on. With fewer moving parts in the system, there is less mystery, a faster learning curve for new folks, fewer places for bugs and unexpected states to hide, and fewer lines of code that have to be maintained for a given component.
I've come across other flat-file and KISS adherents, but they're rare. I was told once that the VCs made Filo and Yang buy the Oracle licenses, but they left the software on the shelf, preferring instead to deploy simpler systems built from scratch on BSD. Those guys were smart. :-)
Also check out this amusing article from Smart Money in 2000:
"Google actually built its own database from scratch, and it's a wholly different type of software, called a 'flat file' database, according to Craig Silverstein, Google's director of technology."
Those guys are smart too. :-)
at 9:40 PM
Alexa hasn't changed in so long,
I had to blink and rub my eyes to see if I was in the right place.
Alexa now offers tracking of not only "rank" on their graphs, but also
now separately graphs "reach" and "pageviews". These were formerly
available in the tables, but were hard to visualize, especially when
comparing sites. The new options make it easy to see whether a site
gets its rank from a large number of page views per user, or with more
uniques but fewer PVs from each session.
October 2, 2004
Here at Gnomedex there's a tremendous amount of excitement and buzz
about RSS. But in my conversations about Topix.net's news crawling,
I'm finding some misconceptions about how widespread RSS syndication
is among traditional online publishers.
at 11:20 AM
Only 7% of the sources Topix.net crawls have XML feeds. I'd estimate
that only a few hundred of the top 3,000 newspapers we crawl have RSS
support. The rest we obtain with a news crawler which is good about
finding articles on news sites, leaving behind the ads and navigation
sidebars. It's low maintenance, so we don't have to change anything
every time a site redesigns its HTML.
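One common way to separate articles from ads and navigation is a link-density heuristic: nav bars and ad strips are link-heavy, article bodies are text-heavy. The sketch below is only an illustration of that general idea, not necessarily the approach Topix uses:

```python
import re

def link_density_score(block):
    """Words of plain text per link in an HTML block; article bodies
    score high, navigation sidebars and ad strips score low."""
    links = len(re.findall(r"<a\b", block, re.I))
    text = re.sub(r"<[^>]+>", " ", block)  # crude tag stripper
    return len(text.split()) / (1 + links)

def pick_article(blocks):
    """Pick the most text-dense block of a page as the article body."""
    return max(blocks, key=link_density_score)

nav = '<a href="/">Home</a> <a href="/sports">Sports</a> <a href="/ads">Ads</a>'
body = ('<p>The city council approved the new budget on Monday '
        'after a long debate over school funding.</p>')
```

Because the heuristic keys on page-wide statistics rather than specific markup, it keeps working when a site redesigns its HTML.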
Even for sites which offer feeds, we'll generally continue to crawl
the human-readable version. We've seen sites where the RSS broke but
no one at the paper seemed to notice, or cases where the RSS was out
of sync with the human-viewable web content. By crawling both we
get full coverage of the content available.
There are approximately 1,400 daily newspapers in the US, and over
2,600 weeklies. There are around 3,000 magazines, and thousands of
radio and TV station websites. Not to mention the city government
websites we crawl looking for local announcements.
Despite the enthusiasm around RSS, there is a long way to go before
the bulk of this content will be available in feeds.
at 10:27 AM
Chris Pirillo's Gnomedex definitely leads the pack with schwag.
Besides the Google glowing cups and blinky pins, which on the
bartender's suggestion I'm saving for the kids for Halloween,
Chris raffled off what seemed like hundreds of prizes last night.
Yours truly even won a Dolby 5.1 PC sound card. :-)
October 1, 2004
at 4:37 PM
Topix.net will be appearing at several events this fall:
Web 2.0 Conference, October 5-7
I'm co-hosting a session with Stewart Butterfield from Flickr on
"Dialing on the App Tone: How the Early Web OS is Shaping Up."
Kelsey Group Interactive Local Media Conference, Nov 3-5
Mike Markson from Topix.net on a panel: "Where Do Verticals Fit In?"
Accelerating Change 2004, Nov 5-7
I'll be on a panel about natural language search, and will chat about
"Text Analytics for News".
WebmasterWorld's World of Search, Nov 16-18
News Search panel.
If you'll be at any of these events, stop by and say hi.