January 18, 2004

Fast crawls by search engines scary but fun

by skrenta at 7:04 PM

Over the past week we've had many search engines and spiders visit topix.net. Generally spiders will rate-limit themselves to visiting a particular domain no more than once every 30 seconds. However, for a large site like Yahoo, Geocities or dmoz, this means that it could take half a year to finish indexing the whole site.
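To put rough numbers on that, here is a back-of-the-envelope sketch. The 500,000-page site size and the 5 pages/second comparison rate are assumptions chosen for illustration, not measurements of any of the sites named above.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def crawl_days(pages: int, seconds_per_fetch: float) -> float:
    """Days needed to fetch `pages` pages at one fetch every `seconds_per_fetch` seconds."""
    return pages * seconds_per_fetch / SECONDS_PER_DAY

# Hypothetical site size, picked only to illustrate the arithmetic.
pages = 500_000

# Polite default: one fetch every 30 seconds.
print(round(crawl_days(pages, 30)))      # 174 (days, roughly half a year)

# Assumed faster rate for comparison: 5 fetches per second.
print(round(crawl_days(pages, 0.2), 1))  # 1.2 (days)
```

At one page every 30 seconds a crawler gets through only 2,880 pages a day, so any site with a few hundred thousand pages takes months.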

But search engines want to have the freshest data, and webmasters want to be indexed as quickly as possible, so a few advanced crawlers will detect if they are visiting a very large site, and speed up dramatically if they sense that the site can handle the traffic.
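We don't know how any particular crawler decides that a site "can handle the traffic", so the following is only a minimal sketch of one plausible approach: shrink the delay between fetches while responses come back quickly, and back off when they slow down or fail. The function name, thresholds, and halving/doubling policy are all assumptions for illustration, not a description of Googlebot, Teoma, or Scooter.

```python
def next_delay(current_delay: float,
               response_seconds: float,
               ok: bool,
               fast_threshold: float = 0.5,   # assumed: "fast" means under 0.5 s
               min_delay: float = 0.2,        # assumed floor: 5 fetches/second
               max_delay: float = 30.0) -> float:  # polite default from above
    """Return the delay to wait before the next fetch from the same site."""
    if not ok or response_seconds > fast_threshold:
        # Server looks stressed (error or slow response): back off.
        return min(max_delay, current_delay * 2)
    # Server answered quickly: speed up, but never below the floor.
    return max(min_delay, current_delay / 2)

# Start at the polite default and simulate a run of fast, successful responses.
delay = 30.0
for _ in range(8):
    delay = next_delay(delay, response_seconds=0.05, ok=True)
print(delay)  # 0.2

# A single slow response doubles the delay again.
print(next_delay(delay, response_seconds=2.0, ok=True))  # 0.4
```

The point of the design is that the crawler only accelerates while the site keeps proving it can keep up, and it gives ground as soon as the site shows strain.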

We observed this firsthand the second day after our launch. Googlebot was the first to show up, and quickly accelerated to about 1 hit/second. Teoma arrived and spent half a day fetching 30,000 or so pages. But then AltaVista's spider Scooter arrived and really fetched up a storm, pulling well over 5 pages/second at the peak. I thought for a minute it was a DoS attack until I saw that it was just AltaVista indexing us. :-)

Fortunately we've built a wicked-cool page serving infrastructure, so our servers didn't even break a sweat. Load on one server peaked at 1.14 with 75% CPU idle. Not bad for a pair of Supermicro 1U Linux boxes. We haven't even added the planned third front-end server to the cluster yet. At this rate we may not need to for a while and can hold it back as a hot spare in the rack.