February 27, 2006

Every word in every document is already a tag

by skrenta at 8:20 AM

Back when web directories were still cool, AOL had an effort to build their own based on the Dewey Decimal System. They had 60 contractors in Arizona typing in web URLs and assigning DDC numbers to them.

This didn't work. But why?

Because two thoughtful, non-malicious humans sitting next to each other will tag the same URL differently. (And, in this particular case, the most obscure URLs would default to more prominent positions in the DDC hierarchy, because they couldn't be classified.)

When you pull up the results of this exercise under a particular DDC number to get that category page, it's junk. It's missing a lot of stuff it should have, and it has stuff it shouldn't.

Before we had full text search of the world's knowledge at our fingertips, search systems would let you retrieve documents by keywords. If the item you were looking for hadn't been given the right keywords, it was undiscoverable. "Internet Law?" "Software Patents?" "IP Theft?" Modern search systems consider every word or phrase in the document a tag.
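To make the "every word is a tag" idea concrete, here is a minimal sketch of an inverted index in Python. It isn't how any particular search engine does it; the function names (tokenize, build_index, search) and the two sample documents are made up for illustration, and a real system would also handle phrases, stemming, and ranking.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document ids that contain it.

    Every word in every document acts as a tag; nobody has to assign it.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, for illustration only.
docs = {
    "a": "A court ruling on software patents and internet law.",
    "b": "Vacation photos from Arizona.",
}
index = build_index(docs)
print(search(index, "software patents"))
```

Nobody tagged document "a" with "software patents", but it is still findable under those words because they appear in its text.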

Chris posted a rant about tagging here previously. I go back and forth on them.

On one hand, tags work because they maximize participation with a simple ask of the user, and the effects of social use help a rough standardization emerge around them.

But tags aren't a panacea, since they're excessively vulnerable to spam, and items that should belong to the same category will get different tags from different users. Which is it, "topixnet" or "topix"?

They're uniquely valuable in a system like Flickr since photos don't have any text of their own to keyword search, so getting the user to add any searchable text at all is a big win. You can ask users to caption their photos, but often putting in just a word or two is easier, so the participation level is higher.

But if you have the full text of the web, or blogosphere, or whatever, the marginal utility of the "keywords" tag on the document seems to be rather low. To deal with spam and relevance issues, the search interface for a large collection needs to be appropriately skeptical about what documents are claiming to be about.
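One simple way to be skeptical, sketched here in Python under my own assumptions, is to trust a claimed keyword only if its words actually show up in the document's text. The supported_keywords function and the sample data are hypothetical, and a real search interface would weigh many more signals than this.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def supported_keywords(claimed_keywords, body_text):
    """Keep only the claimed keywords whose words all appear in the body."""
    body_words = tokenize(body_text)
    kept = []
    for keyword in claimed_keywords:
        words = tokenize(keyword)
        # An empty keyword proves nothing, so it is dropped.
        if words and words <= body_words:
            kept.append(keyword)
    return kept

# Hypothetical example: one of the three claimed keywords is spam.
claimed = ["software patents", "cheap watches", "topix"]
body = "Topix comments on software patents reform"
print(supported_keywords(claimed, body))
```

The claim "cheap watches" gets dropped because nothing in the text backs it up, while the other two survive.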

It's great if you can get the user to enter additional metadata about their posts. But if you aren't already looking at the existing text, you're missing a lot of pre-existing "tags".