Wild Metadata

Jonathan O'Donnell

"...the wild Columbine that sits near the edge of the woods..."

Norman Walsh, Summer Flowers and Metadata

The problem

On the Web, DC.description and DC.subject are not very effective finding aids when the full text is indexed.

The solution

Wild metadata, such as anchor text, blog descriptions and folksonomies, may provide better description and subject (or keyword) metadata.

Podcast and demonstration

"tamed or cultivated: not wild"

WordNet, antonym of Wild

Background

At DC-ANZ 2005, David Hawking convinced me that DC.Description and DC.Subject metadata aren't very useful finding aids when the full text of a Web page is indexed. He showed a comparison of searches based on subject and description metadata versus searches based on anchor text alone, and the anchor text search was just as effective.

David Hawking and Justin Zobel work on search and retrieval problems. A lot of their research ends up in Panoptic, a commercial search engine. They know their stuff.

They looked at the Web site of a large Australian university, one that mandates metadata for all pages, and has put a great deal of effort into helping people create metadata for their pages. Hawking and Zobel, in conjunction with staff at the university, tested different sample queries against various indexes of the Web site.

They looked at:

Figure: graph showing the performance of metadata versus standard Web indexing information.

This graph, drawn from David Hawking's talk at DC-ANZ 2005, shows some of the results that they found.

Note the 6th column, "Subject and description", and the 7th column, "content, title, subject and description". They score about 10-15% on the right-hand scale. That means that the correct Web page is among the first 10 results about 10-15% of the time. Searching anchor text alone improves that result to about 50% of the time.
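The right-hand scale can be read as a "success at 10" rate: the share of test queries for which the correct page turns up in the first 10 results. The following is a minimal sketch of that calculation, not Hawking and Zobel's evaluation code. The function name `success_at_k`, the queries and the paths are all made up for illustration.

```python
def success_at_k(ranked_results, correct_pages, k=10):
    """Return the fraction of queries whose correct page is in the first k results."""
    if not ranked_results:
        return 0.0
    hits = sum(
        1
        for query, results in ranked_results.items()
        if correct_pages[query] in results[:k]
    )
    return hits / len(ranked_results)


# Hypothetical toy data: two queries, each with one known correct page.
ranked_results = {
    "library opening hours": ["/library/hours", "/library", "/contact"],
    "enrolment dates": ["/news", "/about", "/contact"],
}
correct_pages = {
    "library opening hours": "/library/hours",
    "enrolment dates": "/students/enrolment",
}

score = success_at_k(ranked_results, correct_pages)
print(f"Correct page in the top 10 for {score:.0%} of queries")
```

Running the same calculation over each index (title only, anchor text only, subject and description, and so on) would give one column of the graph per index.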

Anchor text and Web mix gave the best results, far exceeding any of the others. But it was scary to see just how poorly the subject and description metadata performed. It often gave worse results than just searching on the title alone, and only performed slightly better than searching the text of the URL.

Think about that for a minute. In most cases, you would be better off searching the title of Web documents than the subject and description metadata. You would get almost as good results by indexing the text in the URLs!

In their article, "Does Topic Metadata Help with Web Search?" Hawking and Zobel said:

"In summary, subject and description metadata performs worse than all other forms of evidence examined, other than the text of the URL."

"Given the large cost of creating better metadata, and the low benefits observed in our experiments, it is difficult to see how the investment in creating it can be justified."

"We conclude that topic metadata is of little value in processing web queries of the type that dominate enterprise query logs."

In their conclusion, they say:

"We found little evidence that metadata was of value for queries extracted from the query log for that site, even when the index was restricted to the central, well-managed site. For the most popular queries, metadata was superior to content but was inferior to alternatives such as anchor text... For all other queries metadata was outperformed by title and dramatically outperformed by other evidence."

David Hawking and Justin Zobel, Forthcoming, "Does Topic Metadata Help with Web Search?" Journal of the American Society for Information Science and Technology (JASIST).

They then checked their results against an Australian Government Web site, to check that the university was not an anomaly. They found no major difference in the government site.

As far as I can see, Hawking and Zobel are saying that topic metadata (DC.description and DC.subject) is irrelevant for enterprise-level Web searches. The combination of title, anchor text and content indexing is so powerful that it is an order of magnitude more useful than topic metadata. So much so that adding topic metadata into the mix made no appreciable difference to the results in almost all cases.

"wild: uninhabited area left in its natural condition"

WordNet, definition of Wild

Why DC.subject and DC.description fail

Hawking and Zobel found that only a little over half the pages (56.5%) had subject or keyword metadata. In almost half of those pages (43.6%), the only subject metadata was the name of the university. This is useful in distinguishing these pages from other organisational Web pages, but it doesn't help differentiate between pages within the University site.

In an article on "feral hypertext", Jill Walker points out that some people want to impose order on the Web, but it is, by its very nature, impossible to control. She says:

"The desire for discipline is evident in calls for systematically typed links, standardised metadata and a well-coordinated semantic web. Yet as sensible as these systems are, the web remains messy and unplanned. There are too many creators out there, and few bother to add metadata or follow standards. Even those who know the importance of metadata may fail to categorise their data in fear of failing to apply the taxonomy correctly. In addition, metadata is easily abused. Spammers have made metadata close to meaningless by adding irrelevant tags to their porn and gambling sites."

Jill Walker, 2005, "Feral hypertext: when hypertext literature escapes control", In Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia (Salzburg, September 2005).

"wild: primitive state untouched by civilization"

WordNet, definition of Wild

What does metadata look like in the wild?

The fact that good results could be obtained from searching anchor text set me thinking. What other sorts of topic metadata are there? What does metadata look like in its natural state?

Aside from Web page authors, lots of people spend time indexing and categorising Web pages. They build links, write blog entries and tag pages in folksonomies. This metadata is wild - it is not crafted or controlled by the agency that created the page. It hasn't been commissioned and it represents a variety of world views. Individually, these pieces of metadata may not be very useful. In numbers, however, the irregularities begin to smooth out and the information may be as good as or better than metadata written by a Web page author.

The quality will not be as good as that of trained librarians applying metadata via a standardised system and controlled vocabularies. It will, however, be as good as or better than that of untrained people applying metadata to their own pages. It will also be better than no metadata at all.
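One way to picture the irregularities smoothing out is to count how many different people used each tag for a page, and keep only the tags that more than one person agreed on. This is a rough sketch under that assumption. The function name `aggregate_tags`, the `min_people` threshold and the sample tags are hypothetical.

```python
from collections import Counter


def aggregate_tags(tag_sets, min_people=2):
    """Count how many people used each tag and keep the tags used by at least min_people."""
    counts = Counter()
    for tags in tag_sets:
        # Normalise case and count each tag once per person.
        counts.update({tag.strip().lower() for tag in tags if tag.strip()})
    # Most widely used first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [(tag, n) for tag, n in ranked if n >= min_people]


# Hypothetical tags that three different people gave to the same page.
tags_by_person = [
    ["NewZealand", "encyclopedia"],
    ["encyclopedia", "history"],
    ["encyclopedia", "newzealand", "cool"],
]

agreed_tags = aggregate_tags(tags_by_person)
print(agreed_tags)
```

One-off tags such as "cool" drop out, while the tags that several people chose independently remain as candidate subject terms.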

Weblog entries

Web log (blog) entries consist of a link to a Web page and a short piece of text describing, or commenting on, the contents of that page.

Tags

The tags on del.icio.us and Flickr act like keywords. They are single words, attached to pages or images, that help to describe the content. They are used to group, or find, similar concepts.

They aren't foolproof. In her article on "feral hypertext", Jill Walker uses the example of the tag, 'bush', on Flickr. It is used to describe photos of plants as well as photos of US political events.

However, they are often the sort of terms that people use to find information.

Anchor text

Anchor text consists of the words that people use when they link to a Web page. They are the words that you click on when you follow a link.

For me, they sit somewhere between a tag and a blog entry. Often, they are a very short description of the page that they link to. And you can relate them unambiguously to a Web page, since they are pointing to that page.

The anchor text examples that really got me excited were the ones in foreign languages. Having people provide very short descriptions of your page (or site) in a foreign language seems to me to be a fantastic boon.
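Because anchor text points unambiguously at a page, it can be pulled out of any linking page whose HTML you already have. The sketch below uses Python's standard `html.parser` module and assumes the link's `href` matches the target URL exactly. The class name `AnchorTextCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the text of every <a> element whose href equals target_url."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.anchor_texts = []
        self._buffer = None  # None while outside a matching link

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.anchor_texts.append(text)
            self._buffer = None


# Hypothetical fragment of a page that links to the target.
sample_html = (
    '<p>See <a href="http://www.teara.govt.nz/">Te Ara, the '
    "<b>Encyclopedia</b> of New Zealand</a> and "
    '<a href="http://example.org/">another site</a>.</p>'
)

collector = AnchorTextCollector("http://www.teara.govt.nz/")
collector.feed(sample_html)
print(collector.anchor_texts)
```

A fuller version would need to normalise URLs (trailing slashes, relative links) before comparing them, which this sketch deliberately leaves out.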

"wild: involving risk or danger"

WordNet, definition of Wild

How do you hunt for wild metadata?

A rough and ready method consists of finding pages that display anchor text, weblog summaries and folksonomy tags for a given page. Preference is given to pages that provide results in a well-formed XML format, as these assist the harvesting process.

Weblog entries

Most blog search engines will allow you to search for links to a particular URL. Often you will have to explore their 'Advanced search' function to discover how to do it, though.

Tags

del.icio.us will show you all the tags, and the comments, that people have attached to a given URL. The tags are shown on the right, with the more popular tags shown in larger text.

Anchor text

Both Yahoo! and Google will allow you to search for links to a given URL.

Preface the URL with "link:", so, for example, link:http://www.teara.govt.nz/. Note that there is no space between "link:" and the URL, and that you should specify the URL completely, including "http://".
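The two rules above (no space after "link:", and a complete URL including "http://") are easy to get wrong by hand, so a small helper can enforce them. This is only a sketch: the function name `link_query` is hypothetical, and it builds the query string without contacting any search engine.

```python
from urllib.parse import urlsplit


def link_query(url):
    """Build a "link:" search query for a complete URL."""
    url = url.strip()
    parts = urlsplit(url)
    # The URL must be specified completely, including the scheme.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"URL must be complete, including http://: {url!r}")
    # No space between "link:" and the URL.
    return "link:" + url


print(link_query("http://www.teara.govt.nz/"))
```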

Yahoo's Site Explorer will allow you to search for links to a single page, or a whole site. You can include or exclude internal site links. Unfortunately, neither Yahoo! nor Google will show you the actual anchor text.

"wild: talking or behaving irrationally"

WordNet, definition of Wild

How do you capture wild metadata?

It is all well and good to put metadata into a document. You have to be able to get it out again for it to be any use.

Several of these services, including Yahoo! and del.icio.us, provide their results in RSS or Atom format. I'm not sure if there are DC.metadata harvesters that can parse RSS or Atom feeds as metadata. The possibility exists - I just can't point to an example.

They all provide APIs that will allow you to draw the data out, too. This opens up the possibility of automatically capturing and processing the information.

Weblog entries

Blogdigger is a blog search engine. It will provide the results of a link search as an RSS feed. Each blog entry is marked as an item.

Each item consists of a title, a link to the original article, a description (which contains the blog text), a pubDate (publication date), a source (the name of the blog) and an author.
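Given a feed with items of that shape, the fields can be read with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch that assumes a plain RSS 2.0 layout with no namespaces. The sample feed is invented for illustration and is not real Blogdigger output, and the function name `parse_blog_items` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The item fields described above.
FIELDS = ("title", "link", "description", "pubDate", "source", "author")


def parse_blog_items(rss_text):
    """Return one dict per <item>, with an empty string for any missing field."""
    root = ET.fromstring(rss_text)
    return [
        {field: (item.findtext(field) or "").strip() for field in FIELDS}
        for item in root.iter("item")
    ]


# Hypothetical feed, shaped like the description above.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Links to http://www.teara.govt.nz/</title>
    <item>
      <title>A new encyclopedia</title>
      <link>http://blog.example.org/2006/new-encyclopedia</link>
      <description>Te Ara is an online encyclopedia of New Zealand.</description>
      <pubDate>Mon, 02 Jan 2006 10:00:00 GMT</pubDate>
      <source>Example Blog</source>
      <author>A. Blogger</author>
    </item>
  </channel>
</rss>"""

for entry in parse_blog_items(SAMPLE_FEED):
    print(entry["source"], "-", entry["description"])
```

The description field is the interesting one here, since it holds the blogger's own short account of the page.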

Tags

del.icio.us is nice enough to use RDF and Dublin Core formats in its RSS feed. Each item includes a title, link, dc:creator, dc:date, dc:subject, and then the tags as RDF resources.
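A feed of this kind can be read in much the same way, as long as the namespaces are declared. The sketch below rests on several assumptions that should be checked against a real feed: that the items are RSS 1.0 elements, that the Dublin Core elements use the standard `http://purl.org/dc/elements/1.1/` namespace, and that `dc:subject` holds the tags as a space-separated string. The sample feed, the user name and the function name `parse_tagged_items` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces for an RSS 1.0 (RDF) feed carrying Dublin Core elements.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_tagged_items(feed_text):
    """Return one dict per item, with dc:subject split into a list of tags."""
    root = ET.fromstring(feed_text)
    records = []
    for item in root.findall("rss:item", NS):
        subject = item.findtext("dc:subject", default="", namespaces=NS)
        records.append({
            "title": item.findtext("rss:title", default="", namespaces=NS),
            "link": item.findtext("rss:link", default="", namespaces=NS),
            "creator": item.findtext("dc:creator", default="", namespaces=NS),
            "date": item.findtext("dc:date", default="", namespaces=NS),
            "tags": subject.split(),  # assumption: space-separated tags
        })
    return records


# Hypothetical feed, shaped like the description above.
SAMPLE_FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://www.teara.govt.nz/">
    <title>Te Ara</title>
    <link>http://www.teara.govt.nz/</link>
    <dc:creator>example_user</dc:creator>
    <dc:date>2006-01-01T00:00:00Z</dc:date>
    <dc:subject>newzealand encyclopedia history</dc:subject>
  </item>
</rdf:RDF>"""

for record in parse_tagged_items(SAMPLE_FEED):
    print(record["link"], record["tags"])
```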

Anchor text

Unfortunately, I have not been able to find a way to access anchor text directly. Any assistance would be greatly appreciated.

However, several of these services will allow you to extract the information via their APIs.

I have developed a little demonstration that can be used to play with this idea. If you type in the URL of a page, you will get back some tags, some descriptions and some anchor text.

If it has been tagged on del.icio.us and blogged by someone who is indexed by Technorati, you will get back:

It works best with home pages. The del.icio.us part will only parse an actual page, not a site. The Technorati part will parse a domain name, but the service overall is limited by the del.icio.us part of it.

"wild: extravagantly fanciful and unrealistic; foolish"

WordNet, definition of Wild

Possibilities

So, even if you can find it and capture it, what is it good for? Here are some possibilities. I would be very interested in hearing more.

As a manual process

As an automatic process

"wild: intensely enthusiastic about or preoccupied with"

WordNet, definition of Wild

Advantages

"wild: located in a dismal or remote area; desolate"

WordNet, definition of Wild

Disadvantages

References

"wild: without a basis in reason or fact"

WordNet, definition of Wild

Thank you

I'm not sure if these ideas are in any way useful or practical, but it is interesting to think about the possibilities.

Thanks to Conal Tuohy for encouragement and suggestions, Jennifer Trant for Jill Walker's article, Feral Hypertext, Liddy Nevile for allowing me to present this idea at DC 2006, and Galdson for the music that I used in the podcast.