Wild Metadata

Jonathan O'Donnell

"Google automatically discovers the existence of the file and automatically indexes all the words in the document.

This makes the process of entering new information on the Web much easier, but means that the search for a document is quite naive."

Author not cited, 'Metadata', a slide from an Open Access Workshop

Metadata versus the Web

This graph is drawn from the PDF of David Hawking's talk of 30 May 2005, "Poor search facilities cost money - is metadata the answer?", given at DC-ANZ 2005.

I have referenced it in my Web page on wild metadata. This page provides a larger version of the graph and some further information about its contents.

graph showing the performance of metadata vs standard Web indexing information

The graph shows the results of 398 queries run on the target Web site, an anonymous Australian university.

The queries were run against nine different versions of the Web site data:

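The nine results are graphed against the mean reciprocal rank of the first best answer (MRR1), from the first ten results. Each query scores the reciprocal of the rank at which the best answer appears (1 at rank 1, 1/2 at rank 2, and so on), or 0 if the best answer is not in the first ten results. These scores are then averaged over all the queries. So, for example, if one of these techniques scores 50%, that is equivalent to finding the best answer at rank 2 out of 10 for every query. If it scores 100%, the best answer is found at rank 1 every time. If it scores 0%, the best answer was never found in the first ten results. A high percentage score is good, a low percentage score is bad.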
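The calculation behind an MRR1 score can be sketched in a few lines of Python. This is only an illustration of the general measure, not the code used in the talk: the function names and the sample ranks are made up for the example, and a rank of None is assumed to mean that the best answer did not appear at all.

```python
from typing import Optional, Sequence


def reciprocal_rank(rank: Optional[int], cutoff: int = 10) -> float:
    """Score one query: 1/rank if the best answer is within the cutoff, else 0."""
    if rank is None or rank > cutoff:
        return 0.0
    if rank < 1:
        raise ValueError("ranks start at 1")
    return 1.0 / rank


def mrr1(ranks: Sequence[Optional[int]], cutoff: int = 10) -> float:
    """Average the per-query scores over all queries."""
    if not ranks:
        raise ValueError("at least one query is needed")
    return sum(reciprocal_rank(r, cutoff) for r in ranks) / len(ranks)


# Hypothetical ranks of the best answer for four queries (not data from the talk).
always_second = [2, 2, 2, 2]       # best answer at rank 2 every time
hit_or_miss = [1, None, 1, None]   # top result half the time, absent otherwise

print(mrr1(always_second))  # 0.5
print(mrr1(hit_or_miss))    # 0.5
```

Both made-up sets of queries score 50%, which is why a score is best read as an average over all the queries rather than as the rank of the best answer in any single search.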

Looking at the graph, the approximate results are:

There is a red line across the graph that marks the mean of the mean reciprocal rank of the first best answer (MRR1). It represents the mid-point between all the results. In some ways, it is similar to the average of the best and the worst results. It sits at 23.3%. All the results sit below it, except for anchor text and Web mix, which soar far above it.

"Subject and description" and "content, title, subject and description" are scoring about 10 - 15 % on the right-hand scale. That means that the correct Web page is among the first 10 results about 10 - 15% of the time. Just searching anchor text will improve that result to about 50% of the time. Standard Web search engine techniques will improve the result to almost 75%, a enormous increase.