If my years in and around relational databases have taught me anything, it’s that a search can only retrieve what the search engine is looking for.
Marked-up repositories (such as the entirety of what Google indexes and provides search results on), while very different from relational databases in many important ways, are not so different here.
Search engines search for what they have been told (in advance) will be there. They don’t search for what’s missing. (A feature we could use, quite frankly, but I wouldn’t begin to know how to architect it.) When the web was small, search engines just searched the contents of the web pages that were out there, and built algorithms to help with relevancy and context (Mercury the planet vs. mercury the element, for example). As the web grew, it became clear that the authors of web pages had to tell search engines what was in those pages – because searching the raw data was proving too time-consuming, and was generating inconsistent results (not good if you are trying to sell advertising alongside those results). Keywords for websites would help bolster those relevancy and context algorithms, adding weight to certain content – and if keywords weren’t used, those pages would find themselves at the end of a very long list of pages. Web developers used to do this by using the HTML “meta” tag.
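For readers who never hand-coded a page, that keyword declaration looked something like this (a minimal sketch; the page topic and keyword values here are made up for illustration):

```html
<head>
  <title>Mercury: The Element</title>
  <!-- The author's claimed keywords for the page. Search engines
       once weighted these; most now ignore them because the tag
       was so widely abused. -->
  <meta name="keywords" content="mercury, element, chemistry, quicksilver">
  <!-- The description tag fared somewhat better, and is still often
       used for the snippet shown under a search result. -->
  <meta name="description" content="An overview of the element mercury and its properties.">
</head>
```

Nothing verifies that those keywords match the page’s actual content – which is precisely the problem the next paragraphs describe.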
There are two problems with this approach.
The first problem is that developers sometimes lie. They insert “meta” tags with keywords that have nothing to do with the topic of the web page, in the hopes of attracting traffic. This is why search engines have, for some time, ignored the “meta” keywords tag. It is polluted with bad information.
The second problem is that there are billions of web pages. Any structured way for web developers to tell search engines what’s in their websites has got to scale to a nearly infinite degree. This means that it has to be easy, and it has to be easy to retroactively apply to already-existing web pages.
But if it’s too easy, then there is a risk of data pollution.
It’s a tense race, basically, between volume and honesty.
And it’s only going to get more tense as more websites are created, and more print resources are digitized and opened for search (this includes books). As many billions of web pages as there are now, there will be exponentially more. Possibly a googolplex.
Most of which are authored by people with good intentions who want to get their information found and contextualized; many of which are authored by people who really really want their web pages to show up in the top 3 or 4 for a given search term; some of which are authored by people of dubious character and even more dubious motivation, who assign tags like “Britney” to websites about onion farming.
The way the web reflects the noise and bumptiousness of human nature never ceases to amaze me. Whatever structure we invent to organize our communications (and humans are ridiculous communicators), it will be sabotaged. But that structure is, nevertheless, what we have. Without it, we are even worse off.