
Search is not complicated; it’s just hard

If my years in and around relational databases have taught me anything, it’s that a search can only retrieve what the search engine is looking for.

Marked-up repositories (such as everything Google indexes and serves up in search results), while very different from relational databases in many important ways, are not so different here.

Search engines search for what they have been told (in advance) will be there. They don’t search for what’s missing. (A feature we could use, quite frankly, but I wouldn’t begin to know how to architect it.)

When the web was small, search engines simply searched the contents of the web pages that were out there, and built algorithms to help with relevancy and context (Mercury the planet vs. mercury the element, for example). As the web grew, it became clear that the authors of web pages had to tell search engines what was in those pages, because searching the raw data was proving too time-consuming and was generating inconsistent results (not good if you are trying to sell advertising alongside those results). Keywords for websites would help bolster those relevancy and context algorithms, adding weight to certain content – and pages that didn’t use keywords would find themselves at the end of a very long list of pages. Web developers used to do this with the HTML “meta” tag.
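
(For the curious: a keywords “meta” tag sits in the page’s head – <meta name="keywords" content="onions, farming"> – and a crawler reads the list out of its content attribute. Here is a rough sketch of that extraction using Python’s built-in html.parser; the page itself is invented, purely for illustration.)

    from html.parser import HTMLParser

    # An invented page that declares its keywords the old-fashioned way.
    PAGE = """
    <html>
      <head>
        <title>Onion Farming Basics</title>
        <meta name="keywords" content="onions, farming, agriculture">
      </head>
      <body><p>Plant your sets in early spring.</p></body>
    </html>
    """

    class MetaKeywordParser(HTMLParser):
        """Collect the content of any <meta name="keywords"> tag."""
        def __init__(self):
            super().__init__()
            self.keywords = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name", "").lower() == "keywords":
                self.keywords += [k.strip() for k in attrs.get("content", "").split(",")]

    parser = MetaKeywordParser()
    parser.feed(PAGE)
    print(parser.keywords)   # ['onions', 'farming', 'agriculture']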

There are two problems with this approach.

The first problem is that developers sometimes lie. They insert “meta” keywords that have nothing to do with the topic of the web page, in the hopes of attracting traffic. This is why search engines have, for some time, ignored the keywords “meta” tag: it is polluted with bad information.

The second problem is that there are billions of web pages. Any structured way for web developers to tell search engines what’s in their websites has got to scale to a nearly infinite degree. This means that it has to be easy, and it has to be easy to retroactively apply to already-existing web pages.

But if it’s too easy, then there is a risk of data pollution.

It’s a tense race, basically, between volume and honesty.

And it’s only going to get more tense as more websites are created and more print resources are digitized and opened for search (this includes books). As many billions of web pages as there are now, there will be exponentially more. Possibly a googolplex.

Most of which are authored by people with good intentions who want to get their information found and contextualized; many of which are authored by people who really, really want their web pages to show up in the top 3 or 4 results for a given search term; some of which are authored by people of dubious character and even more dubious motivation, who assign tags like “Britney” to websites about onion farming.

The way the web reflects the noise and bumptiousness of human nature never ceases to amaze me. Whatever structure we invent to organize our communications (and humans are ridiculous communicators), it will be sabotaged. But that structure is, nevertheless, what we have. Without it, we are even worse off.

6 thoughts on “Search is not complicated; it’s just hard”

  1. A search engine’s work is divided into two parts: finding a list of all pages that contain the search term and sorting that list of pages by “relevance”. The first part is (fairly) straightforward but the second part is very hard and is the subject of much controversy. Search engines rely heavily on the notions of “authority” and “trust” that are imparted by other pages onto your page through the use of linking in order to establish this rank.

    I’m interested to see how these algorithms will evolve to include notions beyond “pages” and “sites” to incorporate other mediums like books that are referenced by entities like people. For example, if a respected person tweets something about a particular book being the best on a certain subject, then how should that tweet be weighted when someone searches on the subject?

    1. Agreed – the second part is in fact where the value prop of each search engine lies.

      I am very, very anxious to see books brought online in organic search results, and am doing everything I can to make that happen. But I am one person.
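
The two-part division the comments describe – gather every page that contains the term, then sort by some notion of authority – can be made concrete with a toy sketch. This is purely illustrative (invented pages, invented links, incoming-link counts standing in for real authority signals), not how any production engine actually works:

    # Part one: an inverted index mapping each word to the pages containing it.
    # Part two: rank matches by a crude "authority" signal (incoming-link count).
    # All data below is invented.

    PAGES = {
        "a.example/onions": "growing onions in early spring",
        "b.example/element": "mercury the element is liquid at room temperature",
        "c.example/planets": "mercury is the planet closest to the sun",
        "d.example/spam": "britney onions mercury buy now",
    }

    LINKS = {  # who links to whom
        "a.example/onions": ["c.example/planets"],
        "b.example/element": ["c.example/planets"],
        "c.example/planets": [],
        "d.example/spam": [],
    }

    def build_index(pages):
        index = {}
        for url, text in pages.items():
            for word in text.split():
                index.setdefault(word, set()).add(url)
        return index

    def incoming_link_counts(links):
        counts = {url: 0 for url in links}
        for targets in links.values():
            for target in targets:
                counts[target] = counts.get(target, 0) + 1
        return counts

    def search(term, index, authority):
        matches = index.get(term, set())
        return sorted(matches, key=lambda url: authority.get(url, 0), reverse=True)

    authority = incoming_link_counts(LINKS)
    print(search("mercury", build_index(PAGES), authority))
    # c.example/planets, with two incoming links, outranks the other matches.

Everything interesting in a real engine, of course, lives in how that second step is computed – and gamed.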
