LJNDawson

Book publishing. And everything else.


Identifiers in everyday life

I talk a lot about identifiers. It’s my job. The esoteric identifiers – DOIs, ISNIs, ISTCs. The pragmatic ones – ISBNs. The other day I found myself in a meeting referring to URIs while the developers were talking about URLs (this is how you know you are either a geek or a purist jerk, or both – yeah, for 15 minutes I was “that guy”).

But outside of work, there are plenty of identifiers in our everyday lives – with varying degrees of “smartness” and “dumbness”. We’re quite comfortable with these, because we’ve grown up with them, and have to use them all the time, but when it comes to Big Data, they’re no different than any of the other numbers we talk about.

Social Security numbers are a good start. The first three digits – the “area number” – historically indicated the state where the SSN was issued. The next two are the “group number”, which partitions the last four digits – the “serial number”, issued sequentially within each group. However! Some states were running out of numbers. So in 2011, the Social Security Administration began randomizing assignment – new SSNs no longer encode geography at all.

Phone numbers are another example of this. The first three digits are the area code. The next three are the “exchange” – the local switching office serving the caller. (Long ago, telephone exchanges were actually letters the caller would tell the operator, such as BUtterfield 8.) The last four digits identify the individual line within that exchange. However! Several phenomena have disrupted this system entirely. One is the rise of phone banks – the sheer quantity of telephone numbers that needed to be assigned to them meant that new area codes had to be created. The second is (or, rather, was) the fax machine. Assigning a separate phone line to each fax machine also ate up numbers. The third, of course, is cell phones. This caused the greatest disruption of all – over time, people wanted to keep their phone numbers regardless of where they lived. (My phone has an area code of 917, which used to mean Manhattan; it was assigned in 1997 when I lived in Brooklyn and worked in Manhattan – sixteen years later, I have maintained the same number even though I live on Staten Island and work in New Jersey.) Now phone numbers are essentially meaningless.
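Both formats are positional – whatever meaning there is lives entirely in where the digits sit. A minimal sketch of that structure in Python (the field names are mine; the sample numbers are the famous voided SSN specimen and a phone number from the range reserved for fictional use):

```python
def parse_ssn(ssn: str) -> dict:
    """Split a pre-2011 Social Security number into its positional fields."""
    digits = ssn.replace("-", "")
    return {
        "area": digits[0:3],    # historically tied to the state of issue
        "group": digits[3:5],   # the "group number"
        "serial": digits[5:9],  # issued sequentially within the group
    }

def parse_phone(phone: str) -> dict:
    """Split a ten-digit North American phone number into its fields."""
    digits = "".join(c for c in phone if c.isdigit())
    return {
        "area_code": digits[0:3],  # once geographic, now essentially arbitrary
        "exchange": digits[3:6],   # the old BUtterfield-8-style prefix
        "line": digits[6:10],
    }

print(parse_ssn("078-05-1120"))      # the voided Woolworth specimen number
print(parse_phone("(917) 555-0123"))
```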

There are plenty of others – driver’s license numbers, passport numbers, license plates, EZ-Pass numbers, bar codes, numbers on shipping containers, Apple UUIDs. And with the Internet of Things, there will only be more. As they proliferate, and as our circumstances change, the prefixes of these numbers will have less and less meaning inherent in them. Which is not a bad thing – identifiers are best when they are dumb. All they mean to say, of course, is “this thing is not that thing”.
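The UUID is the limiting case: a purely dumb identifier in which no position means anything. Generating one in Python, just for illustration:

```python
import uuid

# A random (version 4) UUID embeds no geography, no date, no sequence -
# it asserts nothing beyond "this thing is not that thing".
print(uuid.uuid4())  # different on every run
```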

Inside-out

Lorcan Dempsey, whose team at OCLC I admire like no other, had a great post last week that has stuck with me. I frequently say “shop=search” – that the storefront for media is essentially the Google search box. Lorcan describes this with regard to libraries, who do an amazing job curating and aggregating, but for a specific audience – library patrons and librarians. However, as he says,

[A]ccess and discovery have now scaled to the level of the network: they are web scale. If I want to know if a particular book exists I may look in Google Book Search or in Amazon, or in a social reading site, in a library aggregation like Worldcat, and so on. My options have multiplied and the breadth of interest of the local gateway is diminished: it provides access only to a part of what I am potentially interested in. As research and learning information resources have become abundant in this environment, the library collection and its discovery systems are no longer the necessary gateway for library users. While much of the discovery focus of the library is still on those destination or gateway systems which provide access to its collection, much of their users’ discovery experience is in fact happening elsewhere.

Second, the institution is also a producer of a range of information resources: digitized images or special collections, learning and research materials, research data, administrative records (website, prospectuses, etc.), faculty expertise and profile data, and so on. How effectively to disclose this material is of growing interest across libraries or across the institutions of which the library is a part. This presents an inside-out challenge, as here the library wants the material to be discovered by their own constituency but usually also by a general web population.

These factors shift the discoverability challenge significantly. The challenge is not now only to improve local systems, it is to make library resources discoverable in other venues and systems, in the places where their users are having their discovery experiences. These include Google Scholar or Google Books, for example, or Goodreads, or Mendeley, or Amazon. It is also to promote institutionally created and managed resources to others. This involves more active engagement across a range of channels.

This is an amazing articulation of a fundamental problem – libraries have extremely rich assets, and they are proprietary and in many cases closed, unavailable on the open web. Which is fine…but you can’t even see that they exist. Which is not fine. The web would be enormously enhanced if we could see what information is actually available (even if you have to set up credentials to log in, even if you have to pay).

It’s more than a question of engagement, and Lorcan is quite right to include the word “active” – it requires work. It requires structuring the resources so that they can be mapped effectively to one another, or to bridge systems that serve as Rosetta stones, enabling a user to move relatively easily from one resource to another, or to search many resources simultaneously (the holy grail of “federated search” will probably turn out to be Google. Who knew?).

There are a lot of us who worked on these problems in other arenas – in my case, commercial bookselling on the web – who have a tremendous amount of experience that could be brought to solving this new iteration of issues. I’m really happy to be working on the bibliographic extension of Schema.org, with Richard Wallis, but there’s so much more work that needs to be done.
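As a concrete sketch of what that structuring can look like: schema.org describes a book in machine-readable terms that web-scale search engines already consume. Here is a minimal schema.org “Book” description, generated as JSON-LD from Python – every value below is a placeholder I made up, not a real catalog record:

```python
import json

# A minimal schema.org "Book" description, serialized as JSON-LD - the sort
# of structured data that lets a library record surface in web-scale search.
# All values here are placeholders, not a real record.
book = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "An Example Title",
    "author": {"@type": "Person", "name": "Jane Example"},
    "isbn": "9780000000000",
    "url": "https://example.org/catalog/record/123",
}

# Embed the output in a page inside <script type="application/ld+json"> tags.
print(json.dumps(book, indent=2))
```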


Search is not complicated; it’s just hard

If my years in and around relational databases have taught me anything, it’s that a search can only retrieve what the search engine is looking for.

Marked-up repositories (such as the entirety of what Google indexes and provides search results on), while very different from relational databases in many important ways, are not so different here.

Search engines search for what they have been told (in advance) will be there. They don’t search for what’s missing. (A feature we could use, quite frankly, but I wouldn’t begin to know how to architect it.) When the web was small, search engines just searched the contents of the web pages that were out there, and built algorithms to help with relevancy and context (Mercury the planet vs. mercury the element, for example). As the web grew, it became clear that the authors of web pages had to tell search engines what was in those pages – because searching the raw data was proving too time-consuming, and was generating inconsistent results (not good if you are trying to sell advertising alongside those results). Keywords for websites would help bolster those relevancy and context algorithms, adding weight to certain content – and if keywords weren’t used, those pages would find themselves at the end of a very long list of pages. Web developers used to do this with the HTML “meta” tag.
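For the record, that approach looked something like this – a sketch, using Python’s standard-library HTML parser, of reading the keywords a page claims for itself (the sample page is invented):

```python
from html.parser import HTMLParser

SAMPLE_PAGE = """
<html><head>
  <title>Onion Farming Basics</title>
  <meta name="keywords" content="onions, farming, agriculture">
</head><body>...</body></html>
"""

class KeywordExtractor(HTMLParser):
    """Collect the contents of <meta name="keywords"> tags."""
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "keywords":
            self.keywords.extend(
                k.strip() for k in attrs.get("content", "").split(",")
            )

extractor = KeywordExtractor()
extractor.feed(SAMPLE_PAGE)
print(extractor.keywords)  # ['onions', 'farming', 'agriculture']
```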

There are two problems with this approach.

The first problem is that developers sometimes lie. These developers insert “meta” tags with keywords that have nothing to do with the topic of the web page, in the hopes of attracting traffic. This is why search engines have, for some time, ignored the keywords “meta” tag – it is polluted with bad information.

The second problem is that there are billions of web pages. Any structured way for web developers to tell search engines what’s in their websites has got to scale to a nearly infinite degree. This means that it has to be easy, and it has to be easy to retroactively apply to already-existing web pages.

But if it’s too easy, then there is a risk of data pollution.

It’s a tense race, basically, between volume and honesty.

And it’s only going to get more tense as more websites are created, and more print resources are digitized and opened for search (this includes books). As many billions of web pages as there are now, there will be exponentially more – possibly a googolplex.

Most of which are authored by people with good intentions who want to get their information found and contextualized; many of which are authored by people who really really want their web pages to show up in the top 3 or 4 for a given search term; some of which are authored by people of dubious character and even more dubious motivation, who assign tags like “Britney” to websites about onion farming.

The way the web reflects the noise and bumptiousness of human nature never ceases to amaze me. Whatever structure we invent to organize our communications (and humans are ridiculous communicators), it will be sabotaged. But that structure is, nevertheless, what we have. Without it, we are even worse off.

We Interrupt This Lack of Blog Posts

I’ve been experimenting. As one might imagine, I’ve been experimenting with metadata and identifiers. On this blog.

Links to books now have DOIs. I’m kinda proud of that. I don’t know what it’ll get me – that’s the experiment. A DOI identifies a thing and resolves to the place on the web where it currently lives. So the ISBNs or ISTCs of the books I’m reading link to a location on Google Sites that I’ve set up. The Google Sites pages link onward to Wikipedia entries, Amazon product pages, and other sites of interest.
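Resolution goes through the doi.org proxy, which redirects to wherever the thing currently lives – that indirection is the whole point. A quick sketch (the DOI below is a placeholder, not a real registration; substitute one that exists):

```python
import urllib.request

# doi.org redirects a DOI to the current location of the thing it names.
# "10.1000/example" is a placeholder DOI, not a real registration.
doi = "10.1000/example"
req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.url)  # the URL the DOI currently resolves to
```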

I’ve also been experimenting unsuccessfully with semantic markup. I say unsuccessfully because this blog is hosted on WordPress.com – markup plugins exist for WordPress, but I can’t install them while I’m still on the hosted WordPress domain. (This will change in a bit – at which point I will give my site a DOI, and if you use THAT instead of the URL, you’ll always be able to find the site even when the URL changes.)

Why semantic markup? Because I am excited about proving the utility of rich snippets. For book publishers and retailers, this is going to be a critical way of differentiating your products from the mass of products that’s out there. God help me, I can’t find any pilot partners in the book industry, so I figure I’ll just do it myself. (If any adventurous publisher or book retailer wants to volunteer for a pilot, you know where to reach me.)

So that will happen, when I can get more time.
