Persistent Identifier Evolution

Sat, 10/29/2011 - 12:53 -- Gerald de Jong

It's one of those things that everybody knows but doesn't talk about nearly enough. Before linking cultural heritage objects, people, places, times and events makes any real sense, each and every item must be afforded a permanent "location" where it can dependably be found. Unfortunately, getting there from here is more difficult than we might first imagine, because it is much more an ongoing process than a one-time thing.  There is no "Big Bang" for the introduction of most cultural heritage objects because there is no single place where they all originate.  By virtue of the way things have progressed up to now, it is often the case that digital cultural things have multiple "locations" already, so the only wise choice is to think in terms of process and to work towards the goal without making incorrect assumptions about the starting point.

Our thinking at Delving has evolved a lot so far, and we have resolved to make this issue one of the challenges that we don't overlook or take for granted.  We know that we will have to adopt a process which systematically reduces the number of URIs (Uniform Resource Identifiers) in use, since it is clear that the multiple-origin aspect of cultural heritage metadata will inevitably attribute multiple URIs to the same actual items.  The problem is by no means solved solely through the use of redirect-based persistent-identifier services like Handle, because although they can indeed play a central role, they do not represent the core of the process.  Paradoxical as it may seem, we have to think about how persistent identifiers can evolve.

To make the predicament clearer, let's look at a hypothetical scenario:

Two historical societies in the same province but separated by some distance in different cities are busy registering a stream of new cultural heritage artifacts and they know that it is important to identify things by pointing to lists rather than just typing in names which can be misspelled or misinterpreted.  Both have established the admirable discipline of building lists of authors or creators of works, and with every new object registered they attempt to choose from the existing list, only adding new members when necessary.  Some years later it becomes clear to both societies that their data should be unified for presentation on the internet so that people can explore the cultural heritage of the province as a whole.

Each of the societies has a list of creators, and it should be no surprise that there are some overlaps, since some of the creators moved from place to place during their careers.  Each member of each list has its own URI and since there are now two entries for some of the more famous people, these people can be said to have two persistent identifiers.  This is a natural consequence of the convergence of metadata, and we have to do something about it if we want to connect the works of a creator in one city with those in the other city.  Without this connection, any software-based navigation system will not be aware of the fact that the two entries in fact refer to the same person.

So this might be the way that we can solve the problem over time, but it requires a two-way street in terms of data flow:

The two lists are combined into one when the datasets converge at the level of their aggregator, and uniqueness is maintained by adding origin identification to the URI while merging them.  This would mean, first of all, that each historical society would have to ensure that their own name is somehow integrated into every unique identifier used (as a prefix, for example) in these lists of creators, and that information would have to be stored in the metadata records themselves in their respective registration systems.  So everyone has to change or add to the identifiers they are using to make them unique in the big wide world rather than just locally.
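As a rough illustration of the prefixing idea, here is a minimal Python sketch. The base address "http://example.org", the "/creator/" path, the society labels and the sample names are all hypothetical, chosen only to show how two local lists that both use the identifier "17" stop colliding once the origin is part of the URI.

```python
from urllib.parse import quote


def globally_unique_uri(base: str, society: str, local_id: str) -> str:
    """Build a URI that carries the originating society as a prefix.

    The path layout is an assumption made for this sketch, not a standard.
    """
    return (
        f"{base.rstrip('/')}/creator/"
        f"{quote(society, safe='')}/{quote(local_id, safe='')}"
    )


# Two local creator lists that happen to reuse the same local identifier.
society_a = {"17": "Jan de Vries"}
society_b = {"17": "Johanna Bakker"}

# Merging at the aggregator: the origin prefix keeps both entries distinct.
merged = {}
for society, creators in (("society-a", society_a), ("society-b", society_b)):
    for local_id, name in creators.items():
        uri = globally_unique_uri("http://example.org", society, local_id)
        merged[uri] = name
```

Both entries survive the merge under different keys, which is all that the prefix is meant to guarantee at this stage.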

Once the data is gathered together in one place, automated systems could try to find matches between different members of the list based on similarity of names and other associated data if it is available.  Matches can be found that way, but they cannot quite be considered trusted until a human domain expert has examined and approved them.  For this they will need a streamlined (not annoying) workflow and they will have to consider this kind of thing part of their jobs.
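One possible shape for the automated step is sketched below, using the string similarity from Python's standard difflib module. The threshold of 0.85, the normalization rule and the record layout are assumptions for illustration; a real matcher would probably also weigh dates, places and other associated data. Note that every match comes out marked "proposed", since nothing here is trusted until a domain expert has approved it.

```python
from difflib import SequenceMatcher
from itertools import product


def normalize(name: str) -> str:
    """Lowercase the name, drop commas and collapse runs of whitespace."""
    return " ".join(name.lower().replace(",", " ").split())


def candidate_matches(list_a: dict, list_b: dict, threshold: float = 0.85) -> list:
    """Compare every creator in list_a with every creator in list_b.

    Both arguments map URI -> creator name. Pairs whose name similarity
    reaches the threshold are returned as proposals, not as facts.
    """
    candidates = []
    for (uri_a, name_a), (uri_b, name_b) in product(list_a.items(), list_b.items()):
        score = SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()
        if score >= threshold:
            candidates.append(
                {"a": uri_a, "b": uri_b, "score": round(score, 2), "status": "proposed"}
            )
    return candidates
```

Comparing every pair is fine for lists of a few thousand creators; larger lists would need some way of narrowing down the pairs worth comparing.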

Also, once the combined data is presented online in a system which empowers users with UGC (user-generated-content) capabilities, it may be that citizens with laptops on their kitchen tables become an excellent source of these found matches.  Similarly to the automated case described above, the ultimate judgement of whether the matches are correct lies in the hands of the curator of the original dataset. Put another way, the user or the automatic data "cultivator" will offer an opinion about a match, but the organization responsible for the data can give it the more trusted opinion which starts the process of URI unification.

Making these matches concrete recorded things is called co-referencing, and it results in the accumulation of links among things.  It is important to keep in mind that a link is not made by any Creator of the Universe, but rather it will be made by a person or a clever automated process.  In both cases we would be unwise to ignore the source, so it becomes an essential element of each co-reference.
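To make the point about sources concrete, a co-reference could be recorded along the following lines. This is only a sketch: the field names and the three kinds of source ("automated", "user", "curator") are assumptions drawn from the scenarios described above, not an established vocabulary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CoReference:
    """A recorded claim that two URIs identify the same thing."""

    uri_a: str
    uri_b: str
    asserted_by: str    # the person, organization or program making the claim
    source_kind: str    # "automated", "user" or "curator" (assumed categories)
    asserted_on: date

    def is_trusted(self) -> bool:
        # Only the organization responsible for the data gives the trusted opinion.
        return self.source_kind == "curator"


suggestion = CoReference(
    uri_a="http://example.org/creator/society-a/17",
    uri_b="http://example.org/creator/society-b/4",
    asserted_by="name-similarity matcher",
    source_kind="automated",
    asserted_on=date(2011, 10, 29),
)
```

Because the source travels with the link, an automated suggestion and a curator's approval of the same pair can sit side by side, and only the latter needs to start the process of URI unification.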

Somehow the data infrastructure must be able to coalesce URIs and reduce their redundancy.  Also, we have to ask where it is that the new URIs are to be located and maintained.  We believe that the URIs should be located as close as possible to their original owners, so this means that the data providers will have to be prepared to accept new URIs for some of their metadata.  This is the two-way street mentioned above, and it can be done in the least painful way possible.  Each historical society in the example would have to adopt a new URI for something they used to consider their own, although they by no means have to immediately throw away their own identifier.  The new URI can be stored beside the old one until such time as all references to it can be safely updated.  It would make sense to identify the new URI as being "primary" in some way so that the connections to and from the outside world are established.
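On the data provider's side, keeping the old identifier beside the new one might look something like this sketch. The class and its field names are hypothetical; the only point is that adopting a new primary URI does not throw the old one away, so existing references keep working while they are gradually updated.

```python
from dataclasses import dataclass, field


@dataclass
class CreatorRecord:
    """A creator entry as it might sit in a society's registration system."""

    name: str
    primary_uri: str
    former_uris: list = field(default_factory=list)

    def adopt_primary(self, new_uri: str) -> None:
        """Accept a unified URI as primary, keeping the old one alongside it."""
        if new_uri == self.primary_uri:
            return
        self.former_uris.append(self.primary_uri)
        self.primary_uri = new_uri

    def answers_to(self, uri: str) -> bool:
        """True if the URI is the primary one or one still kept beside it."""
        return uri == self.primary_uri or uri in self.former_uris


record = CreatorRecord("Jan de Vries", "http://example.org/creator/society-b/4")
record.adopt_primary("http://example.org/creator/society-a/17")
```

After the call, the record presents the unified URI to the outside world while still recognizing the identifier the society originally assigned.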

So the challenges that lie before us are worthy of a thorough exploration:

Persistent URIs for the things of cultural heritage are not a given and not solved by redirects alone. They are not something that we necessarily already have and can just exploit, but rather they are a goal towards which we have to strive, and we would be wise to carefully review our assumptions and be prepared to build infrastructure that acknowledges the challenge rather than obfuscates it.  We have to think about learning to live with co-referencing and we have to work on building the two-way street for metadata and the streamlined curatorial workflow that will make data improvement a pleasant part of the day for cultural heritage professionals.  It's more of a direction than a destination.