The joys of metadata normalization

The Canadiana Discovery Portal now contains digital collections metadata from a substantial number of contributors. When Canadiana receives metadata from contributors, we normalize it as best we can and convert it into our internal Canadian Metadata Repository (CMR) format, which we will be making public in the near future. It would be nice to think that we could simply write a handful of filters to convert from the most common formats (Dublin Core, MARCXML and so forth) to CMR, but the reality is that, even though most metadata follows one of these structural standards, the semantic content differs enough that we need contributor-specific and even collection-specific scripts.

The most common format we receive is unqualified Dublin Core. For simple descriptive and keyword-search information, such as title, creator, and description, DC metadata is straightforward enough. Control fields are harder: extracting a unique ID, language and media type codes, publication dates, and a URL depends on where the contributor has decided to put them, what conventions they follow, and how closely they follow them. So far, with a combination of XSLT and some post-transformation Perl hacking, we have managed to convert media types and document language designations into a standard set of codes with a surprisingly high success rate. Dates, where we can determine exactly which dc:date element is the publication date (or dates), can also be converted with high reliability, despite the variety of date formats we have encountered and the multi-century breadth of the collection.
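To give a flavour of this kind of post-transformation cleanup, here is a minimal sketch in Python rather than the Perl we actually use. The lookup table, the three-letter target codes, the function names, and the accepted date patterns are all assumptions made up for illustration; they are not our production rules, which cover many more cases.

```python
import re

# Hypothetical lookup table: contributor-supplied language designations
# mapped to three-letter codes (the target code set is an assumption).
LANGUAGE_CODES = {
    "en": "eng", "eng": "eng", "english": "eng",
    "fr": "fra", "fre": "fra", "fra": "fra", "french": "fra",
}

def normalize_language(value):
    """Return a three-letter code, or None if the designation is unknown."""
    return LANGUAGE_CODES.get(value.strip().lower())

# Matches YYYY, YYYY-MM or YYYY-MM-DD.
_ISO_DATE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")
# Fallback: any standalone four-digit year from 1500 to 2099 (assumed range).
_YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

def normalize_date(value):
    """Return YYYY, YYYY-MM or YYYY-MM-DD, or None if no date is found."""
    # Drop surrounding brackets, question marks and periods, as in "[1867?]".
    text = value.strip().strip("[]?.")
    match = _ISO_DATE.match(text)
    if match:
        return "-".join(part for part in match.groups() if part)
    match = _YEAR.search(text)
    return match.group(1) if match else None
```

Returning None for anything unrecognized, instead of guessing, makes it easy to collect the failures and decide whether a new rule is worth adding for a given contributor.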

With each collection we ingest, we are able to improve and unify our ingest scripts a little more. We seem to be approaching the point where most contributors' unqualified DC records can be converted to CMR using the same basic script, with just a small amount of per-contributor customization to specify where to find the metadata needed to generate the unique key, publication date, and URL to the resource. Because DC specifies only basic record structure, and even then leaves a lot open to interpretation, there is a limit to how well data can be normalized across or even within collections, but our early results seem pretty encouraging.
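The per-contributor customization described above could look something like the following Python sketch. The contributor name, the configuration layout, the sample record, and the function name are hypothetical, and the real scripts are written in XSLT and Perl; the point is only that one shared routine can be driven by a small table saying which DC element, and which occurrence of it, holds each control field.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set, in ElementTree's {uri}tag form.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical per-contributor settings: for each control field, the DC
# element to read and the (zero-based) occurrence of that element to use.
CONTRIBUTOR_CONFIG = {
    "example_library": {
        "key": ("identifier", 0),
        "url": ("identifier", 1),
        "pubdate": ("date", 0),
    },
}

def extract_control_fields(record_xml, contributor):
    """Pull the configured control fields out of one unqualified DC record."""
    root = ET.fromstring(record_xml)
    fields = {}
    for name, (element, index) in CONTRIBUTOR_CONFIG[contributor].items():
        values = [e.text.strip() for e in root.iter(DC + element) if e.text]
        # A missing occurrence yields None instead of raising an error.
        fields[name] = values[index] if index < len(values) else None
    return fields

# Made-up record for illustration only.
SAMPLE_RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample title</dc:title>
  <dc:identifier>abc-001</dc:identifier>
  <dc:identifier>http://example.org/item/abc-001</dc:identifier>
  <dc:date>1867</dc:date>
</record>"""

print(extract_control_fields(SAMPLE_RECORD, "example_library"))
```

Keeping the contributor-specific part as data means that adding a new contributor is mostly a matter of adding a table entry, with custom code needed only where a collection departs from the usual conventions.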