With a title like Metadata Librarian, you would expect I know a lot about library data: the standards, the structures, and the use of data in library technology. I do! But I also have days where the data I know so much about becomes a beguiling and frustrating adversary.
One aspect of my work is getting descriptive metadata out of our systems so we can use it for purposes other than search and display. These uses might be for reports, to investigate certain issues, or sometimes just to make pretty pictures. Getting the data out is frequently more troublesome than I ever expect it to be.
By this point, I have a pretty solid grasp of the coding tools, APIs, and data feeds that make the data accessible, but whenever one of these projects comes up, I have a habit of under-committing on the time it takes to get usable data together. This causes stress for me, and likely annoyance for my colleagues, as I struggle to deliver the information in a timely manner.
What makes this so difficult? On paper, I have everything I need. The trouble is that the perfect data project relies on assumptions that typically aren't true.
Data gets messy. People make typos. Software might add unnecessary formatting information. System migrations introduce a whole cavalcade of unexpected issues.
Here are some of the note labels we have in our ArchivesSpace finding aid database:
Here you can see messy data caused by inconsistent use of, and typos in, note labels. If I were trying to gather all of the notes with biographical information in them, I would have to target at least three different labels.
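As a rough illustration of the problem, here is a small sketch of how I might start grouping near-duplicate labels together. The labels below are hypothetical stand-ins, not our actual data, and the `normalize` helper is my own invention; note that simple case and whitespace folding catches some variants but not typos, which is why multiple targets are still needed.

```python
from collections import defaultdict

# Hypothetical note labels: the same idea spelled several ways.
labels = [
    "Biographical Note",
    "Biographical note",
    "Bographical Note",   # typo -- normalization won't catch this one
    "Scope and Contents",
    "Scope and Content",
]

def normalize(label):
    """Collapse case and extra whitespace so near-duplicates group together."""
    return " ".join(label.lower().split())

groups = defaultdict(list)
for label in labels:
    groups[normalize(label)].append(label)

for key, variants in sorted(groups.items()):
    print(key, "->", variants)
```

Even after normalizing, the typo survives as its own group, so a real cleanup pass still involves eyeballing the results.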
Unfamiliar data structures
Speaking of note labels: in my work, I made the assumption that every note would have a label by which it could be identified. That made sense, given that all the examples I was reviewing used note labels.
The old adage about assumptions rings true when the assumptions you're making are about an unfamiliar data structure. This came to light when data we could clearly see in the database was not appearing in the final exported data result. It turns out the note label was an optional parameter of the note data block. If I really wanted to identify and collect all notes of a specific type, I needed to target the required note type parameter instead.
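The fix can be sketched like this. I'm assuming ArchivesSpace-style note JSON, where each note carries a required `type` value (such as `bioghist`) while `label` is optional; the sample notes themselves are made up.

```python
# Hypothetical notes in the shape ArchivesSpace exports:
# "type" is required, "label" is optional and sometimes absent.
notes = [
    {"type": "bioghist", "label": "Biographical Note",
     "subnotes": [{"content": "Founded in 1902..."}]},
    {"type": "bioghist",  # no label at all -- this one was slipping through
     "subnotes": [{"content": "The papers of..."}]},
    {"type": "scopecontent", "label": "Scope and Contents",
     "subnotes": [{"content": "Correspondence and..."}]},
]

# Matching on the optional label misses unlabeled notes...
by_label = [n for n in notes if n.get("label") == "Biographical Note"]

# ...while matching on the required type catches them all.
by_type = [n for n in notes if n["type"] == "bioghist"]

print(len(by_label), len(by_type))  # prints: 1 2
```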
Inconsistent data model
Finally, I want to introduce the concept of the data model. A data model is an expression of the way data is organized, and how pieces of data relate to one another within the system. In ArchivesSpace, the text content of a note is contained in a content parameter on a subnote child of the full note. It's an odd arrangement, but one that can be managed so long as it's applied consistently.
You might see where this is going. Most notes keep their text content in a child subnote, but collection abstract notes (along with a handful of others) store their text content right on the parent note. Managing these inconsistent scenarios adds a level of complication, and a greater investment of time, than I initially anticipated.
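Handling both shapes ends up looking something like the sketch below. This is a minimal illustration under my reading of the structure described above (a `content` list directly on the parent for abstracts, a `content` value on each subnote otherwise); `note_text` and the sample notes are hypothetical.

```python
def note_text(note):
    """Collect a note's text, whether it lives on the note itself
    (e.g. abstracts) or on child subnotes (most other note types)."""
    if "content" in note:
        # Parent-level content, as with collection abstracts.
        return list(note["content"])
    # Otherwise, gather content from each child subnote.
    return [sub["content"]
            for sub in note.get("subnotes", [])
            if "content" in sub]

abstract = {"type": "abstract", "content": ["A short summary."]}
bioghist = {"type": "bioghist",
            "subnotes": [{"content": "Founded in 1902."},
                         {"content": "Dissolved in 1950."}]}

print(note_text(abstract))  # ['A short summary.']
print(note_text(bioghist))  # ['Founded in 1902.', 'Dissolved in 1950.']
```

The annoying part isn't writing this function; it's discovering, one failed export at a time, which note types need which branch.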
This is just a small subset of the possible pitfalls I encounter when programmatically accessing and formatting data. With large data sets, it's not possible to check every data point for possible issues, so the process ends up being iterative rather than linear.
There's no straight flow from the data source to the data product. It's a rinse-and-repeat process of generating a data output, checking it and refining for issues, and starting from the beginning again. I want the process to be fast and linear, but I need to remember, and budget for, the more time-intensive cycles of iteration.