All posts by Matthew Miguez

The Year Without a Homecoming Queen

Serious drama began unfolding in October 1972 in the lead-up to Florida State University’s homecoming week. As reported in the Florida Flambeau, students had an unconventional choice for Homecoming Queen: Ron Shank.

Mr. Shank’s candidacy for homecoming queen stirred up plenty of controversy. Administrators balked. Rallies were held in his support. Lawyers were consulted. In response to Homecoming Committee chair Dr. Marshall Hamilton’s comments that Shank was destroying the dignity of Homecoming, a commenter wrote:

On the contrary, Dr. Hamilton, we think Ron might enhance the dignity of Homecoming. He certainly appeared dignified in the picture in yesterday’s Flambeau. Such pomp and outright elan–if you will–certainly couldn’t harm the ceremony.

One person even wrote in to express his fears that a racist conspiracy was afoot, with Mr. Shank merely being used as an excuse to cancel the entire homecoming court, which had had African-American Queens the previous two years.

Conspiracy or no, that’s exactly what happened. The Homecoming Committee declined to have a Homecoming court for 1972, and University President Stanley Marshall elected not to get involved.

Let’s close with some words from the man himself:

I ran for Homecoming Queen because I don’t believe in parading Women (or Men) around on a stage under the auspices of a beauty contest when physical beauty is only a minor part of a person’s true beauty and definitely not the sole criteria of one’s worth.

Read all about the controversy in the Florida Flambeau in the DigiNole: Digital Library.

Protecting digital material: Strategies for digital preservation

In the first post in this digital preservation series, I shared some of the unique challenges digital material brings to the preservation game. In this one we will look at some of the technologies and tools digital stewards employ to protect our digital assets.

How can you tell when a computer file has been corrupted? If you try to open it, funny, glitchy things might happen. How can you test whether a digital file is uncorrupted? This requires a bit more thought. Digital files are, at their base level, a long string of 1’s and 0’s. This is called the file’s bitstream. Preservationists could compare one bitstream to an earlier copy of it, but this requires a lot of processing power for large files, with no guarantee that your comparison copy isn’t also corrupted.

This is where checksums can help us out. Checksums are character strings generated by a class of algorithms called hash functions; those designed to resist tampering are called cryptographic hashes. Hash functions are used to protect lots of things. Passwords submitted to websites, for example, are usually stored as hashes rather than as plain text.

Hash functions can also be applied to the bitstream of a file. Due to the nature of the various algorithms used, even a single change in a one or zero will produce a drastically different checksum. If a digital steward produces a checksum for the bitstream at the beginning of the preservation process, she can later test for data integrity by rerunning the hash and comparing that output to the original checksum.
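The integrity test described above can be sketched in a few lines of Python. This is only an illustration: the choice of SHA-256, the helper names checksum and flip_bit, and the sample bytes standing in for a file's bitstream are all assumptions made for the example, not part of any particular preservation system.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 checksum of a bitstream as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of the data with exactly one bit flipped (simulated damage)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical stand-in for the bitstream of a file.
original = b"Florida Flambeau, October 1972"

# Recorded once, at the beginning of the preservation process.
stored_checksum = checksum(original)

# Later: rerun the hash and compare it with the stored checksum.
damaged = flip_bit(original, 0)
print("intact copy passes: ", checksum(original) == stored_checksum)
print("damaged copy passes:", checksum(damaged) == stored_checksum)
```

For a large file on disk, the same comparison works by feeding the file to the hash object in chunks rather than loading it all at once.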

Now that we can test for unwanted changes in computer files, how can we ensure we always have a valid copy of a file? A system called LOCKSS can help with this. LOCKSS stands for Lots Of Copies Keeps Stuff Safe. Similar to the idea of backing up personal files, LOCKSS will duplicate the files given to it and then distribute copies of files across several servers. The idea is to spread the system out over many servers in diverse geographic areas to minimize the risk of a single disaster (natural or otherwise) compromising the entire system. These distributed copies are then regularly hashed, and the checksums compared to test the validity of the files. If a checksum comparison fails, that server can delete its failing copy of the file, and ask the other servers for a new one.
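The audit-and-repair idea can be illustrated with a toy model. The sketch below is not the LOCKSS software or its actual protocol between servers; it assumes a simple majority vote over SHA-256 checksums, and the function audit_and_repair and the server names are hypothetical.

```python
import hashlib
from collections import Counter


def checksum(data: bytes) -> str:
    """Return the SHA-256 checksum of a copy as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Hash every copy, find the checksum most servers agree on,
    and replace any disagreeing copy with a copy from the majority.

    Returns the names of the servers whose copy was replaced.
    """
    sums = {server: checksum(data) for server, data in replicas.items()}
    majority_sum, votes = Counter(sums.values()).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority: cannot tell which copy is valid")

    good_copy = next(
        data for server, data in replicas.items() if sums[server] == majority_sum
    )
    repaired = []
    for server in replicas:
        if sums[server] != majority_sum:
            replicas[server] = good_copy  # discard the failing copy, fetch a new one
            repaired.append(server)
    return repaired


# Three hypothetical servers; one copy has been silently damaged.
replicas = {
    "server-a": b"homecoming 1972",
    "server-b": b"homecoming 1972",
    "server-c": b"homecoming 1973",
}
print("repaired:", audit_and_repair(replicas))
```

The majority requirement is the reason lots of copies matter: with only two copies, a mismatch shows that something is wrong but not which copy to trust.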

Digital preservation is a rapidly developing field. New challenges requiring new solutions arise every day. In the third and final post in this digital preservation series, I’ll discuss activities you can undertake to protect your personal digital heritage.

Digital Preservation: New Material, New Challenges

In this digital world we are increasingly creating, storing, and publishing material entirely in electronic forms. While this is great for the trees and other resources used in making paper, it introduces new challenges in the process of collecting and preserving materials.

The preservation needs of paper are pretty well understood. Guidelines for ideal environments (heat and humidity) and practices (handling and storage) have been in constant refinement for hundreds of years. The library, archives, and information science communities only began thinking about preservation for digital material comparatively recently. This first post of a three-part series on digital preservation will take a look at the challenges unique to preserving digital materials, and why we must approach digital preservation differently than physical preservation.

Old hardware stored at the Claude Pepper Library.

What might be surprising to many is the relative fragility of digital assets. Estimates put the average operational life of conventional digital storage media at five years. These failures occur in more than just the physical components: magnetic media are sensitive to anything generating a magnetic field, from batteries to the sun! Optical discs can suffer from manufacturing errors or material degradation, making them unreadable. Additionally, once damaged, a digital resource is often completely lost. Physical material might be salvaged through conservation. Recovering digital assets after damage is much more difficult.

Complicating the practice of digital preservation is the fact that digital materials are meaningless without the correct hardware and software environments to render them. Consider a printed book. The information conveyed by a book is encoded with ink marks made on paper. So long as the rules of the encoding language (that is, the language it is written in) and the marks on the paper persist, the information in the book can be recalled. The ink won’t independently leave the paper and reorganize into different patterns and structures.

This is exactly what happens to digital information. The long strings of characters encoding digital assets are only intelligible to a narrow band of both software and specific hardware configurations. Many of us have likely encountered the situation of being unable to open an old file in a newer version of software. Software developers are constantly adding and removing features to their products, often with little attention to backwards compatibility. Merely storing the digital encoding (or bitstream) is meaningless without also storing instructions on how to rebuild it back into an understandable, rendered product.

These extra considerations compound when you consider the speed of technological advances, and the new behaviors and interactive experiences we’re building and sharing with our machines and networks. Even identifying what behaviors and functions of digital assets are important to intellectual understanding of the resource is a quagmire. Those of us thinking about digital preservation have ceded a pretty large head-start, and the race is constantly accelerating.

In the next posting of this blog series, I’ll cover some strategies currently being used by the digital preservation community. I’ll finish this series with a post on what you can do yourself to safeguard your digital works and memories.

Teaching Machines to Read

Written language is a persistent tool for information communication. Many cultures develop it somewhere during their history. A written language allows groups to share records, instructions, and cultural heritage across many years (some examples of written language held by Special Collections & Archives date to 2000 BCE). The modern world still relies heavily on written language for the transmission and storage of data, information, and knowledge. Academic communication is still largely conducted through the world’s written languages, and the internet is an incredibly complex mechanism for moving text between computers around the world, whether that text be in the form of tweets, emails, or webpages.

Over this long history humanity has refined written languages to indicate the nuances of the data being transferred or stored. Bibliographic information (such as title and author) is often set apart from other text, whether that be physically as a title page in a book or typographically with differences in type sizing and decoration. We as readers have learned that large type placed above bodies of smaller type is usually a title. Likewise we recognize the white-space in prose and poetry as paragraph, line, or stanza breaks.

These intuitions are not explicit to computers. Strides have been made in natural language processing (as this focus of computer science is called), but it is not perfect enough for robust analysis. The Text Encoding Initiative Consortium (TEI-C) is a group of scholars and institutions interested in joining the computational powers of modern computers with the study of “manuscripts, research papers, historical archives, early printed books, linguistic corpora, anthologies, critical editions, ancient inscriptions, and a wealth of other literary, historical, and cultural material”.

TEI encoded text

To this end they developed and continue to refine the Text Encoding Initiative Guidelines (TEI). TEI is a suite of technologies designed to make explicit to computers what is implicit to us as readers. The fundamental part of TEI is a markup language expressed through eXtensible Markup Language (XML). TEI encoding is similar in form to HTML. A key difference is that HTML provides a computer with instructions for text display, while TEI provides instructions for text comprehension.

Visualization of the changes to a Burroughs sentence through drafts.

By providing a computer with a marked-up text, and the rules for precisely interpreting the markup, we let the computer access a text in a way similar to our intuition as human readers. This level of access also allows a computer to process a text quickly in ways that take humans a long time and a lot of intellectual effort. Traditional bibliographic tools such as indexes and concordances (which can take humans months or even years to create by hand) can be completed in seconds by a computer with a properly encoded text.
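As a rough illustration of how markup makes this possible, the Python sketch below reads a tiny TEI-style fragment and builds a simple index of words by line number. The fragment is invented for this example and omits the header a complete TEI document requires; the function name word_index is likewise an assumption. The element names head, lg, and l and the TEI namespace follow the TEI Guidelines.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment: a title (head) and one stanza (lg) of numbered lines (l).
SAMPLE = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <head>An Invented Example</head>
      <lg type="stanza">
        <l n="1">The ink stays on the page</l>
        <l n="2">The page stays in the book</l>
      </lg>
    </body>
  </text>
</TEI>
"""


def word_index(tei_xml: str) -> dict[str, list[str]]:
    """Map each word found in a verse line to the numbers of the lines containing it."""
    root = ET.fromstring(tei_xml)
    index = defaultdict(list)
    for line in root.iterfind(".//tei:l", TEI_NS):
        line_no = line.get("n")
        text = "".join(line.itertext())
        for word in re.findall(r"[a-z]+", text.lower()):
            if line_no not in index[word]:
                index[word].append(line_no)
    return dict(sorted(index.items()))


# The markup states outright which text is a title and which is a line of verse.
titles = [head.text for head in ET.fromstring(SAMPLE).iterfind(".//tei:head", TEI_NS)]
print("titles:", titles)
for word, line_numbers in word_index(SAMPLE).items():
    print(word, line_numbers)
```

Because the title is tagged as a head rather than a line, its words stay out of the index without any guessing about type size or position on the page.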

Calendar of the Week July 12 1905
Excerpt from the Egyptian Gazette.

TEI has become a valuable tool in the developing field of Digital Humanities. Florida State University’s (FSU) faculty is using it to open texts up to new kinds of analysis. Dr. Will Hanley, Assistant Professor in FSU’s Department of History, uses TEI to analyze persons and places in colonial-era Egyptian newspapers. TEI has the capacity to assign unique identifiers to personal and place names. Doing this for several regional newspapers over several years allows a scholar to extract names and contextual information from the texts. Techniques from network and graph theory can then be applied to the data, uncovering insights that were buried.
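A small sketch can show how identifiers turn tagged names into network data. The fragment, the names, and the identifiers below are all invented, and this is not Dr. Hanley's actual encoding or workflow; it only assumes the TEI persName and placeName elements with a ref attribute, and counts how often two identifiers appear in the same paragraph.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented newspaper-style fragment; every name and identifier is hypothetical.
SAMPLE = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><persName ref="#pers1">Mr. A. Example</persName> arrived in
       <placeName ref="#place1">Alexandria</placeName> on Tuesday.</p>
    <p><persName ref="#pers2">Mrs. B. Sample</persName> and
       <persName ref="#pers1">Mr. Example</persName> left
       <placeName ref="#place1">Alexandria</placeName> for
       <placeName ref="#place2">Cairo</placeName>.</p>
  </body></text>
</TEI>
"""


def cooccurrence_edges(tei_xml: str) -> Counter:
    """Count how often each pair of identifiers is mentioned in the same paragraph."""
    root = ET.fromstring(tei_xml)
    edges = Counter()
    for paragraph in root.iterfind(".//tei:p", TEI_NS):
        ids = set()
        for tag in ("persName", "placeName"):
            for element in paragraph.iterfind(f".//tei:{tag}", TEI_NS):
                ref = element.get("ref")
                if ref:
                    ids.add(ref)
        # Each unordered pair of identifiers becomes a weighted edge.
        for a, b in combinations(sorted(ids), 2):
            edges[(a, b)] += 1
    return edges


for (a, b), weight in sorted(cooccurrence_edges(SAMPLE).items()):
    print(a, b, weight)
```

The identifier is what lets "Mr. A. Example" and "Mr. Example" count as the same person; the resulting weighted pairs could then be handed to a graph analysis tool.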

Dr. Stanley Gontarski (Robert O. Lawton Distinguished Professor of English) and Dr. Paul Ardoin (Ph.D. FSU 2014) have used TEI to examine the works of Samuel Beckett and William S. Burroughs. Literary analysis is typically done on final published works. For some writers, though, it is difficult to establish a simple canonical version of a text. William S. Burroughs, for example, published several different versions of the same text over the length of his career. TEI is used in this case to examine a text as it moves through all of its different versions: manuscripts, typescripts, and variant published versions. Each version serves as a witness to a larger critical edition constituting the full body of the work. An outcome of this work is the Burroughs Archive.
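To make the idea of witnesses concrete, here is a hedged sketch using the TEI app and rdg elements, in which each variant reading names the witnesses that carry it in a wit attribute. The sentence and the witness labels (#ms, #ts, #pub1) are invented and are not drawn from Burroughs or from the Burroughs Archive; the helper witness_text is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sentence with two points of variation across three hypothetical
# witnesses: a manuscript (#ms), a typescript (#ts), and a published text (#pub1).
SAMPLE = """\
<p xmlns="http://www.tei-c.org/ns/1.0">The narrator <app>
    <rdg wit="#ms">walked</rdg>
    <rdg wit="#ts #pub1">drifted</rdg>
  </app> through the <app>
    <rdg wit="#ms #ts">city</rdg>
    <rdg wit="#pub1">zone</rdg>
  </app> at night.</p>
"""


def witness_text(tei_xml: str, witness: str) -> str:
    """Rebuild the sentence as it reads in a single witness."""
    root = ET.fromstring(tei_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == TEI + "app":
            # Keep only the reading attested by the requested witness.
            for reading in child.iterfind(TEI + "rdg"):
                if witness in reading.get("wit", "").split():
                    parts.append("".join(reading.itertext()))
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    # Collapse the whitespace introduced by the XML layout.
    return " ".join("".join(parts).split())


for witness in ("#ms", "#ts", "#pub1"):
    print(witness, "|", witness_text(SAMPLE, witness))
```

One encoded file holds every version at once, so any single witness can be regenerated on demand, or all of them compared side by side.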

The Burroughs Archive home page.

TEI is also used to structure information so it can be accessed by visualization tools. The University of Victoria with support from the Social Sciences and Humanities Research Council of Canada built the Map of Early Modern London (MoEML). This is a digital edition of the 1561 Civitas Londinum or “Agas Map” as it is more commonly known. Underlying TEI structures such as Placeographies and Personographies drive color-coded filters that highlight different features of the map. The value of the MoEML is not only that it extends the ability of the public to access a rare and valuable map, but also that it provides an interactive layer to further its use for research.

A political cartoon from Politics in Graphic Detail.

Another example of the visualization power of TEI comes from the Historical Society of Pennsylvania. The Historic Images, New Technology (HINT) project resulted in a new tool for accessing some of the society’s visual collections. Political cartoons are especially sensitive to a loss of temporal context: politicians change, parties rise and fall, and issues evolve. Understanding this history is often integral to understanding the messages of the cartoons. Politics in Graphic Detail uses TEI to add an annotation layer to the display of political cartoons. Even novices can glean much of the historical context of the cartoons by hovering over the appropriate sections of the image, lowering the barrier to academic inquiry.

If you are interested in getting started with TEI, a good place to begin is TEI by Example. It offers some tutorials and exercises to help you learn the structure of the mark-up language. More adventurous users can visit TEI-C for the full TEI guidelines, including instructions for constructing a custom TEI schema. TEI-C is also the portal for the Journal of TEI and the TEI-L mailing list. These are both good resources for seeing how other scholars are using TEI in their research. Finally, for those interested in just having some TEI documents to experiment with, the Perseus Digital Library hosted by Tufts University has a vast number of ancient texts encoded in TEI for download.