Teaching Machines to Read

Written language is a persistent tool for communicating information, and many cultures develop it at some point in their history. It allows groups to share records, instructions, and cultural heritage across many years (some examples of written language held by Special Collections & Archives date to 2000 BCE). The modern world still relies heavily on written language to transmit and store data, information, and knowledge. Academic communication is still largely conducted in writing, and the internet is an incredibly complex mechanism for moving text between computers around the world, whether that text takes the form of tweets, emails, or webpages.


Over this long history, humanity has refined written languages to indicate the nuances of the data being transferred or stored. Bibliographic information (such as title and author) is often set apart from other text, whether physically, as a title page in a book, or typographically, through differences in type size and decoration. We as readers have learned that large type placed above bodies of smaller type is usually a title. Likewise, we recognize the white space in prose and poetry as paragraph, line, or stanza breaks.

These intuitions are not explicit to computers. Strides have been made in natural language processing (as this area of computer science is called), but it is not yet reliable enough for robust analysis. The Text Encoding Initiative Consortium (TEI-C) is a group of scholars and institutions interested in joining the computational power of modern computers with the study of “manuscripts, research papers, historical archives, early printed books, linguistic corpora, anthologies, critical editions, ancient inscriptions, and a wealth of other literary, historical, and cultural material”.

TEI encoded text
To this end, they developed and continue to refine the Text Encoding Initiative Guidelines (TEI). TEI is a suite of technologies designed to make explicit to computers what is implicit to us as readers. The fundamental part of TEI is a markup language expressed in eXtensible Markup Language (XML). TEI encoding is similar in form to HTML. A key difference is that HTML gives a computer instructions for displaying text, while TEI gives it instructions for comprehending text.
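To make the contrast with HTML concrete, here is a minimal sketch in Python. The TEI fragment is invented for illustration and is deliberately simplified (a schema-valid TEI header needs more elements than shown), and the function name `describe` is hypothetical. The point is only that the tags say what each piece of text is (a title, an author, a line of verse), so a program can ask for those things by name.

```python
import xml.etree.ElementTree as ET

# Invented, simplified TEI fragment (not schema-valid; for illustration only).
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Poem</title>
        <author>A. N. Author</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <lg type="stanza">
        <l n="1">The first line of the stanza,</l>
        <l n="2">and then the second.</l>
      </lg>
    </body>
  </text>
</TEI>"""

# TEI elements live in this XML namespace.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def describe(tei_xml: str) -> dict:
    """Pull out the title, author, and numbered verse lines by element name."""
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", NS).text
    author = root.find(".//tei:titleStmt/tei:author", NS).text
    lines = [(line.get("n"), line.text)
             for line in root.iterfind(".//tei:lg/tei:l", NS)]
    return {"title": title, "author": author, "lines": lines}


summary = describe(TEI_SAMPLE)
print(summary["title"], "by", summary["author"])
for number, text in summary["lines"]:
    print(number, text)
```

Nothing in the fragment says how the title should look on a page; it only says that the string is a title, which is the kind of knowledge a human reader would otherwise infer from layout.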

Visualization of the changes to a Burroughs sentence through drafts.
Given a marked-up text and the rules for precisely interpreting the markup, a computer can approach a text much as a human reader does intuitively. This level of access also allows a computer to quickly process a text in ways that take humans a long time and a lot of intellectual effort. Traditional bibliographic tools such as indexes and concordances (which can take humans months or even years to create by hand) can be completed in seconds by a computer with a properly encoded text.
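As a rough illustration of how quickly a concordance falls out of encoded text, the sketch below builds a very simple one in Python. The three verse lines are invented, the function name `build_concordance` is hypothetical, and the tokenizer is a crude regular expression; a real project would need to handle spelling variation, punctuation, and nested markup more carefully.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented verse lines, encoded with TEI <lg> (line group) and <l> (line).
TEI_LINES = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <lg>
    <l n="1">The river runs to the sea</l>
    <l n="2">and the sea returns the rain</l>
    <l n="3">the rain feeds the river</l>
  </lg>
</body>"""


def build_concordance(tei_xml: str) -> dict:
    """Map each lowercased word to the (line number, line text) pairs it occurs in."""
    root = ET.fromstring(tei_xml)
    concordance = defaultdict(list)
    for line in root.iterfind(".//tei:l", NS):
        text = "".join(line.itertext())
        # set() so that a word repeated within a line is listed once for that line
        for word in set(re.findall(r"[a-z']+", text.lower())):
            concordance[word].append((line.get("n"), text))
    return dict(concordance)


concordance = build_concordance(TEI_LINES)
for number, text in concordance["sea"]:
    print(number, text)
```

Because the line boundaries and line numbers come from the markup rather than from guesses about white space, the same function works on any text encoded with the same elements.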

Calendar of the Week July 12 1905
Excerpt from the Egyptian Gazette.

TEI has become a valuable tool in the developing field of Digital Humanities. Faculty at Florida State University (FSU) are using it to open texts up to new kinds of analysis. Dr. Will Hanley, Assistant Professor in FSU’s Department of History, uses TEI to analyze persons and places in colonial-era Egyptian newspapers. TEI has the capacity to assign unique identifiers to personal names and place names. Doing this across several regional newspapers over several years allows a scholar to extract names and contextual information from the texts. Techniques from network and graph theory can then be applied to the data, uncovering insights that were previously buried.
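The sketch below suggests how identifiers on names could feed a network analysis. It is not drawn from Dr. Hanley's project: the newspaper items, names, and identifiers (`#pers001`, `#place01`, and so on) are invented, and the function name `cooccurrence_edges` is hypothetical. It assumes names are tagged with the TEI `persName` and `placeName` elements and linked to an identifier through the `ref` attribute, then counts how often two identifiers appear in the same item.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented newspaper items. Note that "Mr. A. Example" and "Mr. Example"
# are spelled differently but share one identifier.
TEI_NEWS = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="item" n="1">
    <p><persName ref="#pers001">Mr. A. Example</persName> arrived at
       <placeName ref="#place01">Alexandria</placeName> on Tuesday.</p>
  </div>
  <div type="item" n="2">
    <p><persName ref="#pers001">Mr. Example</persName> and
       <persName ref="#pers002">Mrs. B. Sample</persName> left
       <placeName ref="#place01">Alexandria</placeName> for
       <placeName ref="#place02">Cairo</placeName>.</p>
  </div>
</body>"""


def cooccurrence_edges(tei_xml: str) -> Counter:
    """Count how often each pair of identifiers is mentioned in the same item."""
    root = ET.fromstring(tei_xml)
    edges = Counter()
    for item in root.iterfind(".//tei:div[@type='item']", NS):
        refs = set()
        for tag in ("persName", "placeName"):
            for element in item.iterfind(f".//tei:{tag}", NS):
                if element.get("ref"):
                    refs.add(element.get("ref"))
        # Sort so that each undirected pair is always stored in the same order.
        for pair in combinations(sorted(refs), 2):
            edges[pair] += 1
    return edges


edges = cooccurrence_edges(TEI_NEWS)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The resulting weighted pairs are an edge list, which is the usual input for graph and network tools. The shared identifier is what lets the two spellings of the same person count as one node.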

Dr. Stanley Gontarski (Robert O. Lawton Distinguished Professor of English) and Dr. Paul Ardoin (Ph.D. FSU 2014) have used TEI to examine the works of Samuel Beckett and William S. Burroughs. Literary analysis is typically done on final published works. For some writers, though, it is difficult to establish a single canonical version of a text. William S. Burroughs, for example, published several different versions of the same text over the course of his career. TEI is used in this case to examine a text as it moves through all of its different versions: manuscripts, typescripts, and variant published versions. Each version serves as a witness to a larger critical edition constituting the full body of the work. An outcome of this work is the Burroughs Archive.
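One way TEI can record several witnesses in a single file is its critical apparatus: an `app` element groups the variant readings (`rdg`) found at one point in the text, and each reading names its witnesses in the `wit` attribute. The sketch below assumes that approach. The sentence, the witness labels (`#ms`, `#ts`, `#pub`), and the function name `reading_for` are all invented for illustration; this is not text by Burroughs and not necessarily how the Burroughs Archive is encoded.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Invented sentence with two points of variation across three witnesses:
# a manuscript (#ms), a typescript (#ts), and a published version (#pub).
TEI_VARIANTS = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">The '
    '<app><rdg wit="#ms">old</rdg><rdg wit="#ts #pub">tired</rdg></app>'
    ' man walked '
    '<app><rdg wit="#ms #ts">slowly</rdg><rdg wit="#pub">home</rdg></app>'
    '.</p>'
)


def reading_for(tei_xml: str, witness: str) -> str:
    """Rebuild the sentence as it appears in one witness."""
    root = ET.fromstring(tei_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == f"{{{TEI_NS}}}app":
            # Keep only the reading attested by the requested witness.
            for rdg in child.iterfind("tei:rdg", NS):
                if witness in (rdg.get("wit") or "").split():
                    parts.append("".join(rdg.itertext()))
                    break
        else:
            parts.append("".join(child.itertext()))
        # Text that follows the child element belongs to every witness.
        parts.append(child.tail or "")
    return "".join(parts)


for witness in ("#ms", "#ts", "#pub"):
    print(witness, reading_for(TEI_VARIANTS, witness))
```

Storing the shared text once and only the differences at each point of variation is what makes it practical to compare a sentence across drafts or to display any single version on demand.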

The Burroughs Archive home page.
TEI is also used to structure information so it can be accessed by visualization tools. The University of Victoria, with support from the Social Sciences and Humanities Research Council of Canada, built the Map of Early Modern London (MoEML). This is a digital edition of the 1561 Civitas Londinum, or “Agas Map” as it is more commonly known. Underlying TEI structures such as Placeographies and Personographies drive color-coded filters that highlight different features of the map. The value of the MoEML is not only that it extends the public’s access to a rare and valuable map, but also that it provides an interactive layer that furthers its use for research.

A political cartoon from Politics in Graphic Detail.

Another example of the visualization power of TEI comes from the Historical Society of Pennsylvania. The Historic Images, New Technology (HINT) project resulted in a new tool for accessing some of the society’s visual collections. Political cartoons are especially sensitive to a loss of temporal context: politicians change, parties rise and fall, and issues evolve. Understanding this history is often integral to understanding the messages of the cartoons. Politics in Graphic Detail uses TEI to add an annotation layer to the display of political cartoons. Even novices can glean much of the historical context of a cartoon by hovering over the appropriate sections of the image, which lowers the barrier to academic inquiry.

If you are interested in getting started with TEI, a good place to begin is TEI by Example. It offers tutorials and exercises to help you learn the structure of the markup language. More adventurous users can visit TEI-C for the full TEI guidelines, including instructions for constructing a custom TEI schema. TEI-C is also the portal for the Journal of TEI and the TEI-L mailing list. Both are good resources for seeing how other scholars are using TEI in their research. Finally, for those who just want some TEI documents to experiment with, the Perseus Digital Library hosted by Tufts University offers a vast number of ancient texts encoded in TEI for download.

