In the first post in this digital preservation series, I shared some of the unique challenges digital material brings to the preservation game. In this one we will look at some of the technologies and tools digital stewards employ to protect our digital assets.
How can you tell when a computer file has been corrupted? If you try to open it, funny, glitchy things might happen. But how can you test whether a file is uncorrupted before anything goes wrong? This requires a bit more thought. At its base level, a digital file is a long string of 1s and 0s, called the file's bitstream. Preservationists could compare one bitstream to an earlier copy of it, but this requires a lot of processing power for large files, with no guarantee that the comparison copy isn't also corrupted.
This is where checksums can help us out. Checksums are character strings generated by a class of algorithms called hash functions (or cryptographic hashes). You can try one out here: http://md5checksum.com/. Hash functions are used to protect lots of things; the passwords you submit to websites, for example, are typically stored as hashes rather than as readable text.
Hash functions can also be applied to the bitstream of a file. Due to the nature of these algorithms, flipping even a single one or zero produces a drastically different checksum. If, at the beginning of the preservation process, a digital steward produces a checksum for the bitstream, she can later test for data integrity by rerunning the hash and comparing the output to the original checksum.
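As a minimal sketch of this fixity check, using Python's standard hashlib (the sample bitstream here is made up for illustration):

```python
import hashlib

def checksum(bitstream: bytes) -> str:
    """Return the SHA-256 digest of a bitstream as a hex string."""
    return hashlib.sha256(bitstream).hexdigest()

# At ingest: record the checksum alongside the file.
original = b"To be, or not to be, that is the question."
recorded = checksum(original)

# Later: re-hash the stored copy and compare to the recorded value.
assert checksum(original) == recorded  # file is intact

# Flipping a single bit yields a drastically different checksum.
corrupted = bytes([original[0] ^ 0b00000001]) + original[1:]
assert checksum(corrupted) != recorded  # corruption detected
```

In practice the steward would hash the file in chunks rather than loading it all into memory, but the comparison logic is the same.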
Now that we can test for unwanted changes in a computer file, how can we ensure we always have a valid copy of it? A system called LOCKSS (Lots Of Copies Keep Stuff Safe) can help. Similar to backing up personal files, LOCKSS duplicates the files given to it and distributes the copies across several servers. The system is spread over many servers in diverse geographic areas to minimize the risk of a single disaster (natural or otherwise) compromising everything at once. The distributed copies are regularly hashed and their checksums compared to test the validity of the files. If a checksum comparison fails, that server can delete its failing copy of the file and request a fresh one from the other servers.
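A toy sketch of this audit-and-repair cycle might look like the following. (The real LOCKSS software uses a far more sophisticated polling protocol between peers; this simple majority vote, with hypothetical server names, only illustrates the idea.)

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hypothetical copies of one file held on different servers;
# one copy has silently decayed.
copies = {
    "server-a": b"preserve me",
    "server-b": b"preserve me",
    "server-c": b"preserve mf",  # a flipped bit
}

# Each server hashes its copy; the majority checksum is taken as valid.
digests = {name: sha256(data) for name, data in copies.items()}
counts: dict[str, int] = {}
for d in digests.values():
    counts[d] = counts.get(d, 0) + 1
valid = max(counts, key=counts.get)

# A server whose checksum disagrees discards its copy and requests
# a replacement from a peer holding a valid one.
for name, d in digests.items():
    if d != valid:
        donor = next(n for n, dd in digests.items() if dd == valid)
        copies[name] = copies[donor]

assert all(sha256(data) == valid for data in copies.values())
```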
Digital preservation is a rapidly developing field. New challenges requiring new solutions arise every day. In the third and final post in this digital preservation series, I’ll discuss activities you can undertake to protect your personal digital heritage.
In this digital world we are increasingly creating, storing, and publishing material entirely in electronic forms. While this is great for the trees and other resources used in making paper, it introduces new challenges in the process of collecting and preserving materials.
The preservation needs of paper are pretty well understood. Guidelines for ideal environments (heat and humidity) and practices (handling and storage) have been in constant refinement for hundreds of years. The library, archive, and information science communities only began thinking about preservation for digital material comparatively recently. This first post of a three-part series on digital preservation will look at the challenges unique to preserving digital materials, and why we must approach digital preservation differently from physical preservation.
What might surprise many is the relative fragility of digital assets. Estimates put the average operational life of conventional digital storage media at around five years. Failures stem from more than mechanical wear: magnetic media are sensitive to anything generating a magnetic field, from batteries to the sun, and optical discs can suffer manufacturing errors or material degradation that render them unreadable. Additionally, once damaged, a digital resource is often completely lost. Physical material might be salvaged through conservation; recovering digital assets after damage is much more difficult.
Complicating the practice of digital preservation is the fact that digital materials are meaningless without the correct hardware and software environments to render them. Consider a printed book. The information conveyed by a book is encoded with ink marks made on paper. So long as the rules of the encoding language (that is, the language it is written in) and the marks on the paper persist, the information in the book can be recalled. The ink won’t independently leave the paper and reorganize into different patterns and structures.
This is exactly what happens to digital information. The long strings of characters encoding a digital asset are intelligible only to a narrow band of software and specific hardware configurations. Many of us have encountered the frustration of being unable to open an old file in a newer version of its software. Developers constantly add and remove features from their products, often with little attention to backwards compatibility. Merely storing the digital encoding (or bitstream) is meaningless without also storing instructions on how to rebuild it into an understandable, rendered product.
These extra considerations compound when you consider the speed of technological advances, and the new behaviors and interactive experiences we’re building and sharing with our machines and networks. Even identifying what behaviors and functions of digital assets are important to intellectual understanding of the resource is a quagmire. Those of us thinking about digital preservation have ceded a pretty large head-start, and the race is constantly accelerating.
In the next post of this blog series, I'll cover some strategies currently used by the digital preservation community. I'll finish the series with a post on what you can do yourself to safeguard your digital works and memories.
From Timothy Kanke who completed a year-long internship with Special Collections this spring:
The Florida State University Science Education Curriculum Development Collection documents the creation and development of science teaching materials produced by the science education program at Florida State University from the 1960s to the 2000s. A portion of the collection is composed of many different types of digital media containing completed games as well as various stages of code development. My internship focused on further exploring this portion of the collection and researching means of preserving and providing access to these materials. The collection presents many interesting challenges. It contains many different types of storage media, including 3.5” floppy disks, CDs, microcassettes, DAT, Travan, VHS, DVDs, EX drives, Jaz disks, laser discs, 8mm magnetic tape, and U-matic tape. Another challenge is that a majority of the materials are stored in older file formats for Apple II and Commodore Amiga computers that are not directly compatible with most modern operating systems.
There are a few methods of moving old software to a modern platform; both of the following create disk images that an emulator can read. One method is to directly connect the older machine to a clean workstation. For Apple II computers this can be done with Apple Disk Transfer ProDOS (ADTPro): a specialized program runs on a modern operating system, which is connected to an Apple II running a modified version of its own operating system. Another method uses a specialized controller such as the KryoFlux, a combination of a hardware controller (i.e., a circuit board) and software. The original storage device, such as a floppy disk drive, is connected to the controller, which then provides access to the contents of the floppy.
This internship has also given me experience in conceptualizing a collection, which shapes how its items are preserved and accessed. If the collection is to give future generations the chance to experience educational software gameplay from the late 1980s to early 2000s, then the original equipment, or at least an emulator, is a necessity. If the collection is to document historical computer code, then the code itself and its accompanying documentation need to be viewable. Or if the collection focuses on pedagogy, then a description of the gaming experience along with the supporting printed materials might be sufficient for future research. The description of a game could be a walkthrough video: this kind of video interview captures the monitor output and audio of the game as well as the words spoken by the player.
Of course, choosing one conceptualization does not automatically exclude the others, but a focus curtails unnecessary expenditures and brings to light what is truly necessary to preserve the collection. It allows researchers to immerse themselves in the content without being distracted by inconsistent processing. It is impossible to predict exactly how a collection will be used in the future, but a focus brings cohesion to the individual items.