In the first post in this digital preservation series, I shared some of the unique challenges digital material brings to the preservation game. In this one we will look at some of the technologies and tools digital stewards employ to protect our digital assets.
How can you tell when a computer file has been corrupted? If you try to open it, funny, glitchy things might happen. How can you test whether a digital file is uncorrupted? This requires a bit more thought. At their base level, digital files are long strings of 1s and 0s. This is called the file’s bitstream. Preservationists could compare one bitstream to an earlier copy of it, but this requires a lot of processing power for large files, with no guarantee that the comparison copy isn’t also corrupted.
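This is where checksums can help us out. Checksums are character strings generated by a class of algorithms called hash functions; the ones most useful for preservation are cryptographic hashes. You can try one out here: http://md5checksum.com/. Hash functions are used to protect lots of things. Strictly speaking they do not encrypt data, because a hash cannot be reversed to recover its input; instead, they produce a short, fixed-length fingerprint of it. Passwords submitted to websites, for example, are typically hashed before they are stored, so the site keeps only the fingerprint rather than the password itself.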
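To make the idea of a fingerprint concrete, here is a minimal Python sketch using the standard library's `hashlib` module. The input strings are made up for illustration, and SHA-256 is used only as one common choice of algorithm; nothing in this post depends on that particular one.

```python
import hashlib


def checksum(text: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of a string under the named hash algorithm."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


# Two inputs of very different sizes (both invented for this example).
short_digest = checksum("hello")
long_digest = checksum("hello" * 10_000)

print(short_digest)
print(long_digest)

# The digest length depends on the algorithm, not on the input size.
print(len(short_digest), len(long_digest))  # 64 64
```

Both digests are 64 hexadecimal characters long even though one input is ten thousand times larger than the other, which is what makes a checksum cheap to store and compare.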
Hash functions can also be applied to the bitstream of a file. Because of the way these algorithms work, changing even a single one or zero will produce a drastically different checksum. If a digital steward produces a checksum for the bitstream at the beginning of the preservation process, she can later test for data integrity by rerunning the hash and comparing that output to the original checksum.
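The check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any particular repository's workflow: the file name, its contents, and the helper names `file_checksum` and `flip_bit` are all hypothetical, and the corruption is simulated by flipping one bit on purpose.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Hash a file's bitstream in chunks so large files need little memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def flip_bit(path: str, byte_index: int = 0, bit: int = 0) -> None:
    """Simulate corruption by flipping a single bit of the file in place."""
    with open(path, "r+b") as handle:
        handle.seek(byte_index)
        original = handle.read(1)[0]
        handle.seek(byte_index)
        handle.write(bytes([original ^ (1 << bit)]))


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "letter.txt")  # hypothetical file
    with open(path, "wb") as handle:
        handle.write(b"Dear future reader")

    stored = file_checksum(path)  # recorded at the start of preservation

    intact_before = file_checksum(path) == stored
    flip_bit(path)  # one bit changes somewhere down the line
    intact_after = file_checksum(path) == stored

print(intact_before, intact_after)  # True False
```

Note that the steward only needs to keep the short stored checksum, not a second full copy of the file, in order to detect the change.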
Now that we can test for unwanted changes in computer files, how can we ensure we always have a valid copy of them? A system called LOCKSS can help with this. LOCKSS stands for Lots Of Copies Keeps Stuff Safe. Similar to the idea of backing up personal files, LOCKSS duplicates the files given to it and distributes the copies across several servers. The idea is to spread the system out over many servers in diverse geographic areas to minimize the risk of a single disaster (natural or otherwise) compromising the entire system. These distributed copies are then regularly hashed, and the checksums are compared to test the validity of the files. If a checksum comparison fails, that server can delete its failing copy of the file and ask the other servers for a new one.
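The audit-and-repair cycle can be modeled with a toy example. To be clear, this is not the LOCKSS software or its actual protocol; it is a simplified sketch that assumes the "correct" checksum is whichever one a majority of copies agree on. The server names, the file contents, and the function `audit_and_repair` are all invented for illustration.

```python
import hashlib
from collections import Counter


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a copy's contents."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Replace any copy whose checksum disagrees with the majority.

    Returns the names of the servers whose copies were repaired.
    """
    sums = {name: checksum(data) for name, data in replicas.items()}
    majority_sum, votes = Counter(sums.values()).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority agreement; cannot pick a good copy")

    # Any copy matching the majority checksum can serve as the repair source.
    good_copy = next(data for name, data in replicas.items()
                     if sums[name] == majority_sum)

    repaired = []
    for name, file_sum in sums.items():
        if file_sum != majority_sum:
            replicas[name] = good_copy  # discard the bad copy, fetch a good one
            repaired.append(name)
    return repaired


# Three hypothetical servers; the copy on server_c has been damaged.
replicas = {
    "server_a": b"archived report",
    "server_b": b"archived report",
    "server_c": b"archived repor7",
}

repaired = audit_and_repair(replicas)
print(repaired)  # ['server_c']
```

The sketch also shows why lots of copies matter: with only two copies that disagree, there is no majority and no way to tell which one is the damaged file.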
Digital preservation is a rapidly developing field. New challenges requiring new solutions arise every day. In the third and final post in this digital preservation series, I’ll discuss activities you can undertake to protect your personal digital heritage.