It has been nearly two months since my interest in discovering our family history was rekindled. In this short period of time I have been fortunate to be able to trace farther back on the family tree than I thought was possible. Indeed, I already have a clearer picture of the family tree than any of my immediate family. My sources have primarily been historical record websites and firsthand accounts shared by living relatives, and it has quickly become clear that preserving and organizing the accumulated data is of paramount importance. For many reasons I have opted to store every historical item that I possibly can in digital format, but preserving family history records digitally is a much more complex and daunting task than it might first appear. Even with adequate long-term storage methods it will be very difficult to share the family treasures without a well-planned organization system. I will focus on digital preservation today and leave the topic of organization for a future post.
As I thought about the challenges and choices associated with electronically recording historical items, I decided to focus on these:
- file formats and resolution
- standardized documentation
- ensuring accessibility at least 50 years into the future
- backup strategy
File Formats and Resolution
Obsolescence is of particular concern when it comes to file formats. For example, WordStar was a popular word processing application several decades ago, but finding software today that can read WordStar documents created in the 1980s is exceedingly difficult. As the pace of technological change accelerates, picking a file format that will still be readable in 50 years is a non-trivial matter. After extensive research on both government and personal websites, I made the following choices:
- Photographs: TIFF (Tagged Image File Format) with image compression disabled, interleaved pixel order, and IBM PC byte order. Prints and medium format negatives will be scanned using a personal flatbed scanner. 35mm negatives or slides will be sent to a professional scanning service to be digitized. Color images will be saved with 48-bit RGB bit depth (16 bits per channel) using the Adobe RGB 1998 color profile. Black and white images will be saved with 16-bit Grayscale bit depth using the Gray Gamma 2.2 color profile. Images will be scanned at a resolution (DPI or dots per inch) high enough so that the long side of each image is at least 4000 pixels.
- Documents: PDF/A (Portable Document Format for Archiving) version 2u.
- Audio: WAV (Waveform Audio File Format) uncompressed and recorded at 44.1 kHz/24 bit.
- Video: QuickTime or AVI (Audio Video Interleave).
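To make the resolution rule for photographs concrete, here is a minimal Python sketch of the arithmetic. It assumes a print whose long side is measured in inches, and the function names are my own illustrations rather than part of any scanning software. The audio estimate also assumes a stereo recording, which the settings above do not specify.

```python
import math


def min_scan_dpi(long_side_inches, target_pixels=4000):
    """Smallest whole-number DPI that gives at least target_pixels on the long side."""
    return math.ceil(target_pixels / long_side_inches)


def uncompressed_image_bytes(width_px, height_px, channels=3, bits_per_channel=16):
    """Approximate pixel-data size of an uncompressed image (ignores the small file header)."""
    return width_px * height_px * channels * bits_per_channel // 8


def wav_bytes(seconds, sample_rate=44100, bits_per_sample=24, channels=2):
    """Approximate audio-data size of an uncompressed WAV (channels=2 is an assumption)."""
    return seconds * sample_rate * channels * bits_per_sample // 8


# A 6 x 4 inch print needs about 667 DPI to reach 4000 pixels on the long side.
dpi = min_scan_dpi(6)
width_px, height_px = 6 * dpi, 4 * dpi
print(dpi, width_px, height_px)
print(uncompressed_image_bytes(width_px, height_px))  # 48-bit RGB, roughly 64 MB
print(wav_bytes(60))  # one minute of 44.1 kHz/24-bit stereo, roughly 16 MB
```

In practice a scanner may only offer fixed steps, so you would round up to the next available setting. The size estimates are a reminder of how quickly uncompressed masters add up.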
Standardized Documentation
When storing records digitally, it is important to add descriptive information (a.k.a. metadata) to make it easy to retrieve documents in the near term and to provide your posterity with valuable clues that explain the context, historical relevance, and original source of each item. At a minimum, you should record contextual and historical information that describes what the record is (e.g. a marriage certificate or a photograph of a specific person) and the source from which the record was obtained. If the information is known, you should also identify the creator of the record and, if applicable, how it was digitized. While it is possible to store metadata within some digital files (e.g. as IPTC or EXIF fields), this capability is not available in all formats, and embedded metadata is not obvious to someone who might stumble across your archive in the future. As a result, I opted to store metadata in simple ASCII text files that are likely to be readable for at least 50 years, if not longer. I may write an application in the future that reads and indexes these metadata files, so the text will be structured as XML to allow for programmatic reading and writing.
- Each image or document record file will be named using the format “FIDxxxxxxxxxxx – Description.yyy,” where xxxxxxxxxxx is a unique, incrementing identifier number and yyy is the file extension indicating the format (e.g. tif, pdf, jpg, etc.).
- A “sidecar” ASCII text file containing the metadata will be stored alongside each digital record.
- The Dublin Core metadata standard will be used and the text will be encoded as XML.
- The templates for the metadata files were created using the Dublin Core Generator tool created by Nick Steffel (link in the reference section below).
- An XML-based Microsoft Office Excel ® spreadsheet will be maintained that indexes all file names and includes a description for each. The content in this spreadsheet will be extracted from each metadata text file and is, therefore, not required to maintain the integrity of the archive. It is simply a mechanism to aid in finding digital records quickly. In the future I may develop a purpose-built application that will replace this spreadsheet-based indexing system.
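The naming convention and sidecar files described in this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `metadata` root element, the choice of Dublin Core fields, and the function names are mine, and the templates produced by the Dublin Core Generator may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def record_filename(number, description, extension):
    """Build a name like 'FID00000000042 – Description.pdf' (11-digit identifier)."""
    return f"FID{number:011d} \u2013 {description}.{extension}"


def build_sidecar(fields):
    """Serialize a dict of Dublin Core element names and values as ASCII-only XML."""
    root = ET.Element("metadata")  # hypothetical root element name
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    # Characters outside ASCII are written as numeric character references.
    return ET.tostring(root, encoding="us-ascii")


def read_sidecar(xml_bytes):
    """Read the elements back into a dict, e.g. to fill one row of an index."""
    root = ET.fromstring(xml_bytes)
    return {child.tag.split("}", 1)[1]: child.text or "" for child in root}


name = record_filename(42, "Marriage certificate", "pdf")
sidecar = build_sidecar({
    "title": "Marriage certificate",
    "description": "Scanned copy of a marriage certificate",
    "source": "Original held by a living relative",
})
print(name)
print(sidecar.decode("ascii"))
print(read_sidecar(sidecar))
```

Because `read_sidecar` recovers everything from the XML, an index like the spreadsheet can always be regenerated from the sidecar files alone. That is why the index is not needed to maintain the integrity of the archive.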
Ensuring Future Accessibility
Although it is impossible to predict the future direction of technology, it is worth the extra effort to ensure that your digital records can be accessed by future researchers. A target of 50 years was selected as a reasonable compromise between the exceedingly difficult and costly task of guaranteeing records can be retrieved for hundreds of years and the reckless option of assuming all of today’s formats will be usable even ten years from now. After all, spending countless hours tracking down and saving items of historical significance to your family would be a wasted effort if the next generation cannot easily continue your work.
The first step is to select commonly available file formats as described above to increase the probability that future software can read the files. Equally important is to migrate your files over time as formats become obsolete. My goal is to evaluate the file formats in use every 5 years and migrate whenever a particular format is falling out of favor.
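A periodic format review is easier with an inventory of what the archive actually contains. The following is a rough sketch, assuming the archive lives under a single directory. The function name and the example path are hypothetical.

```python
from collections import Counter
from pathlib import Path


def format_inventory(archive_root):
    """Count the files under archive_root by lowercased file extension."""
    counts = Counter()
    for path in Path(archive_root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical archive location; replace with your own directory.
    for extension, count in format_inventory("family_archive").most_common():
        print(f"{extension}: {count}")
```

A list like this shows at a glance which formats would need to be migrated and how many files each migration would touch.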
Another major challenge that is often overlooked or unknown is the degradation of your storage media. All digital information is stored as a series of 1’s and 0’s where each digit is called a “bit.” Each file format (e.g. TIFF, PDF, JPEG, etc.) specifies a particular structure and the algorithms used when digitizing information. Nearly every type of storage media degrades over time in a process called “bit rot.” A change in or loss of a single bit can result in corruption of part of a text document or a flaw in an image (sometimes one not even discernible to the human eye). Over time this bit loss may result in the file being completely unreadable. For some media, like standard CDs and DVDs, this can begin occurring after only a few years, whereas for archival quality media, like Millenniata’s M-DISC, it may take a decade or more. A link has been included in the reference section for those interested in learning more about the “bit rot” process. Needless to say, the certainty that your precious archives will be subject to degradation and corruption, often without you even knowing, is justification enough to seriously investigate how to overcome this challenge.
There are manual methods that some organizations recommend for overcoming this random, unpredictable bit loss, such as storing all files on archival M-DISCs or combining specialized software, checksum files, and multiple backups to periodically search for corrupted files and restore them from backups when corruption is detected. However, in my opinion, these suggestions are ill-advised for two reasons. One is the administrative overhead and diligence involved for all but the smallest collections. A digital archive totaling several terabytes would require dozens of M-DISCs. Unless the directory contents are static, updating such a large M-DISC collection as files are added, deleted, or renamed would be very cumbersome. The second is that this approach relies on a very fastidious curator to either keep the M-DISCs updated or perform regular checksum validations across the entire archive and maintain a high-integrity backup system should corruption be detected. Even as an engineer who has spent over a decade in the technology and consulting industry, I do not feel comfortable putting this much dependence on human intervention to protect the records from bit loss. As a consequence, I opted to create a purpose-built data storage system using the NAS4Free open source network attached storage (NAS) FreeBSD distribution and the ZFS file system that would automatically protect the data integrity of my files. Details of how to procure, build, and configure such a solution are beyond the scope of this post, but if you would like more information leave a comment or send me a tweet.
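For readers curious what the manual checksum approach involves, here is a minimal Python sketch. It is not the system I use. It assumes SHA-256 as the checksum and keeps the manifest as an in-memory dictionary. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large scans do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_changes(root, manifest):
    """Return the recorded files that are now missing or whose checksum differs."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A curator would save the manifest once, rerun `find_changes` on a schedule, and restore any reported file from a backup. Even a single flipped bit changes the checksum, but someone still has to remember to run the check and act on the result.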
Backup Strategy
Although all of us have experienced data loss due to a failed hard drive or an accidental deletion, or know someone who has, backing up is typically treated like exercise. You know you should do it, but for some reason it is always put off until later, typically until it is too late. One of the common approaches is to employ a LOCKSS (“Lots of Copies Keep Stuff Safe”) strategy. Locally, I get redundancy through the use of ZFS RAID-Z1 on my NAS4Free server, which lets the storage pool survive the failure of a single drive, and I run a ZFS scrub weekly to check for corruption. However, all that redundancy and bit loss prevention does me no good if a lightning strike fries the entire system or a fire breaks out. To truly protect yourself from such a catastrophe you need to back up offsite.
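The idea behind single-parity redundancy such as RAID-Z1 can be illustrated with a toy Python example. This is a deliberate simplification and not how ZFS lays out data on disk. It only shows how one extra parity block lets the contents of one lost drive be rebuilt. The drive contents are made up.

```python
def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for index, value in enumerate(block):
            parity[index] ^= value
    return bytes(parity)


# Three hypothetical data blocks, one per drive (all the same length).
drive_a = b"birth cert"
drive_b = b"1921 photo"
drive_c = b"audio tape"
parity_drive = xor_parity([drive_a, drive_b, drive_c])

# Simulate losing drive_b: XOR of the survivors and the parity restores it.
rebuilt = xor_parity([drive_a, drive_c, parity_drive])
print(rebuilt == drive_b)
```

Parity alone only helps when you know which block is bad. ZFS also keeps checksums of the data, which is what a scrub uses to find silently corrupted blocks. Neither mechanism protects against losing the whole machine.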
One of the challenges of backing up a family history or genealogy archive to an offsite location is that it can get quite expensive once you have several hundred gigabytes or even a few terabytes of data. After researching over a dozen different solutions (including the popular Mozy, Carbonite, and Amazon S3 offerings), I decided to go with CrashPlan+. The “Family Unlimited” option is currently $10 per month for up to 10 computers if you sign up for 1 year. The key is that it allows the backup of unlimited data from network attached storage. At $120 per year, this is significantly cheaper than the other reputable providers I evaluated when you factor in my need to back up several terabytes.
Resources and Further Reading
The full list of resources I used to come up with my preservation strategy is extensive and was not fully documented during the research phase. However, there are several websites that I continue to consult and that I highly recommend, whether you are just starting out or are looking to optimize your approach after accumulating large amounts of data.
- How to Digitally Archive and Share Historical Photographs, Documents, and Audio Recordings
- FamilySearch.org – Preserving Your Family History Records Digitally
- The Library of Congress – Preparing, Protecting, Preserving Family Treasures
- Dublin Core ® Metadata Generator
- Dublin Core ® Metadata Initiative
- Wikipedia article on Bit rot
- Wikipedia article on ZFS Data Integrity
- NAS4Free – The Free Network Attached Storage Project
- CrashPlan+ Overview
Note: Icons in this post licensed for use under CC BY-NC-SA 2.5 by Laurent Baumann.