Digital Archive on Paper (Part II)


For my thesis, I am creating a body of computer verified mathematical proofs. For posterity, I would like to preserve the sources of my proofs with my thesis.

Previously I have attached a files to my PDF documents and stored them in electronic archives. But this electronic storage is still fragile and only available to people who get an electronic copy of my thesis.

Presumably thesis are printed on high quality acid-free paper for a reason. I feel that my computer proof should last at least as long as my paper thesis. Thus I am looking for a way to place my digital proof into my paper thesis.

One can encode digital data on paper simply by marking paper with a sequence of symbols, such as black or white squares. What I am looking for is some standardised encoding mechanism. My requirements are an open and freely specified standard, reasonable error correction, large data storage, and support for multiple pages.

The state of the art in digital storage on paper is 2D barcode symbology. The major open standards for 2D barcodes are PDF 417 (ISO/IEC 15438), Datamatrix (ISO/IEC 16022), and QR code (ISO/IEC 18004). These 2D barcodes are usually used for marking parts and other objects with identifying information that can be read by hand-held scanners. As such, the standards only allow storage of limited amounts of data. One PDF 417 barcode can store up to 1 100 bytes of data, one Datamatrix can store up to 1 556 bytes, and one QR code can store up to 2 953 bytes of data. A single barcode is far to small to store my data; however, these 2D barcode standards provide ways to link multiple barcodes together (at cost of adding some header information). Datamatrix and QR code allow up to 16 barcodes to be linked together into one file. Datamatrix allows for 216 file ids, while QR code does not seem to have any file id information. A protocol of PDF 417 called Macro PDF 417 appears to allow up to 99 999 barcodes to be linked together, and allows an arbitrary string to be used as a file name.

The compressed sources of my proof of Gödel’s first incompleteness theorem require 146 008 bytes. Ideally I would like to store the entire lambda expression representing the proof. This could require up to 2 megabytes of storage. I would expect the data to be decoded only if the electronic resources are no longer retrievable. In that case I would expect the pages to be scanned with something like a flat bed scanner. My requirements are malaligned with the goals of these 2D barcode standards. The technology closest to my requirements appears to be Xerox Parc’s proprietary Dataglyphs format. Dataglyphs are intended to be stored in the background of a page, or hidden in an image. However even Dataglyphs appear to be limited to 64 kilobytes per page. There is no standardized way of linking together multiple Dataglyphs; however Rakhi Motwani’s master’s thesis describes a protocol for linking together multiple Dataglyphs. The process simply packetizes the data into equal sized packets, and optionally adds parity pages so that data can still be recovered if pages are lost. Because Rakhi Motwani treats the Dataglyphs encoding/decoding process as a black box, her method could be used with any method that encodes a limited sized block of data into a paper, such as series of linked 2D barcodes.

Rakhi Motwani’s use of parity pages is a little sad. The Dataglyphs format already (somehow) has error correction built in. My limited understanding of coding theory tells me that our codes can correct for any choice of percentage (say 18%) of errors in the data, no matter how those errors are distributed. Thus if Dataglyphs supported multiple pages natively, there would be no need for parity pages because even if 18% of the pages are missing, the data could still be recovered.

On the other had, even the 2D barcodes do not seem to use error correcting codes in the above manner. The data is often divided into blocks, so that each block can only be recovered if each blocks of data has less than 18% errors. It is still possible to lose your data if you have less than 18% corruption in your barcode if all the errors occur in a single block of data. I don’t understand why they do this. Usually the reason to carve data into blocks is because you want to read part of your data at a time (such as when playing a CD). Because I intend to store compressed text data, it is fine to require that the entire data be read before error correction and decoding is done.

The best option I see is to use PDF 417 and link together a large number barcodes to encode my data. Since PDF 417 probably uses error correction of blocks of data, I’m inclined to ignore the problem of missing pages.

Anyone have any other ideas? Should I study coding theory and consider writing my own ad-hoc format that will become an international standard for storing ancillary data in journal articles? How large are your spreadsheets of experimental data, and how big is the source code for your computer science research?


, ,

Russell O’Connor: contact me