Long Term Storage of Electronic Data and Documents

Thomas D. Schneider, Ph.D.
Frederick, MD

This document explains:

It has recently become common practice to send email containing attachments. These attachments are often encoded by the MIME method, a technically better replacement for uuencode in that it allows binary data to be transmitted reliably by email.

Attachments that contain word processor documents are, however, different. It is the practice of software companies to record these documents in a binary format. This serves two important purposes for them:

These same features are to the distinct disadvantage of the user. Documents written in word processor formats may be impossible to convert to other formats even if the conversion programs are supposedly available. At my work, we attempted to convert a document written in Italian to a format that will last. After spending an entire afternoon using 2 Sun workstations, 2 Macs, a PC powerbook, disks and ethernet for transfer, three of us, who were well experienced in all of these systems, were unable to succeed. Apparently the word processor binaries are not compatible between different computers (i.e. PC vs Mac).

The practices of the companies have the effect that the user is tied not only to the particular word processor program, but sometimes also to the computer on which they wrote the document. Transfer of the document for archival purposes becomes increasingly difficult as the old computer becomes obsolete. That is, the documents may be easy to create, but they will not last 20 years or more. Anyone who has used a word processor from a now defunct company or computer knows that unless they converted the documents before moving on, the documents are lost.

Because I would like to be able to read email from a long time ago, I cannot store word processor binary formats. Furthermore, if someone sends me such a format, it would force me to buy the product. While I'm sure the computer companies like this, it is inappropriate, perhaps even unethical, for people to be forcing me to buy poorly designed software.


Example 1: wasteful storage
bytes    document type              fold difference
 6633    text/plain                 1
17277    application/wordperfect    2.6 fold larger!
53949    application/msword         8.1 fold larger!
This is data on a document someone sent me in 2001. It makes it clear why I call the second word processor 'wordY'. It is wasting an enormous amount of disk space everywhere you send the message. If you can do the messages in plain text or HTML it would be more efficient. If you want to do something fancy, send just an abstract and a URL; that's even smaller!
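The fold differences in Example 1 are simple ratios against the plain text size. As a minimal sketch (the function name `fold_difference` is made up for illustration), they can be recomputed like this:

```python
def fold_difference(size_bytes: int, baseline_bytes: int) -> float:
    """Return how many times larger size_bytes is than baseline_bytes."""
    return size_bytes / baseline_bytes


# File sizes in bytes, copied from the Example 1 table.
sizes = {
    "text/plain": 6633,
    "application/wordperfect": 17277,
    "application/msword": 53949,
}

baseline = sizes["text/plain"]
for doc_type, size in sizes.items():
    # One decimal place matches the precision used in the table.
    print(f"{size:6d}  {doc_type:24s} {fold_difference(size, baseline):.1f} fold")
```

The same function applies to the other examples by swapping in their byte counts and choosing the smallest file as the baseline.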

Example 2: wasteful storage
bytes     document type    fold difference
  6600    html             1
177664    .doc             26.9 fold

Example 3: A Schedule, 2001 June 15
bytes    document type                              fold difference
  746    itinerary.txt: text/plain                  1
 5718    itinerary.wpd: application/octet-stream    8 fold larger!
27667    itinerary.doc: application/msword          37 fold larger!

Example 4: Zapped Equations
Someone sent me a scientific paper that they had printed in their version of wordY. The equations were impossible to understand, making me think that the author was a nut case. Working with another person who printed the same paper out, I learned what had happened. The times symbol was printed as a double-headed arrow, the alpha symbol became an underscore, and the sigma symbol disappeared entirely. Thus the same wordY document does not print the same way for two people! Based on this experience, I suggest that no scientist should ever risk using word processor formats!
  • Platform that worked: IBM Thinkpad 1480i running MS Windows 98, 2nd edition, 4.10.2222A. Microsoft WORD 2000 v 9.0.2720
  • Platform that failed: Macintosh, Microsoft Word version 8, as part of Microsoft Office 98.

Example 5: A Letter, 2003 January 12
bytes    document type                                       fold difference
62344    original base64 encoded email                       -
46080    letter.doc wordY file                               1
 1276    letter.txt: text/plain made using openoffice.org    36.1 fold smaller
 1183    letter.txt: trimmed by hand                         39.0 fold smaller!
This person is wasting 97% of their disk space on pure binary JUNK! The email was about 53 times larger than necessary!

Example 6: A Schedule for a Speaker, 2004 April 6
bytes    document type    fold difference
19968    speaker.doc      1
  266    speaker.txt      75 fold smaller!
This person is wasting 99% of their disk space on pure binary JUNK! This is the current world record.
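In the letter example above, the email (62344 bytes) is larger than its letter.doc attachment (46080 bytes) because base64 turns every 3 bytes of binary into 4 text characters. The following is a rough sketch of that overhead using Python's standard `base64` module. The helper name `mime_base64_size` is invented here, and the zero-filled payload is only a stand-in, since the encoded length depends on the input length and not on its content.

```python
import base64


def mime_base64_size(binary_size: int) -> int:
    """Return the size in bytes of a binary blob after MIME-style base64 encoding."""
    payload = bytes(binary_size)  # stand-in content of the right length
    # encodebytes wraps the output at 76 characters per line, as email does.
    encoded = base64.encodebytes(payload)
    return len(encoded)


doc_size = 46080  # letter.doc, from the table above
encoded_size = mime_base64_size(doc_size)
print(encoded_size, f"{encoded_size / doc_size:.2f} times the original")
```

The encoded attachment alone comes to a little over 62,000 bytes, which is close to the size recorded for the whole email. The remaining difference is presumably the mail headers and message text.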

Superior Alternatives to Proprietary Word Processors

There are several alternatives available. For email the accepted standard is pure ASCII. That is, a text-only format. While this may seem primitive, it will last. Further, it serves many purposes adequately. Any documents that one wants to keep for a long time are best stored in an ASCII format.

How can one store complex documents for a long time without worrying that the company will go out of business or make the document obsolete? There are two ASCII based formats that are particularly good. The older one is called TeX or, in the more convenient form of TeX, LaTeX. (The even older troff and nroff are still used by some people, but they are not as good as LaTeX.) In this format one types commands such as \emph{emphasis in the form of italics will be generated from this}. Learning such commands is not as hard as people sometimes imagine, but they are far more powerful than a GUI (graphical user interface) because they allow the user to make up new commands and are fast because one does not need to move the mouse through long menus. After typing the commands, one puts the text through a converter that produces beautiful typesetting. This is TRUE typesetting; it runs circles around what word processors can do. It has been used to typeset entire books. LaTeX is used around the world.

A common complaint about LaTeX is that it is not WYSIWYG. That stands for What You See Is What You Get, but it often is BNWYW: But Not What You Want! I have been frustrated by a commonly used word processor that, compared to LaTeX, could not make an equation beautiful, which LaTeX does very well. But the complaint is valid: rapid feedback on the results of typesetting is useful. Some time ago I invented a program called atchange. This program simply watches one or more files. When a file changes, atchange will execute any series of commands that I want. It only takes 10 seconds to set up the commands, but atchange uses them hundreds of times. This means that I spend less time moving the mouse and much more time doing work. Since computers have become so fast, atchange allows LaTeX to become a WYSIWYG (and you get what you want). I work in a simple but ergonomic editor (vi or vim) and when I type one key - a comma - the file is written out. Atchange notices this and calls the commands - including LaTeX - to typeset and display the text. On a 200 MHz machine, a 50 page technical paper can be typeset in about a second. As computers get faster, the time becomes negligible. So one gets the best of both worlds - a fully programmable powerful typesetting language that uses ASCII (and so will last) AND a WYSIWYG.
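The idea of watching files and running commands when they change can be sketched in a few lines of Python. This is not the atchange program itself, only an illustration of the technique by polling modification times. The function names (`snapshot`, `changed`, `watch`) and the polling interval are assumptions made for this example.

```python
import os
import subprocess
import time


def snapshot(paths):
    """Record the current modification time (in nanoseconds) of each file."""
    return {path: os.stat(path).st_mtime_ns for path in paths}


def changed(before, after):
    """Return the sorted list of files whose modification time differs."""
    return sorted(path for path in after if before.get(path) != after[path])


def watch(paths, command, poll_seconds=1.0, max_runs=None):
    """Run `command` each time any watched file changes.

    Loops forever unless max_runs is given. `command` is an argument
    list such as ["latex", "paper.tex"].
    """
    before = snapshot(paths)
    runs = 0
    while max_runs is None or runs < max_runs:
        time.sleep(poll_seconds)
        after = snapshot(paths)
        if changed(before, after):
            subprocess.run(command, check=False)  # keep watching even if the command fails
            runs += 1
        before = after
    return runs
```

A call such as `watch(["paper.tex"], ["latex", "paper.tex"])` would then re-typeset a hypothetical paper.tex every time the editor writes it out.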

Another feature of LaTeX that is incredibly nice: automatically formatted bibliographies. I just type "\cite{Shannon1949}" and the bibliographic entry is put into the paper in the right place and all references throughout the paper are altered automatically. It is incredible to see people still struggling with these things when such a powerful tool is freely available. I have set up atchange to redo the bibliography automatically whenever a new entry is put in the paper OR when a new entry appears in my reference database.

The second ASCII based format is HTML (HyperText Markup Language), the language that is used to create pages on the World Wide Web, such as this one (which I typed in vi). Like TeX/LaTeX, HTML has commands for defining how a page is to be typeset. Unfortunately it is not the same as LaTeX and so some powerful features were lost. I believe that in time HTML and LaTeX will be fused or a third language that covers both will emerge. A conversion program has been written that takes LaTeX to HTML (latex2html). Because both languages are in ASCII, it will always be possible to write conversion programs reasonably easily. Since the Netscape program can be told to go to pages or to refresh a page, in combination with atchange one can have a WYSIWYG for HTML. (Further information is on the atchange page.)
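To illustrate why conversion between ASCII based formats is tractable, here is a toy Python sketch that turns the LaTeX \emph{...} command into the HTML <em> element. It is in no way how latex2html works. It handles only this one command, does not support nested braces, and the function name `emph_to_html` is invented for the example.

```python
import html
import re

# Matches \emph{...} where the braces contain no nested braces.
EMPH = re.compile(r"\\emph\{([^{}]*)\}")


def emph_to_html(latex_text: str) -> str:
    """Convert \\emph{...} to <em>...</em>, escaping HTML special characters."""
    # Escape &, < and > first so that plain text cannot be mistaken for markup.
    escaped = html.escape(latex_text, quote=False)
    return EMPH.sub(r"<em>\1</em>", escaped)


print(emph_to_html(r"This is \emph{emphasized} text & nothing more."))
```

Because both the input and the output are ordinary text, the whole conversion is a matter of string processing with standard tools.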

Since HTML has rapidly become a widespread standard, it seems reasonable that email in HTML should be acceptable. Although purists object, at least HTML is readable without a web browser and so, if done carefully, does not come out as pure garbage when viewed without a program.

One predictable source of trouble for the ASCII format is the move to Unicode. ASCII defines 7 bits per character, normally stored in an 8-bit byte, and the high-order bit can cause trouble if one sets it in text files. Unicode was designed around 16 bits and so allows much larger character sets; Unicode 2.0 contains 38,885 distinct coded characters, making it a truly international standard. At some point it will be necessary to transform ASCII documents to Unicode, but conversion programs should be easy to create since all they have to do is map each ASCII character to its Unicode equivalent.
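The mapping from ASCII to Unicode described above is direct, because each ASCII character has the same numeric value as its Unicode code point. A minimal Python sketch, with the function name `ascii_to_unicode` chosen only for illustration:

```python
def ascii_to_unicode(data: bytes) -> str:
    """Decode ASCII bytes into a Unicode string.

    Raises UnicodeDecodeError if any byte has its high-order bit set,
    since such bytes are not ASCII.
    """
    return data.decode("ascii")


original = b"Long term storage"
text = ascii_to_unicode(original)

# Every ASCII character keeps its numeric value as a Unicode code point.
assert [ord(ch) for ch in text] == list(original)

# A 16-bit encoding widens each character to two bytes.
wide = text.encode("utf-16-be")
print(len(original), len(wide))
```

The error raised for bytes above 127 is the same high-order bit problem mentioned in the text: such files need a decision about which character set they were written in before they can be converted.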

TeX, LaTeX, HTML and atchange are FREE. There are people around the world working to improve them. Physicists and mathematicians use TeX and LaTeX because of their wonderful ability to typeset mathematical equations and scientific notation. For some reason biologists have lagged behind.

Recommendations

Resources: News:


This page was written entirely with my personal resources on my own time. No governmental funds, equipment or electrons were used.


Tom Schneider's Home Page
origin: 1998 Feb 9
updated: 2008 Mar 07