A Database for Project Gutenberg E-text Files

Our company Controller takes control of the E-texts on the 2000 CD InfoBase.

By Wayne Kneeskern

For the past five years, in addition to my Controller duties at Thaddeus Computing, I've worked as the Copy Editor and proofreader. Most of my proofreading has focused on the printed issues of The HP Palmtop Paper and Pocket PC (formerly Handheld PC Magazine). With the Internet becoming such a big part of our business, I have been devoting more and more time and attention to our Web sites and related electronic materials. This includes the CD InfoBase that we have produced since 1996.

Each year we have included, as a special bonus, E-text files from Project Gutenberg (PG). In 1998 we discovered that there were more files than we could fit on one CD, so we created a separate CD with just the PG files. For the 2000 CD InfoBase there wasn't room to add the 1999 PG files to the bonus CD, so we put them on the primary CD.

While copy editing and proofreading the 2000 CD InfoBase I began to realize just how much information is on these two CDs. Not only are there enough program files to keep me busy for the next several years, but there are also all the past issues of The HP Palmtop Paper and PC In Your Pocket. But the part that intrigued me the most was the wealth of reading material available from Project Gutenberg. If you are not familiar with Project Gutenberg, you can get more information at their Web site, http://www.gutenberg.net, or go to http://promo.net/cgi-promo/pg/cat.cgi

Getting Organized

As an accountant, I like to have things organized so I can find them when I want them. I discovered that all the CD InfoBase offered was a list of the PG files for each year. If you didn't want to browse through the list of 1,800-plus files, you could use the "Find" feature of the CD InfoBase to search for a document's title or author. But if you were looking for all the works of Charles Dickens, for example, you would end up back at the previously mentioned list with his last name highlighted 84 times. So you still had to browse through the list to find what you were looking for. Even after you found what you wanted to read, there was no way to hot-link to the work itself. You would have to put in the Project Gutenberg CD, look for the directory the work was in, open it, and then find the file listing and double-click on it. Hopefully you wrote down all the relevant information while you were looking at the list on the CD InfoBase.

For me, searching for a particular document on the Project Gutenberg Web site wasn't that much better. The PG Web site gives a page with a complete list of all titles or authors, but there are no links from the titles to the documents themselves. The Web site also lets you browse by title or author. In this case you make your choice, get a list of the alphabet, and choose the first letter of the title or author you want to look for. By clicking on D in the author section you can scroll down through the authors until you find three separate groupings of Charles Dickens' works. By clicking on a title under his name you can bring up the text of that work to read.

None of the above options provided a simple method of identifying a document. For me, the ideal solution to the problem would be something as easy to use as the old card files that the public library used to have.

Project Project Gutenberg

I decided it was time to take on a new project in my spare time. The CD InfoBase Project Gutenberg files needed to be organized for use on the palmtop. The question was how?

I arbitrarily decided to limit my solution to one of the HP 200LX's built-in applications. That meant I had four possible answers to choose from: Lotus 1-2-3, Database, NoteTaker and PhoneBook.

In my accounting duties I use spreadsheets all the time. I especially like the feature of a spreadsheet that lets me sort items on different columns. By organizing the PG files into a Lotus 1-2-3 spreadsheet I could sort through all the works from Project Gutenberg any way that worked best for me. I also had some basic knowledge of how to set up a custom database of my own.

I decided to try a Lifeline and use a "50-50." Now I was down to deciding between the Lotus 1-2-3 spreadsheet and a custom Database.

Again I felt I needed some help, so I decided to "Ask-the-audience," which in my case meant discussing it with our in-house "experts."

Still not completely sure how to proceed, I opted to "Phone-a-friend." Tom Gibson was the Technical Editor of The HP Palmtop Paper and also did a lot of the work in compiling the PG files for the CD InfoBase.

A custom Database was "my final answer."

The Project Gutenberg Project: a Custom Database

Creating a custom database on the HP Palmtop is relatively easy. The User's Manual contains enough information to get you started. Several articles in The HP Palmtop Paper pick up where the User's Manual leaves off. The biggest problem is in getting the database just right. Choosing the fields to include and getting them lined up in a usable format can be tricky.

To make my custom database I created the following fields for each document: Title, last name and first name of the Author, File name, Directory, Category, and Notes. Using these fields I would be able to sort and group the data any way I wanted. For example, I could have an alphabetical listing of all titles, group titles by author's last name, or find all the titles in a certain category. The Category field was the hardest to define. I kept adding to the categories as I entered the data for each document. I ended up with categories for Historical Documents, U.S. Government Documents, U.S. Historical Documents, Reference, Religion, Music, Classics, GIF Images, and Motion Pictures, plus zz (a catch-all I used for titles I didn't know how else to classify at the time).
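For readers who think in code rather than in palmtop screens, the record layout above can be sketched in a few lines of Python. This is only an illustration of the idea, not the 200LX Database format: the field names follow the article, but the Python types, the helper function names, and the sample file and directory names are all assumptions made up for the example.

```python
from dataclasses import dataclass
from itertools import groupby
from operator import attrgetter


@dataclass
class Record:
    # Field names follow the article; the types are an assumption.
    title: str
    author_last: str
    author_first: str
    file_name: str
    directory: str
    category: str
    notes: str = ""


# Hypothetical sample rows: file names, directories, and categories
# are invented for illustration only.
records = [
    Record("Tale of Two Cities", "Dickens", "Charles", "2city.txt", "etext94", "Classics"),
    Record("Christmas Carol", "Dickens", "Charles", "carol.txt", "etext92", "Classics"),
    Record("Tarzan of the Apes", "Burroughs", "Edgar Rice", "tarzn.txt", "etext93", "zz"),
]


def titles_alphabetical(recs):
    """Alphabetical listing of all titles."""
    return [r.title for r in sorted(recs, key=attrgetter("title"))]


def titles_by_author(recs):
    """Group titles under each author's last name."""
    ordered = sorted(recs, key=attrgetter("author_last", "author_first", "title"))
    return {
        last: [r.title for r in group]
        for last, group in groupby(ordered, key=attrgetter("author_last"))
    }


def titles_in_category(recs, category):
    """All titles filed under one category, in alphabetical order."""
    return [r.title for r in sorted(recs, key=attrgetter("title")) if r.category == category]
```

Calling `titles_by_author(records)` on the sample rows collects both Dickens titles under one key, which is the kind of grouping a plain yearly file list cannot give you.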

Entering Data into the Database

After settling on a format for the database, I decided that I wanted to have a working knowledge of each of the files in the Project Gutenberg section of the CD InfoBase. Hence, I didn't "copy and paste" from a computer-generated list of files. Instead I chose to enter all information into the database the old-fashioned way: type it in word for word and character by character.

I realized that this would be a lot of typing, but one of our latest Palmtop products, an external keyboard, proved to be most helpful (see www.PalmtopPaper.com/cart/shop/kbd.htm). Let it be known that the keyboard is a great way to solve the typing problems on the 200LX. It sure made my job a lot easier, and as a touch typist I found it easy to adapt to the size and spacing of the keys.

After hours and hours of data entry I finally had all the information entered into my database. Now the editing and proofing started. This involved comparing the file name entries in the database to the file names listed on the CD. I had made some mistakes in my data entry work and was able to correct those. But the major problem I found was that there were file names listed in the database that were not on the CD. For some unknown reason they either did not get downloaded from the Project Gutenberg Web site or were overlooked when files were copied to the CD. In any case I went through the database one record at a time, checking each file against those actually found on the CD. Rather than delete the missing entries from the database (so the database and CD files would correspond), I changed the Directory name of those not found on the CD. (Those not on the CD were labeled in the Directory field with the year they were translated by Project Gutenberg and the number assigned to the work; e.g., 99-1774.) Now when I looked up a Title and saw that Directory name I knew I wouldn't find it on the CD but would have to go to the PG Web site to read that document.
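I did this check by hand, but the same comparison can be sketched in Python for anyone who would rather let a desktop computer do the legwork. Treat it as a rough sketch under stated assumptions: the function name `flag_missing`, the dictionary layout, the case-insensitive match, and the sample file names are all hypothetical. Only the "99-1774" style of label comes from my database.

```python
def flag_missing(records, files_on_cd, web_labels):
    """Relabel the Directory of every record whose file is not on the CD.

    records     -- list of dicts with "file_name" and "directory" keys
    files_on_cd -- iterable of file names actually present on the CD
    web_labels  -- dict mapping file name -> "YY-NNNN" label
                   (year and Project Gutenberg work number)
    Returns the file names that were relabeled.
    """
    # Assumption: names are compared without regard to case.
    on_cd = {name.lower() for name in files_on_cd}
    missing = []
    for rec in records:
        if rec["file_name"].lower() not in on_cd:
            # Keep the record, but point the reader to the Web site instead.
            rec["directory"] = web_labels.get(rec["file_name"], "not-on-cd")
            missing.append(rec["file_name"])
    return missing


# Hypothetical data for illustration only.
sample = [
    {"title": "Example A", "file_name": "exmpa10.txt", "directory": "etext99"},
    {"title": "Example B", "file_name": "exmpb10.txt", "directory": "etext99"},
]
not_found = flag_missing(sample, ["EXMPA10.TXT"], {"exmpb10.txt": "99-1774"})
```

After the call, the record for the missing file keeps its title and file name but carries the year-and-number label in its Directory field, so nothing is deleted from the database.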

Another thing that may be unique to my database is how I listed the titles of the documents. First, I took out any "A", "An", or "The" from the beginning of a title. Second, when I found documents that I felt should be grouped together, I wrote the titles in such a way that they would be listed one after the other when sorted by the first word of the title. For example, to keep all the Tarzan books together I listed the titles "Tarzan, Return of" and "Tarzan, The Untamed" so they would stay with the main title "Tarzan of the Apes." Another example is the numerous inaugural addresses of our various presidents. Instead of putting their names first, I started each title with "Inaugural Address," and then appended the president's name.
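The two title rules above are simple enough to sketch in Python. I applied them by hand while typing, so the code below is only an approximation of my habits, not a tool I used: the function names are hypothetical, and the second rule handles only the case where the series name is the last word of the title.

```python
LEADING_ARTICLES = ("a ", "an ", "the ")


def strip_leading_article(title):
    """Drop a leading "A", "An", or "The" so the title sorts on its first real word."""
    lowered = title.lower()
    for article in LEADING_ARTICLES:
        if lowered.startswith(article):
            return title[len(article):]
    return title


def catalog_title(title, series=None):
    """Return the title as it would be entered in the database.

    If `series` is given and the title ends with that word, move it to
    the front, e.g. "The Return of Tarzan" -> "Tarzan, Return of".
    """
    stripped = strip_leading_article(title)
    if series and stripped.endswith(" " + series):
        return f"{series}, {stripped[: -len(series) - 1]}"
    return stripped


entries = sorted(
    [
        catalog_title("The Return of Tarzan", series="Tarzan"),
        catalog_title("Tarzan of the Apes"),
        catalog_title("A Tale of Two Cities"),
    ],
    key=str.lower,
)
```

Sorting `entries` puts the two Tarzan titles next to each other, which is the whole point of rewriting the titles before entering them.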

The great thing about a custom database is that anyone and everyone can modify it to his or her own liking. So now that the basic information is all organized, you can take this database and change it to suit yourself. For example, Ed Keefe added four more fields to his version of the database: one check box for documents he'd Read and one for Missing documents, along with two date fields (Start and End) to indicate when he'd started and finished reading a document. He also included a couple of subsets, Author and File Name. Each of these subsets contained all the records in the database, but they displayed the list of records sorted by Author and by File Name. This strategy let him return to the default view of the database with the click of a few buttons rather than the laborious process of re-sorting the records in a spreadsheet.
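Ed's additions translate naturally into a sketch as well. The Python below is an illustration only, not his actual database definition: the class and function names are hypothetical, and the assumption that the default view is sorted by title is mine. The point it shows is that a subset here is just a saved sort order over the same records, so switching views never disturbs the data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ReadingRecord:
    title: str
    author_last: str
    file_name: str
    read: bool = False            # check box: document has been read
    missing: bool = False         # check box: document is not on the CD
    start: Optional[date] = None  # date reading began
    end: Optional[date] = None    # date reading finished


# Each subset holds every record; only the sort order differs.
# Assumption: the default view is sorted by title.
SUBSETS = {
    "Default": lambda r: r.title.lower(),
    "Author": lambda r: (r.author_last.lower(), r.title.lower()),
    "File Name": lambda r: r.file_name.lower(),
}


def view(records, subset="Default"):
    """Return all records in the order defined by the named subset."""
    return sorted(records, key=SUBSETS[subset])


# Hypothetical sample rows; file names are invented for illustration.
library = [
    ReadingRecord("Tarzan of the Apes", "Burroughs", "tarzn.txt"),
    ReadingRecord("Christmas Carol", "Dickens", "carol.txt", read=True),
    ReadingRecord("Tale of Two Cities", "Dickens", "2city.txt"),
]
```

Calling `view(library, "Author")` and then `view(library)` returns to the default order without touching `library` itself, much as Ed could switch back to the default view with a few key presses.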

There is no reason you can't take my database and customize it to your needs.