Aug 19 2008
Project Gutenberg continued…
After hand-formatting the first 50 pages or so of the Project Gutenberg version of “1001 Nights” I came across this neat project called Gutenmark that takes the plaintext and converts it into either HTML or LaTeX-formatted documents for easier reading. I’m downloading LaTeX now to edit said document, as that gives a more print-style layout, and then I can export to PDF where I can read it happily. Gutenmark’s author has an awesome reformatting of Alice’s Adventures in Wonderland where he re-inserted the original (public domain) illustrations (other texts he’s reformatted are here).
Here’s the comparison, with the top being the modified version in Sumatra PDF and the bottom being the plaintext viewed in Firefox:
I was talking (complaining) to Bonnie about this – and about the essence of the Gutenberg project, which seems to be preservation of the written product with ultimate forward compatibility, hence plaintext. However, the text alone isn’t the true product – the layout, the formatting, and the illustrations together are the true product that should be preserved. You lose so much context and enjoyment, assuming you can even get yourself through a plaintext version of the book. Layouts are designed for humans, while I think the plaintext was designed for machines. Accessibility, at least in public health-land, can be described as “the right services for the right people at the right time.” I think that perhaps for Project Gutenberg, accessibility’s right time is the future and its right people are computers. I mean, it’s cool that they started this in the early ’70s by hand-transcribing texts(!) on mainframes, but the average person isn’t really going to enjoy these materials – they’ll check out Google Books and book scans, which I beef about further down…
That being said, as this is for humans to read, why doesn’t Project Gutenberg also create a nicely formatted PDF version for download? They already support a more readable format for Pocket PC-like devices, and they’ve sort of started this by having HTML versions, but if you want to view facing pages like a real book, only a PDF in a PDF reader will really do. It could be a final step in the review process they run with Distributed Proofreaders. Plus, it’s a great opportunity to overshadow the book-scanning projects of Google Books and whatnot. The book scans aren’t “clean” for individual reading (both in font crispness and general page quality), though I think they have their place in a very purist preservation sense. These newly digitized and proofed copies give you an electronic basis for producing a PDF, and are much easier on the eyes – that’s why, I suppose, when I get an e-book copy from a publisher it’s not a scanned copy of the printed book! Plus, adding in the original (if publicly available) illustrations would give some new life to these older books and increase readership. And using the Gutenmark program is truly painless – it took about 10 seconds to do the first volume of 1001 Nights, which is about 600 pages at A5.
The goal, to me, is an intermediate point on the continuum between fully digital (plaintext) and fully analog (book scans): human readability and appeal. Throw in “now” as the right time and you have my take on what the accessibility should be. We want these books read, right?




