SEPTEMBER/OCTOBER 2006 (Vol. 8, No. 5) pp. 6-10
1521-9615/06/$31.00 © 2006 IEEE
Published by the IEEE Computer Society
Digital Libraries Come of Age
A personal library that fits in your pocket: it will soon exist, if Michael Hart gets his way. Hart heads Project Gutenberg (PG), the Internet's oldest digital library. On 4 July 2006, PG turned 35. Digital libraries in academia and commercial publishing are coming of age right along with it, driven by the latest technology. But to Hart and others involved in digital libraries, the technology is the least interesting part of the story.
Hart created the first PG e-book in 1971, when some friends in a computer lab at the University of Illinois gave him free computer time. To create a document of lasting value, he uploaded the text of the US Declaration of Independence. Now some 20,000 books can be found at http://gutenberg.org—everything from Alice in Wonderland to Zen and the Art of the Internet. Another 80,000 can be found at PG's mirror sites around the world. All books in the project are in the public domain, and most are provided by a devoted cadre of volunteers.
Like typical digital libraries, PG contains two kinds of collections: hard-copy materials (such as books) that have been scanned and converted to plain text via optical character recognition (OCR) software, and newer documents that started out in digital form. Both can be linked by subject, keyword, or metainformation. Images can be tagged with keywords to enable searching in an otherwise plaintext environment.
As PG expands its collection, large brick-and-mortar libraries are trying to digitize their own collections. It's a major enterprise, one that Internet giant Google has taken on. In 2004, Google began a pilot project with the University of Michigan, Harvard University, Stanford University, Oxford University, and the New York Public Library to digitize their collections and make them searchable. Although that project is now the subject of litigation—Google is being sued by the Authors Guild and several publishers for copyright infringement—it's nonetheless pushing digital library technology forward.
Unfortunately, scanning a book can sometimes mean damaging or destroying it. The binding warps text near the crease—what's called guttering—and warped text is hard for OCR programs to read. The result: a text file with garbled words at the beginning or end of every line. One solution is to press the book flat on top of a traditional scanner and crush the binding; another is to rip the book apart and feed it through a scanner page by page. But for rare books, neither option works.
Some desktop scanners have an angled surface, so the spine rests on a peak and one page at a time scans smoothly. Google has taken the care of its borrowed books further, by using a customized scanning machine with a robot hand that delicately turns the pages. Now a University of Kentucky project has yielded a method for scanning books that are too old or damaged to even open (see the "How to Scan a Book [Without Opening It]" sidebar).
Once a book is converted to text, it must be proofread. PG handles this step through a scheme called "distributed proofreading." Volunteers log into the site to proofread an uploaded book; they can read one page or the whole book—whatever they have time for. After one person proofs a page, a second person proofs the proof before it enters the archive.
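The two-pass rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not PG's actual software: the `Page` class, its fields, and the volunteer names are all hypothetical, and the rule that the second proofer must be a different person is an assumption drawn from the description.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One OCR'd page awaiting proofreading (hypothetical model)."""
    text: str
    proofers: list = field(default_factory=list)

    def proof(self, volunteer: str, corrected_text: str) -> None:
        # Assumption: the second pass must come from a different volunteer.
        if volunteer in self.proofers:
            raise ValueError("a second pass needs a different volunteer")
        self.text = corrected_text
        self.proofers.append(volunteer)

    @property
    def ready_for_archive(self) -> bool:
        # A page enters the archive only after two people have proofed it.
        return len(self.proofers) >= 2


if __name__ == "__main__":
    page = Page("Tbe quick brown fox")
    page.proof("ann", "The quick brown fox")   # first pass fixes the OCR error
    page.proof("bob", "The quick brown fox")   # second pass confirms the proof
    print(page.ready_for_archive)
```

Because each page carries its own state, volunteers can work on a single page or a whole book, in any order.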
Most PG books are stored in plain ASCII format. Illustrations from books, including artwork, scientific diagrams, and formulae, are saved as separate image files. Recently, PG began incorporating XML formatting into some documents, so that different reading devices can reformat the text. A text e-book can then become a Web page or a PDF file, with details like italics or images included.
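To see why structured markup makes this kind of reformatting possible, consider a minimal Python sketch that renders one marked-up source as either plain text or a Web page. The tag names (`book`, `title`, `p`, `i`) and the sample sentence are invented for illustration and are not PG's actual schema; the underscore convention for italics in the plain-text output is likewise an assumption.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical markup; PG's real XML vocabulary may differ.
SAMPLE = "<book><title>Alice</title><p>Alice was <i>very</i> tired.</p></book>"


def to_plain(elem):
    """Flatten an element to plain text, marking italics as _like this_."""
    parts = [elem.text or ""]
    for child in elem:
        inner = to_plain(child)
        parts.append(f"_{inner}_" if child.tag == "i" else inner)
        parts.append(child.tail or "")
    return "".join(parts)


def to_html(elem):
    """Flatten an element to HTML, turning <i> into <em>."""
    parts = [escape(elem.text or "")]
    for child in elem:
        inner = to_html(child)
        parts.append(f"<em>{inner}</em>" if child.tag == "i" else inner)
        parts.append(escape(child.tail or ""))
    return "".join(parts)


def render(xml_source, fmt):
    """Render the same source as 'text' or 'html'."""
    root = ET.fromstring(xml_source)
    title = root.findtext("title", default="")
    paragraphs = root.findall("p")
    if fmt == "text":
        return "\n\n".join([title.upper()] + [to_plain(p) for p in paragraphs])
    if fmt == "html":
        body = "".join(f"<p>{to_html(p)}</p>" for p in paragraphs)
        return (f"<html><head><title>{escape(title)}</title></head>"
                f"<body>{body}</body></html>")
    raise ValueError(f"unknown format: {fmt}")


if __name__ == "__main__":
    print(render(SAMPLE, "text"))
    print(render(SAMPLE, "html"))
```

The source file is written once; each reading device only needs its own renderer.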
Hart says that the beauty of PG is in its versatility. "I dare you to find a computer-and-program combination that won't read our books. I would just as soon read an e-book on a computer that was 20 years old as one that has all the latest bells and whistles. And I'm perfectly happy reading them on PDAs and PPCs [pocket PCs]."
For e-books to catch on commercially, people will have to adopt a convenient, affordable way of reading them. Portable game players are one possibility; they have an adequate resolution for reading text. But Hart thinks that the "next big thing" for e-books will be a lot smaller than a Game Boy. "In the next year, there will be a billion new cell phones made," he says. "There will only be 100 million computers made. So, what are people most likely to have in their pocket?" He envisions a future in which people carry personal libraries with them, all the time.
That vision isn't so different from the one that Ed Fox had in 1970, during his undergraduate days at the Massachusetts Institute of Technology (MIT). His advisor, Internet pioneer J.C.R. Licklider, had published the book Libraries of the Future (MIT Press, 1965), which predicted, quite prophetically, that libraries would one day exist as electronic repositories shared by an online community. Fox is now director of the Digital Library Research Laboratory at Virginia Tech, and chair of the IEEE Technical Committee on Digital Libraries (http://ieee-tcdl.org), which formed in 1997.
Over the years, membership in the committee has remained strong, he says, and the annual ACM/IEEE Joint Conference on Digital Libraries (www.jcdl.org) is always well attended. Recently he's noticed how the doctoral consortium has grown: more students are earning their PhDs in digital library science and engineering. "In the beginning, the meetings were about defining what digital libraries are," Fox says, "and now we have doctoral students presenting research results and getting advice on the future."
As to the technical challenges of setting up digital libraries, Fox points to several open-source content management systems that now make it easier, in particular DSpace (http://dspace.org), which the MIT libraries and Hewlett-Packard developed, and Fedora (www.fedora.info), which Cornell University Information Science and the University of Virginia Library developed.
Of course, e-books aren't the only library materials worth preserving: any respectable digital library needs scholarly journals, too. Journal Storage (JSTOR; www.jstor.org), the nonprofit scholarly journal archive, carries nearly 600 titles, the oldest of which dates back to 1665. Subscribers can read files in TIFF, PDF, or PostScript format.
Archives of a different sort can be found at the US National Science Digital Library (NSDL; http://nsdl.org). Created by the US National Science Foundation (NSF), the NSDL compiles K-16 educational resources from NSF-funded projects, educational Web sites, and other digital libraries. Through a new NSF project, computer scientists at Virginia Tech and Villanova University are building a user interface to connect the NSDL to college-course Web sites.
HighWire Press (http://highwire.stanford.edu), a division of the Stanford University libraries, boasts the largest repository of free, full-text, peer-reviewed content online. Citations are hyperlinked, and users can register to receive email alerts when a paper on a particular topic appears.
Connecting the Dots
Michael Keller, director of the Stanford University libraries, says that the financial cost of creating a digital library can be substantial. He expects that Stanford will have to allocate 1.5 Pbytes to store the books that Google is digitizing, and estimates that original digital content created by university faculty and students could fill another 100 Tbytes. The challenge, he says, is for universities to afford that much storage and set it up so that one copy of all documents is kept inviolate while another copy becomes a working file accessible by the readership. Such an undertaking also requires enough working memory and CPUs for digital archivists to layer new services above all that content. He calculates that licensing and subscription costs could easily run US$7.5 million per year, and staff, equipment, and other expenses could tally up to $3.5 million per year.
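For a rough sense of scale, Keller's figures can be tallied in a short Python sketch. The numbers come from the paragraph above; the decimal conversion (1 Pbyte = 1,000 Tbytes), the assumption that the two cost estimates simply add, and the assumption that the inviolate and working copies each hold the full collection are ours, not Keller's.

```python
TB_PER_PB = 1000  # assumption: decimal units (1 Pbyte = 1,000 Tbytes)

GOOGLE_SCANS_PB = 1.5       # books digitized by Google (from the article)
CAMPUS_CONTENT_TB = 100     # faculty and student content (from the article)
LICENSING_MUSD = 7.5        # licensing and subscriptions, US$ million per year
OPERATIONS_MUSD = 3.5       # staff, equipment, other, US$ million per year


def total_storage_pb(scans_pb, local_tb, copies=1):
    """Storage in Pbytes; copies=2 models an inviolate copy plus a working copy."""
    return copies * (scans_pb + local_tb / TB_PER_PB)


def annual_cost_musd(*items):
    """Sum yearly cost items, in US$ million (assumes they are additive)."""
    return sum(items)


if __name__ == "__main__":
    one_copy = total_storage_pb(GOOGLE_SCANS_PB, CAMPUS_CONTENT_TB)
    two_copies = total_storage_pb(GOOGLE_SCANS_PB, CAMPUS_CONTENT_TB, copies=2)
    yearly = annual_cost_musd(LICENSING_MUSD, OPERATIONS_MUSD)
    print(f"One copy: {one_copy:.1f} Pbytes")
    print(f"Inviolate plus working copy: {two_copies:.1f} Pbytes")
    print(f"Yearly cost: US${yearly:.1f} million")
```

Under these assumptions, keeping a protected master alongside a working copy doubles the storage requirement before any new services are layered on top.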
Despite the costs, Keller says that digital repositories have a big payoff for their host institutions. When Stanford digitized its card catalog, circulation of hard-copy materials went up by 50 percent. "That means that students and faculty use the collection 50 percent more because they can search, even in the metainformation about those books, and discover more works of relevance to them," Keller says. "That's a gigantic return on investment."
The legal issues are another hurdle, and he acknowledges that digital libraries will have to work out the appropriate copyright permissions before they can attain their full promise.
Keller says that the life sciences and medicine disciplines have already benefited from online repositories that "connect the dots" buried in masses of text. "I think the same thing will be true in linguistics, anthropology, archeology, and literature," he says. "I think as we look across the arts—music, dance, drama—and across cultures and languages, we're going to see more of these dots being connected. And all that will happen because we can analyze the text across different languages and character sets."
Hart points out that PG already carries some classical MP3s and sheet music. He sees no reason why digital archives can't support all the arts. In fact, he has just returned from a visit to MIT's Fab Lab, where engineers wowed him with 3D printouts—some carved with water jets or lasers, others fabricated from plastic—and now he's thinking about digitized sculptures. "I have printouts of human hands that are so finely detailed that a palm reader could read them," Hart says. "This is a direction for the future. Why should we stop with books? Why should we stop with paintings and pictures? You could print Michelangelo's David, and Donatello's David if you like, and compare them. You could print out every statue in the world."
More About Paul Hasler
Paul Hasler received his BSE and MS degrees in electrical engineering from Arizona State University. He has a PhD in computation and neural systems from the California Institute of Technology. Hasler is an assistant professor at the Georgia Institute of Technology's School of Electrical and Computer Engineering, where he founded the Integrated Computational Electronics (ICE) laboratory, affiliated with the Laboratories for Neural Engineering. Atlanta is the coldest climate in which he has lived.