Linux to help the Library of Congress save American history

The Library of Congress, where thousands of rare public domain documents relating to America's history are stored and slowly decaying, is about to begin an ambitious project to digitize these fragile documents using Linux-based systems and publish the results online in multiple formats.

Thanks to a $2 million grant from the Sloan Foundation, "Digitizing American Imprints at the Library of Congress" will begin the task of digitizing these rare materials -- including Civil War and genealogical documents, technical and artistic works concerning photography, scores of books, and the 850 titles written, printed, edited, or published by Benjamin Franklin. According to Brewster Kahle of the Internet Archive, which developed the digitizing technology, open source software will play an "absolutely critical" role in getting the job done.

The main component is Scribe, a combination of hardware and free software. "Scribe is a book-scanning system that takes high-quality images of books and then does a set of manipulations, gets them in optical character recognition and compressed, so you can get beautiful, printable versions of the book that are also searchable," says Kahle.

While previous versions were written for both Linux and Windows, the Internet Archive has migrated Scribe entirely to Linux, and Windows support has been dropped. Kahle says the project uses Ubuntu now.

When asked why the Library of Congress chose Scribe for this project, Dr. Jeremy E. A. Adamson, the library's director for collections and services, replies that the Internet Archive has already demonstrated "the efficient production of high-quality images" with it.

Kahle says that a Linux-based Scribe workstation at the Library of Congress will hold the material to be scanned in a V-shaped cradle -- it doesn't crack books all the way open -- while two cameras take images of it. A human operator performs quality assurance, then Scribe sends the digital images across the country to the Internet Archive in San Francisco, where they are processed and eventually posted online in various formats. Free software is used almost every step of the way.

"[It's a] Linux-based station out there in the field. It rsyncs the files up to the servers, [and then] it goes and does the processing on a Linux cluster of over 1,000 machines, and then posts it online -- also on Linux machines," Kahle says.

Image processing for an average book takes about 10 hours on the cluster. The project still uses proprietary optical character recognition (OCR) software, but Kahle says that many open source applications come into play, including the netpbm utilities and ImageMagick. The software performs "a lot of image manipulation, cropping, deskewing, correcting color to normalize it -- [it] does compression, optical character recognition, and packaging into a searchable, downloadable PDF; searchable, downloadable DjVu files; and an on-screen representation we call the Flip Book."
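
For readers who want a feel for what the upload and cleanup steps might look like in practice, here is a minimal Python sketch. It is not the Internet Archive's code: the article does not say which rsync options or ImageMagick operations Scribe uses, so the function names, the host ingest.example.org, the file paths, and the chosen options are all assumptions made for illustration. The sketch only builds the two command lines and prints them; the helper run() would execute one on a machine with rsync and ImageMagick installed.

```python
import shlex
import subprocess


def build_upload_command(book_dir, remote):
    """Return an rsync argument list that copies one scanned book's folder.

    Hypothetical helper: the options are ordinary rsync flags, not ones
    the article attributes to Scribe.
    """
    # A trailing slash makes rsync copy the folder's contents, not the folder.
    source = book_dir.rstrip("/") + "/"
    return [
        "rsync",
        "--archive",   # recurse and preserve timestamps and permissions
        "--compress",  # compress data in transit
        "--partial",   # keep partly transferred files so a retry can resume
        source,
        remote,
    ]


def build_cleanup_command(src, dst, deskew_threshold="40%", trim_fuzz="10%"):
    """Return an ImageMagick argument list that deskews, crops, and normalizes.

    Assumed mapping of the steps Kahle lists onto ImageMagick options;
    contrast normalization stands in for his "correcting color".
    """
    return [
        "convert",                    # ImageMagick 7 also provides this as "magick"
        src,
        "-deskew", deskew_threshold,  # straighten a slightly rotated page
        "-fuzz", trim_fuzz,           # tolerance for what counts as border color
        "-trim", "+repage",           # crop uniform borders, reset canvas offset
        "-normalize",                 # stretch contrast to the full range
        dst,
    ]


def run(command):
    """Execute a command list; raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical paths and host; the commands are printed, not executed.
    upload = build_upload_command(
        "/scans/book_0001", "ingest.example.org:/incoming/book_0001/"
    )
    cleanup = build_cleanup_command("page_0001.tif", "page_0001_clean.tif")
    print(shlex.join(upload))
    print(shlex.join(cleanup))
```

Keeping the command construction separate from execution means the pipeline can be inspected or logged without touching any files, which is useful when a single book takes hours to process.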

The Flip Book is used at The Open Library, a charmingly retro Web interface for online books that mimics old technologies (clicking "Details" for a title brings up a yellowed card catalog entry), which the Internet Archive says was "inspired by a British Library kiosk."

The books are stored in the PetaBox, the Internet Archive's massive million-gigabyte (one-petabyte) storage system, which Kahle says is "all built on open source software."

Caring for brittle books

A good number of the historic materials in question are old, fragile, and in such rough shape that placing them in Scribe's cradle, or even attempting to read them, could irreparably damage them. Some of the books, Adamson says, have pages "that have become brittle with age." The materials are in a broad range of conditions that limit their physical handling, and he uses the general term "brittle books" to describe them all. No list of such brittle materials at the Library of Congress has been made, but Adamson says that "they comprise a percentage of virtually every collection." The project's objectives, he says, are to develop a more formal classification and description of these "brittle" materials and to "establish digitization workflows based on that classification of condition."

If scanning the brittle materials demands new software and digitization techniques, the Library of Congress will work in conjunction with the Internet Archive to make the innovations available to the public. But there's no way to know at this point what those innovations may be, because the project is only getting under way.

"The project proposal calls for months of planning before any scanning or engineering is to begin," Adamson says. And the planning, he says, is "significant": "Space needs to be prepared to accommodate the physical scanning of books, server storage allocated, project plans need to be written, project team members briefed, along with myriad other details required for a project of this magnitude and complexity."

Eventually, Adamson says, when the scanning and processing of materials have been completed, the high-quality digitized versions of these historic documents (and metadata associated with them, such as indices and contents) will be freely accessible online -- which Kahle says is a "huge step" in broadening the reach of the ever-too-small public domain.

"There may be public domain books that are sitting on shelves, but if you can't get access to [something], what good does it do to be in the public domain?" says Kahle. "The Library of Congress is dedicated to keeping [these digitized holdings] public domain, which I think is a great step that's not being followed by everybody else."

The program is part of larger efforts at both institutions: at the Library of Congress, to preserve old media and records, and at the Internet Archive, which is already scanning public domain materials through its Open Content Alliance, a consortium of about 40 libraries. Kahle says that the alliance is presently operating in five cities, using Scribe, at a brisk clip of 12,000 books a month.

"We're part of the 'open world' through and through -- we use open source software, we generate open source software, we generate open content," says Kahle. "We're trying to take this open source idea to the next level, which is open content and open access to cultural materials, which means 'publicly downloadable in bulk.' I think we're really seeing the next level up of this whole movement -- we had the open network, then open source software, now we're starting to see open source content."