One of the projects I’ve had the privilege of working on over the last year is Readux [source code on GitHub], which aims to make Emory Libraries digitized books discoverable by collection and available for research, scholarship, and teaching. This work has been all the more gratifying because I worked on an earlier version of the project that sadly never went into production (see more on the project history).
One of the interesting technical challenges has been making both page images and text available and useful. There are plenty of “page-turner” or “book reader” applications (e.g., Internet Archive BookReader, or the IIIF viewer Mirador), but it inevitably it seems they are written in a different progamming language, or difficult to adapt, modify, and integrate into a larger application. The technologies underlying these applications are all probably pretty similar: page images, scans or photographs of physical book pages, with text somewhere that has been generated through OCR (optical character recognition), usually used for search but not display, since OCR tends to be unreliable.
Some of those interfaces (like the Internet Archive BookReader) use the OCR to highlight search terms on the page image; the Google Books interface provides a tool to select a portion of the image and then it supplies the text from that section of the page. This is a pretty neat feature, but it’s not always as useful as you might want, because you can only select images in rectangular shapes, which doesn’t always match the text flow, and the interface doesn’t seem to be designed to display large sections of text.
In Readux, we want the page content to be annotatable—both the images and the text content—so we went a different route. The OCR text is put on the page, invisibly layered over the image. It’s positioned and sized as close as possible to the original text on the image, based on the information encoded in the OCR. The text is kept invisible, because even for the volumes where OCR works well, there are still enough errors that it’s easier to read the image rather than the OCR (and in some cases, the OCR is really quite terrible and unreadable). Putting the text on the page makes it easy for users to take advantage of web browsers’ normal text functionality, such as the ability to search on the page’s text or highlight it, either to copy it for pasting elsewhere or for annotation.
I recently learned about Project Naptha, a new browser-based tool to identify and expose text within images. The result, when it works well, looks remarkably similar to what we’re doing in Readux—except that, instead of trying to find the text in the image after the fact and on the fly, we’re providing access to the text data that has already been generated.
How it works
To start with, we wanted the text and position data in a common format. So we wrote XSLT to convert our OCR XML to TEI facsimile. Because it’s intended to document digital images of source materials, TEI facsimile includes tags and attributes that can fully describe the position and size of blocks and text on the page. This maps quite well to the positional data in our OCR. Our digitized volumes already include two major kinds of OCR XML output: two versions of Abbyy OCR, and the newer METS/ALTO. As technologies and standards change, we may eventually have other formats; but as long as we can translate the information to TEI facsimile, we have a common format that we can use.
Sometimes it’s easier to understand things by looking at examples, so here are some links for sample content in Readux and some of the code I mentioned. If you’re interested in the HTML layout and positioning, use your browser developer tools on some Readux volume pages to inspect some of the text content.
- A page of music from 1911 Original Sacred Harp. This volume has one METS/ALTO XML file for each page, with word-level position information.
- A page of text from Ladies First!. This volume has one Abbyy OCR XML for the entire volume, with line-level position information.
- Custom Django template tag code that does the majority of the position and size calculations