Converting Scanned Images of the Print History of the World to Knowledge: A Reference Model and Research Strategy
Within the coming decade, the world’s printed heritage will be digitized. If national governments play a pro-active role, this process could have massive potential benefits for human development through democratization of access to the full range of printed materials. The requisites for success are comprehensive digitization programs that make page images openly available and encourage the adding of value by combining OCR of full text with the encoding of meaning carried in typographic conventions, giving words the context of their functions within documents – as titles, references, captions, etc.
Linking scanned pages to bibliographic metadata and exploiting OCR to encode character strings are well-understood methods of adding value to scanned books. But more research is required to extract useful knowledge of typographic (printing and page layout) conventions for use in encoding images of printed books. This paper examines some of the precepts of encoding information embedded in print conventions and explores how the resulting knowledge-bases can be exploited, together with semantic understanding, to yield vastly enriched cultural content. It recommends strategies at the national level that could ensure that images of printed pages will become interoperable knowledge-bases, leveling access to the printed heritage of mankind for all and ensuring diversity in the digital heritage.
For the past two decades, governments and private interests have funded a number of small, selective book and newspaper digitization projects. Building on strategies developed for microfilming, these projects were most often organized with scholarly advisory committees charged with choosing the works to digitize. Their outputs were hundreds, or a few thousand, books each. Collectively they have scanned hundreds of thousands of volumes. This lack of scale was one of the motivators for the “Million Book Project”, a project at Carnegie Mellon University, though as of June 2004, only 50,000 books had been scanned in that project (Carnegie Mellon University Libraries & Troll, 2005). The dynamics of the field changed in 2004 when Google announced that it would partner with the University of Michigan, Harvard University, Stanford University, The New York Public Library, and Oxford University libraries in a project involving tens of millions of books that would make its information discovery business more attractive (Google, 2005).
The possibility of broad (perhaps comprehensive) commercial access to print heritage has caused national libraries to re-examine their practices in light of a perceived new competition. A review of project costs shows that a significant portion of spending to date has been for selection – choosing which titles to digitize – a discriminatory process, often based on goals, that by definition reduces diversity in collections. Studies at places like the Internet Archive show that with attention to process the cost of digitization on a comprehensive scale can be reduced to less than the cost of purchasing a single copy of the average book. This changes the nature of digitization, and opens up the possibility for digital access to the complete printed collections of national libraries (Internet Archive, 2005).
With modest costs associated with mass digitization, national libraries can begin to plan comprehensive digitization projects on a national scale. Ian Wilson, Librarian and Archivist of Canada, is taking a lead in this area, as he noted in his closing plenary address at the Museums and the Web 2005 conference. Parallel efforts are also being planned in several European countries. The pace of the uptake of this new approach to digitization is such that it seems reasonable today to claim that the world’s printed literature will be fully scanned within a decade (except that the last 2-3% may elude us and require a concentrated effort to fill in the gaps).
Comprehensive digitization has several important consequences. First, once it is completed, every user of a library or the Internet at large could (in principle) have access to the same literature, regardless of the size or remoteness of the community in which they live. This represents access that is more complete than any user has now, regardless of their status or access to the greatest libraries. Secondly, it means that the concept of books being in copyright but “out-of-print” – a category that covers more than a third of the literature of the world – will disappear; any book in digital form could be “in-print”. As a consequence, a new social contract will need to be struck with the holders of copyrights to permit electronic access to most of the literature of the past century. An appropriate model to address this intellectual property issue is an extension of the concept of “public lending right” to provide compensation to the creators of cultural content worldwide based on use rather than perceived value.
Thirdly, comprehensive digitization provides the raw materials – the scanned images of book pages – for a host of value-added services that could provide meaningful uses far beyond today’s limited searching of bibliographic metadata and viewing of book pages on-line. While having the full texts of all books searchable through OCR within a decade is a laudable goal, we need to push for more in order to find the meaning in all that text. We need to use semantic knowledge to create abstracts and indexes, link references and allusions, and reveal the development of national languages as never before possible. Fourthly, we have an opportunity to make the world’s print heritage a universal knowledge-base, by exploiting meaning that is embedded in print conventions to identify, extract and link citations, tables of contents, definitions from reference works, facts from charts and tables, and much else. Such a knowledge-base could lead the researcher to the relevant part of a text, empower the reader with the ability to generate a scholarly edition, expose the history of thought and influences, and ultimately – through the knowledge in multi-lingual dictionaries and thesauri – contribute to the transcending of linguistic barriers and the creation of a sharable cultural heritage of mankind.
Let us explore this fourth contribution further, because it is at present little understood.
Converting Scanned Images
The technologies of print over manuscript provided a great democratization of access to writing. Over the past four hundred years, graphic designers and printers have developed conventions to display text on the page in ways that facilitate efficient reading, in particular by distinguishing between words with different contextual meanings by formatting them differently. The page itself is designed with blocks of text that are visually distinct from each other, by being surrounded by more or less white-space, or offset from other text blocks in alignment. Areas of text may be at the head or foot of the page, in what elsewhere is the margin, or centered in a page or column. Within these text blocks, certain words or groups of words may be in a different typeface (font) or size from those typical within the block or the text as a whole. For example, they may be above, below or beside a line of type, in italics, or underlined. All these clues help humans read: a task we perform first at the level of display conventions, before we “read” the words. From these conventions, we know instantly (from our experience in reading) whether the text we are looking at is a novel, a letter, a scholarly article, a dictionary, a bank statement or other genre. Highly literate people make distinctions between hundreds of common genres and dozens of specialized professional genres. Each of these genres is distinguished by layout and typographic clues; the knowledge they contain can be implied, and located, by understanding and exploiting these clues, independent even of whether we understand the written language.
Some quite structured genres, like reference books, financial accounts, scientific reports and the like, enable us to “read” the contents of a printed text as if it were a database schema. Punctuation marks, spacing, typeface, and letter forms combine to indicate to readers that “entries” of relatively uniform structure are being presented. In other genres of a more narrative nature, a hierarchical definition much like an XML or SGML schema is indicated by less frequent typographic clues, showing where sections of the printed text are divided and identifying areas with different characteristics, for example, having a high level of content significance, being contributed by a subsequent author/editor, or serving as illustrations of the major points. Different sections of printed works are correlated with distinctive layout, so that the table of contents, preface, introduction, acknowledgements, indexes, and references are distinguishable. Within a text, such critical information as captions for illustrations, axes of graphs, headings and subheadings, and the occurrence of lists and their correlations with descriptive properties is signaled through typographic design.
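To make the database analogy concrete, the following is a minimal sketch in Python of how typographic clues alone might turn a dictionary page into structured records. It assumes a hypothetical recognition system that reports each run of text together with its style ("bold", "italic" or "roman"); the `Entry` record, the `runs_to_entries` function, the style names and the sample entries are all illustrative assumptions, not the output format of any existing OCR product.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical input: (text, style) pairs, where style is "bold",
# "italic" or "roman". The style vocabulary is an assumption.
Run = Tuple[str, str]


@dataclass
class Entry:
    headword: str
    part_of_speech: str = ""
    definition: str = ""


def runs_to_entries(runs: List[Run]) -> List[Entry]:
    """Treat each bold run as a new headword, the first italic run that
    follows it as the part of speech, and everything else as definition."""
    entries: List[Entry] = []
    for text, style in runs:
        text = text.strip()
        if not text:
            continue
        if style == "bold":
            entries.append(Entry(headword=text))
        elif not entries:
            continue  # text before the first headword belongs to no entry
        elif (style == "italic"
              and not entries[-1].part_of_speech
              and not entries[-1].definition):
            entries[-1].part_of_speech = text
        else:
            current = entries[-1]
            current.definition = f"{current.definition} {text}".strip()
    return entries


# Invented sample data standing in for one recognised dictionary column.
sample_runs = [
    ("folio", "bold"), ("n.", "italic"),
    ("a leaf of paper or parchment.", "roman"),
    ("quire", "bold"), ("n.", "italic"), ("a gathering of", "roman"),
    ("folded", "italic"), ("sheets.", "roman"),
]
entries = runs_to_entries(sample_runs)
for entry in entries:
    print(entry)
```

The rules here never consult the meaning of the words, only their typography, which is why a sketch of this kind could in principle be applied to a dictionary in a language the operator does not read. Real dictionaries would need many more rules, and probably learned ones.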
Currently, in our digitization efforts, we are throwing virtually all of this information away, by focusing almost exclusively on the translation of print images to “characters” encoded in ASCII or Unicode. The effect, if we continue to overemphasize this dimension of print, will be to create a massive “full text” database of the world’s literature in the languages in which it was written. Admittedly, this would be a significant advance over our current state and provide for better access to our collective print heritage than anyone has today; but it would also be a great shame for having failed to achieve much more.
For centuries scholars, often in religious communities of monks, rabbis and mullahs, and in academia, have engaged in the construction of value-added, cross-textual, knowledge tools, such as concordances, indexes, and full scholarly editions. Today, we can computer-generate many intellectual tools that once required lifetimes of labor; concordances or word usage lists can be created in an instant for any text we digitize. Other tools that have enabled us to effectively build on past understanding include abstracting, citation and bibliographic meta-databases, which could similarly be created by machine though with the need for more understanding of the printed sources. We believe that the functional equivalent of scholarly editions could be generated from information extracted from page images of the world’s printed heritage, propelling human knowledge to new discoveries. On-line availability of these texts would also enable the vast majority of mankind – to date without adequate access to the resources that serve as the foundation of creativity – to contribute new discoveries and perspectives.
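As an illustration of how little machinery a basic concordance now requires, here is a minimal Python sketch that builds a keyword-in-context list from plain OCR text. The function name `build_concordance`, the `width` parameter and the sample sentence are assumptions made for this example; tokenization is deliberately naive (any run of word characters counts as a word) and would need refinement for real texts.

```python
import re
from collections import defaultdict


def build_concordance(text, width=3):
    """Map each lower-cased word to the contexts in which it occurs.

    Each context is the word itself with up to `width` words on
    either side, joined by single spaces.
    """
    words = re.findall(r"\w+", text)
    concordance = defaultdict(list)
    for i, word in enumerate(words):
        left = words[max(0, i - width):i]
        right = words[i + 1:i + 1 + width]
        concordance[word.lower()].append(" ".join(left + [word] + right))
    return dict(concordance)


# Invented sample text standing in for the OCR output of a page.
sample_text = (
    "The printed page carries meaning. Readers decode the page "
    "before they read the words on the page."
)
concordance = build_concordance(sample_text, width=2)
for context in concordance["page"]:
    print(context)
```

A concordance of this kind uses only the character strings; the citation and abstracting tools mentioned above would additionally need the structural knowledge that print conventions carry.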
Two strands of technical achievements with printed texts need to be brought together for us to make this leap. The first is the abstract knowledge of what is “important” to understand about printed texts. This kind of knowledge is reflected in the Text Encoding Initiative (TEI) Guidelines developed by scholars of texts over the past decade for “encoding” digital representations of texts so that formal and abstract properties meaningful to researchers can be “marked” in the digitized text and extracted independently (see http://www.tei-c.org/). The effect, in the thousands of texts that have been thus encoded, is to create a database out of a single text that reflects the concerns of scholarly editors (Burnard, O'Keeffe, & Unsworth, 2005). Despite having been developed for a digital age, however, the limitations of the TEI are those of scholarly editing: only small numbers of texts have been or could be so intensively treated by dedicated scholars. Nevertheless, their having done so with many important texts means we have at hand a grammar for expressing those genre features that have been considered meaningful in printed text analysis and a model for their extraction into larger scale digital library systems (see, for example, Crane, 2005).
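The idea of “marking” recognised features so that they can be extracted independently can be sketched in a few lines of Python. The example below wraps blocks that have already been labelled by role in TEI-style elements (`div`, `head`, `p` and `bibl` are element names used in the TEI Guidelines). The role labels, the `ROLE_TO_TEI` mapping and the function `blocks_to_tei_div` are assumptions made for illustration; the output is a bare fragment, not a complete or validated TEI document, which would also require a header and namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from recognised block roles to TEI element names.
ROLE_TO_TEI = {"heading": "head", "paragraph": "p", "reference": "bibl"}


def blocks_to_tei_div(blocks):
    """Wrap (role, text) pairs in a single TEI-style <div> element.

    Blocks with an unknown role fall back to a plain paragraph.
    """
    div = ET.Element("div")
    for role, text in blocks:
        element = ET.SubElement(div, ROLE_TO_TEI.get(role, "p"))
        element.text = text
    return div


# Sample blocks, as a layout analyser might label them (labels assumed).
blocks = [
    ("heading", "Converting Scanned Images"),
    ("paragraph", "The page itself is designed with blocks of text."),
    ("reference", "Doermann, D. (1998). The Indexing and Retrieval "
                  "of Document Images: A Survey."),
]
tei_fragment = ET.tostring(blocks_to_tei_div(blocks), encoding="unicode")
print(tei_fragment)
```

Once text is marked in this way, every heading or every reference in a collection can be retrieved by element name, without re-reading the page images.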
The second strand of technical contributions comes from work in image recognition systems. In the late 1990s, researchers driven primarily by the potential for capturing structured data from contemporary documents of business, government and health care, and secondarily by interests in automatic generation of bibliographic metadata, indexes and abstracts, and citation databases, made significant progress in page image segmentation and the recognition of structured content from print conventions (Doermann, Rivlin, & Rosenfeld, 1998; Okun, Doermann, & Pietikainen, 1999; Doermann, 1998). More recently, significant progress has been reported in decoding the schemas embedded in the print conventions of multilingual dictionaries (Ma, Karagol-Aran, & Doermann, 2003). Improved techniques in machine learning, exponential increases in processing speed, and decreases in costs of image processing could all contribute to major breakthroughs.
The full requirements for a representation language for the meaning embedded in print conventions and the range of new functionality it would support are not completely known. We do not yet know the extent to which each national literature will require distinctive rules in a representation language, or the best ways to decode conventions as they evolved historically within a given print tradition. Nor can we say whether genres need to be recognized first at the level of the volume, or book section, or whether the patterns will be built up from recognition of individual paragraphs and text blocks. In the field of image understanding, we are not yet confident that page segmentation and statistical analysis of departures from the normative typefaces, spacing and other features are sufficient to acquire all the display conventions with semiotic significance. An iterative programme of analysis and machine learning will be required as we build universal libraries of book images.
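The simplest form of “statistical analysis of departures from the normative” can be sketched as follows in Python. The sketch assumes a hypothetical page segmenter that has already produced text blocks with an estimated font size and an italic flag; it takes the style covering the most characters as the page norm and flags every block that departs from it. The `TextBlock` record, the function names, the `size_tolerance` value and the sample page are all illustrative assumptions, and whether so crude a measure captures the conventions that matter is exactly the open question.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    font_size: float  # points, as estimated by a hypothetical segmenter
    italic: bool = False


def normative_style(blocks):
    """Return the (font_size, italic) pair that covers the most characters.

    Assumes `blocks` is non-empty.
    """
    weight = Counter()
    for block in blocks:
        weight[(block.font_size, block.italic)] += len(block.text)
    return weight.most_common(1)[0][0]


def flag_departures(blocks, size_tolerance=0.5):
    """Return the blocks whose size or slant departs from the page norm."""
    norm_size, norm_italic = normative_style(blocks)
    return [
        block for block in blocks
        if abs(block.font_size - norm_size) > size_tolerance
        or block.italic != norm_italic
    ]


# Invented sample page: a heading, body text, a caption and a page number.
page = [
    TextBlock("Converting Scanned Images", 14.0),
    TextBlock("The technologies of print over manuscript provided a "
              "great democratization of access to writing.", 10.0),
    TextBlock("Figure 1. A page with marginal notes.", 10.0, italic=True),
    TextBlock("Within these text blocks, certain words may be in a "
              "different typeface.", 10.0),
    TextBlock("12", 8.0),
]
flagged = flag_departures(page)
for block in flagged:
    print(block.font_size, block.italic, block.text)
```

On this sample the heading, the italic caption and the page number are flagged while the body paragraphs are not. Deciding which flagged block is a heading and which a caption requires further evidence, such as position on the page, and that is where the iterative programme of analysis would begin.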
When digitizing their print heritage, national libraries and others should make the best possible scanned images available so that initial decoding work can be immediately put to use. Refinements in knowledge extraction methods that will be developed over the decade can be applied to images iteratively. It is important that the meaning made from books scanned early in the process, both by semantic methods and by extraction of structure from form, can be used in the understanding of books scanned later. Encouragement should be provided to those adding value to the repository of digital images of print, including private ventures that might briefly exploit proprietary methods of representation to commercial advantage. As long as the base resource, the digital images of the scanned print literature of the world, remains openly available, such innovation will enhance broader access over time.
What can we anticipate such a commitment would achieve? Unlike any preservation activity ever previously undertaken, the new representations of books will be vastly superior from a functional perspective to what they replace. Simple scanning yields only “fast paper”, and “mere” OCR yields “only” full-text retrieval, but if we digitize printed books using systems that are aware of typographic conventions and that exploit semantic analysis enabled by full-text, we should be able to obtain value-added “knowledge” from image recognition systems. Post-processing against the whole knowledge-base could reveal the historical meanings of words and ideas in older literature to contemporary readers, in foreign literatures to their non-native readers, and in specialized literatures to multi-disciplinary readers. Buildings, cities, people, boats and other subjects with proper names could be linked to data and images that depict them at the appropriate time. Allusions that authors make to earlier texts, and references made to them by later texts, can be linked, as is now done only in the most complete of scholarly editions.
If we take care in our digitization efforts to ensure access for page-display-based content tagging, the heritage of a nation will not simply be available, as print is, volume by volume. Rather it will become integrated at the level of paragraphs, sentences and words in their proper contexts as titles, references, image captions, definitions and the like. With attention to international standards for the representation of this knowledge, anyone will be able to “generate” the kind of view of a text that is now only available for those rare texts that have been the subjects of projects for concordances, indexes, abstracting services, and citation databases. The ability to create scholarly editions of any printed book on the fly will be the final confirmation that we have crossed a threshold. But the power of making universal print available for all does not end there – the social effects are likely to be as enormous as they are now unknown, and the intellectual contribution could be to add many additional kinds of value-added textual products to the repertoire that scholars have created individually in the past.
Universal access to the knowledge contained in the comprehensive printed record of mankind is a goal that can be achieved within a decade. With appropriate public policies and the development and deployment of standards, not addressed in this paper as they are the subject of separate processes, most of this knowledge could be provided free at the point of use while increasing the return to copyright holders over the life of copyright protected publications. With proper attention to the representation of contextual knowledge in print images and the interchange of raw forms of this data, national governments can ensure the emergence of the first integrated and comprehensive resource representing the print literature of the world. Human creativity and capabilities will be unleashed without regard to barriers previously imposed by limited access to information, barriers that effectively kept everyone, no matter how privileged, from having full access to universal print heritage and supported dramatic differences between a very few who had significant access and the overwhelming majority that had virtually none.
- Print heritage should be converted to open source digital collections by national governments, and value-added uses encouraged. Governments should do so to make their cultural heritage more widely available and as a strategy for preservation.
- A “digital lending right” should be created to provide universal access to all out-of-print works, through collaboration between national governments and creative communities. This would remove a barrier to the mass democratization of information access and make a contribution to the survival of some threatened languages.
- Print conventions should be exploited to extract structured knowledge from printed texts and make it available to all. Simply making images of printed works, even with value-added bibliographic metadata and OCR’d text, is not sufficient. In exploiting print conventions, national efforts should develop and employ international standards for text encoding to ensure the construction of a universal knowledge-base of the world’s printed heritage.
Burnard, L., O'Keeffe, K. O. B., & Unsworth, J. (2005). Editors' Introduction. In L. Burnard, K. O. B. O'Keeffe & J. Unsworth (Eds.), Electronic Textual Editing: Modern Language Association and the TEI Consortium.
Carnegie Mellon University Libraries, & Troll, D. (2005, April 28). Frequently Asked Questions About the Million Book Project. Retrieved May 17, 2005, from http://www.library.cmu.edu/Libraries/MBP_FAQ.html
Crane, G. (2005). Document Management and File Naming. In L. Burnard, K. O. B. O'Keeffe & J. Unsworth (Eds.), Electronic Textual Editing: Modern Language Association and the TEI Consortium.
Doermann, D. (1998). The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding, 70, 287-298.
Doermann, D., Rivlin, E., & Rosenfeld, A. (1998). The Function of Documents. Image and Vision Computing, 16, 799-814.
Google. (2005). Google Print Library Project. Retrieved May 17, 2005, from http://print.google.com/googleprint/library.html
Internet Archive. (2005). Canadian Libraries [Project Description]. Retrieved May 17, 2005, from http://www.archive.org/details/toronto
Ma, H., Karagol-Aran, B., & Doermann, D. (2003). Segmenting and Tagging Structured Content. Symposium on Document Image Understanding Technology, 53-64.
Ninch Symposium: April 8, 2003, New York City "The Price of Digitization: Resources", http://www.ninch.org/forum/price.resources.html
Okun, O., Doermann, D., & Pietikainen, M. (1999). Page Segmentation and Zone Classification: The State of the Art, LAMP-TR-036; CAR-TR-927; CS-TR-4079 (Technical Report): University of Maryland.
 David Bearman is President of Archives & Museum Informatics. He consults on issues relating to electronic records and archives, integrating multi-format cultural information and museum information systems and is Founding Editor of the quarterly journal Archives and Museum Informatics, published by Kluwer Academic Publishers, in The Netherlands through 2000. Since 1991, he has organized and chaired the International Conferences on Hypermedia and Interactivity in Museums (ICHIM), and more recently also the Museums and the Web Conferences, as well as directing numerous educational seminars and workshops on related topics. Bearman is the author of over 125 books and articles on museum and archives information management issues. Prior to 1986 he served as Deputy Director of the Smithsonian Institution Office of Information Resource Management and as Director of the National Information Systems Task Force of the Society of American Archivists from 1980-82. From 1987-1992, he chaired the Initiative for Computer Interchange of Museum Information (CIMI). In 1989 Bearman proposed Guidelines for Electronic Records Management Policy which were adopted by the United Nations Administrative Coordinating Committee on Information Systems (ACCIS) in 1990 and in 1995 he proposed a Reference Model for Business Acceptable Communications as part of a research program on functional requirements for electronic evidence. He also served as Director of Strategy and Research for the Art Museum Image Consortium (AMICO). For a full vita and publications list see http://www.archimuse.com/consulting/bearman_cv.html
 Jennifer Trant is a Partner in Archives & Museum Informatics. She is co-chair of Museums and the Web and ichim, and has served on the program committee of the JDL Joint Digital Libraries 2001 conference, and the Board of the Media and Technology Committee of the American Association of Museums. Trant served as the Executive Director of the Art Museum Image Consortium (AMICO). She was Editor-in-Chief of Archives and Museum Informatics: the cultural heritage informatics quarterly from Kluwer Academic Publishers from 1997-2000. Prior to joining Archives & Museum Informatics in 1997, Jennifer Trant was responsible for Collections and Standards Development at the Arts and Humanities Data Service, King's College,
David Bearman - President of Archives & Museum Informatics
Jennifer Trant - Partner in Archives & Museum Informatics
© David Bearman, Jennifer Trant, 2005