Russian Digital Libraries Journal

Russian Digital Libraries Journal - 1999 - Vol 2 - Issue 4


CEDARS. CURL Exemplars in Digital ARchiveS

Peter Fox
Cambridge University Library


For many years librarians and conservators have been studying and solving the preservation problems of paper-based information resources. By now these problems are, on the whole, well understood, and the principal constraint on an institution's preservation policy is usually finance. As libraries and archives all over the world move towards ever greater dependence on electronic sources of information, we encounter preservation problems of a completely different order of magnitude and a completely different type.

In this paper I shall describe a specific project, currently under way, that seeks to address some of the problems of the long-term preservation of electronic information. This project, called CEDARS (CURL Exemplars in Digital ARchiveS), is based in the United Kingdom, but those of us involved with it are conscious that this is a global problem, and we are working closely with colleagues in other countries, particularly in the USA and Australia.

From a preservation perspective, electronic information resources differ from their print equivalents in a number of crucial respects. If a traditional conservator has a book or manuscript in front of him, he can usually assess the problem or, by means of a few tests, can measure it. The problems of brittle paper are serious ones, but, nevertheless, the time-scale over which most papers become brittle, if they are stored in reasonable conditions, is at least several decades, if not centuries. With electronic information the time-scale is much shorter.

Electronic resources are distributed to libraries either via a carrier such as a CD-ROM or via a network. In the former case a library wishing to archive the data will have to transfer them to a suitable medium and they will then have to be regularly refreshed or migrated. In the latter case the library's normal right is that of access only. It is likely that the database is held by or on behalf of the publisher and there is a risk that, if the publisher goes out of business or decides that the database is no longer producing a sufficient financial return, it will be lost to scholarship in a way that does not happen with traditional print-on-paper publishing. The research library community is now considering how, in the digital environment, it should exercise the responsibility which it has always undertaken with respect to the preservation of research materials on paper. Should libraries undertake the long-term preservation of electronic information resources? Should new digital archives be established? What are the issues of access, retrieval, etc.? What standards should be adopted? And what are the cost implications?

A serious and active interest in digital preservation among the library community is a relatively recent phenomenon. In 1995 the Research Libraries Group (RLG) and the Commission on Preservation and Access (CPA) in the USA established a joint task force to consider this issue; the task force published its report the following year. In the UK, in November 1995, the JISC (Joint Information Systems Committee of the Higher Education Funding Councils) addressed the question of the preservation of digital media by holding a national conference in Warwick, where a number of action points were identified. The first of these was an analysis of the CPA/RLG report to identify those recommendations in it which were relevant to the UK situation and where research could usefully be undertaken to complement but not duplicate that in progress or planned in the USA and Australia. This analysis resulted in the holding of a seminar in December 1996 at the British Library, which was attended by representatives from the library and archive profession, data archives, and publishers, and where it was agreed that the JISC would fund seven studies on digital archiving, in collaboration with the National Preservation Office, the library, archival and publishing communities.

Those seven JISC/NPO studies were published during 1997 and 1998. A report of the seminar was the first of these studies. The next was an attempt to develop a framework of data types and formats, in order to indicate the likely problems, requirements and responsibilities appropriate to each category, and to identify the most appropriate method of preservation for each category. The third was an investigation into opinions on the responsibility for maintaining an archive of digital materials. In the traditional area of publishing it is quite clear where this responsibility lies: publishers do not regard it as residing with them, and if libraries wish to preserve the books or journals they have bought, then it is their responsibility to do so. In electronic publishing the issues are not nearly as clear. In many cases, for example, libraries do not hold the database - that resides with the publisher. What happens if the publisher goes out of business or loses interest in maintaining the database because the income stream from it has dried up? The report recommended that a national body be established in the UK to co-ordinate such archiving and that it should be funded from the public sector, with an extension of legal deposit legislation to cover electronic publications.

Closely related to the 'framework' study was a comparison of preservation methods and costing models, which aims, on the basis of a matrix of data types, to draw up a decision model to assess the most appropriate method of long-term preservation and to produce a further model for comparing the costs of the preferred methods of preservation. A strategic policy framework was developed in another of the studies, which examines how different organisations are approaching the key stages in the life-cycle of digital resources, from creation, through access, to preservation. Universities and the funding agencies which support scholarly research are major sponsors of digital resource creation and, therefore, have a responsibility for ensuring that the research they help to create is preserved on a long-term basis. A further study aims to assess the extent of the creation of these digital resources, to investigate what provision is being made now for their preservation and to ask what the future needs of these bodies are with regard to digital preservation. Finally, the question of post-hoc rescue, or digital archaeology, is being addressed. We are already in a situation where some data appear to be inaccessible due to the obsolescence of the hardware or software required to read them. It is necessary to study and identify whether there are already major bodies of data which are at risk and what needs to be done to rescue them.

These were all fairly short-term studies, but, given the relative paucity of research in this area, it was considered essential for this basic work to be undertaken before any commitment to funding for practical experiments was made available. The reports are also quite technical, and so the group that managed the programme is also publishing a short book aimed at a non-specialist audience, to try to get the message across to policy makers that when they fund projects that will create digital information, they should also ensure that the projects have a strategy for preserving those data. The book will be published in June 1999 at a seminar which will be attended by representatives of bodies that support the creation of digital information such as universities, the Heritage Lottery Fund, the higher education funding councils, research councils and the Library and Information Commission. The CEDARS Project follows closely from this group of studies and will be participating in the seminar in June.

The CEDARS (CURL Exemplars in Digital ARchiveS) Project itself began in April 1998 when the JISC, as part of its Electronic Libraries Programme (eLib), decided to build on the studies and to continue its support for digital preservation. CEDARS was intended to draw heavily on the JISC/NPO studies and to move that work on from a theoretical to a practical level. The Project, which runs for three years and is funded by the JISC, is based principally at the universities of Cambridge, Leeds and Oxford, but has a number of other participants, including the British Library, the National Preservation Office and several publishers. Through collaboration with the Research Libraries Group in North America, it will pay close attention to work under way elsewhere, and RLG is supporting complementary work in the USA. UKOLN, the UK Office for Library and Information Networking, is also a partner, with a specific remit to work on the development of metadata standards.

The fundamental aim of the project is to investigate and develop strategies to ensure that the digital information resources typically included in library collections may be preserved in the long-term, alongside other, traditional, non-digital materials. In order to achieve this, the project has three main objectives:

  • to promote awareness about the importance of digital preservation, both amongst university libraries and their users and amongst the data supplying communities upon which they depend
  • to identify, document and disseminate strategic frameworks within which individual libraries can develop collection management policies which are appropriate to their needs and which can guide the necessary decision-making to safeguard the long-term viability of any digital resources that are included in their collections
  • to investigate, document and promote methods appropriate to the long-term preservation of different classes of digital resources typically included in library collections, and to develop costed and scaleable models.

The range of possible digital resources for inclusion in the project is enormous - text, sound, pictures, moving images, etc. The project has decided to restrict itself to a limited number of categories, in order to identify techniques which can be generalised and extended to cover the full range of digital materials.

These resources include, in various formats:

  • digitised manuscripts from Cambridge University Library and the Oxford Celtic Manuscripts archive, and databases such as the Internet Library of Early Journals (ILEJ), an eLib project led by the University of Oxford, which is digitising runs of eighteenth- and nineteenth-century journals, in some cases from the originals and in some cases from microfilms
  • commercial CD-ROMs, such as the Canterbury Tales published by Cambridge University Press, and commercial online databases like Chadwyck-Healey's English Prose Drama
  • electronic books and journals from commercial publishers such as Routledge and Cambridge University Press
  • online databases, including the International Bibliography of the Social Sciences (produced at the British Library of Political and Economic Science at the London School of Economics), and Periodicals Contents Index, another Chadwyck-Healey publication
  • ephemera, such as BUBL, a series of email discussion lists, and a number of sample Websites and subject gateways.

The key feature of the materials selected for the project is that they fall within the traditional preserve of the research library and would be part of any research library's normal collecting and preservation policy if they were in paper format. In electronic form they are likely still to be part of that library's collecting policy, but their preservation is a matter which most libraries have still not come to grips with, and with which the CEDARS Project aims to assist. It has been fundamental to the project from the outset that no attempt will be made to preserve the physical object: we can see no case for creating an archive of CD-ROMs, for example. What the project aims to preserve is the intellectual content, regardless of its carrier.

Initially the project identified three 'flavours' or types of material:

  • dynamic data
  • primary sources
  • 'intertwingled' data, by which we mean digital resources where the intellectual content is bound to the structure of the carrier, such as CD-ROMs.

These categories of material were selected because each of them raises different issues and problems. Dynamic online databases, for example, present issues of selection and sampling. Subject gateways provide access to other electronic resources which are sometimes ephemeral. How do we preserve them? Should we even try? Probably the most difficult technical problem is presented by some CD-ROMs: how does one ensure that the data can be removed from the CD to a more permanent storage medium without loss of the context in which the data were created?

It was originally planned that each of the three sites would take responsibility for one of these three technical areas: Cambridge would deal with dynamic data, Oxford with primary sources, and Leeds with 'intertwingled' data.

In addition to the technical problems there were also what one might describe as 'library' issues, particularly those of rights management and metadata - resource discovery and resource description. These issues were also allocated to the three sites, with Cambridge taking responsibility for rights management issues, Oxford for metadata issues, and Leeds dealing with emulation as a preservation strategy for their 'intertwingled' data.

As the project developed over the first year, it became clear that progress was being made with the technical issues and that the organisational ones now needed to be addressed in greater depth than the original plan had allowed for. It was apparent that probably the most intractable problem of all was rights management. In the United Kingdom, the situation may well be altered by the possible introduction of legal deposit legislation for electronic materials, which the government has promised without giving a date. The question remains, however, as to what rights a library or digital archive might reasonably expect from a data provider: rights which will provide adequate long-term security for the database without undermining the financial position of the publisher. It also became evident that more attention had to be given to the whole question of collection management: how does one select what is to be preserved? Does each library make this decision individually or as part of a national or even international strategy? If so, how is such a strategy to be constructed?

This led, following the review at the end of the first year of the project, to a reassessment of the responsibilities of the three sites and a greater emphasis on these organisational questions. It was agreed that the technical issues would be dealt with by Leeds and Oxford, with some continuing input from UKOLN on the question of metadata, and that Cambridge would concentrate on the issues of collection management and rights management. Originally both the university library and the computing service were involved at each site; now the balance has shifted, so that at Leeds the input is primarily from the computing service, at Oxford each part of the university is involved, and at Cambridge it is entirely a library project.

The project has just passed the end of its first year and has almost two more to run. By the time of its completion in 2001 it is hoped that it will have produced a series of strategies for preservation which are grounded in realistically-scaled projects and contain a representative selection of a good spread of significant resources. It is intended that the strategies should be capable of being adapted to other electronic media, including those that have not yet been developed. These strategies will probably be based on one or more possible models of what a digital archive might look like, how it will be organised, what sort of services it will provide, and to whom. Should there, for example, be one such archive acting as a national resource, should each research university establish an archive, should a number of institutions each concentrate on particular types of data resource? These are all questions which CEDARS is addressing. I expect that in some cases it will come forward with a clear recommendation; in others it will set out a series of options and allow the relevant bodies to select from among those options, but in the knowledge (which is not available to them at present) of the implications of making any particular choice. These implications will include an assessment of the costs involved and the resources required. In some cases it could well be that the project will recommend that certain types of digital resource should not or cannot be archived, either because of the technical difficulties of doing so or because their intellectual or long-term research value is not sufficiently high to justify the expense of preservation. A variation on this might be that it is agreed that certain forms of digital resource are simply left in the form in which they are currently produced, because it is too expensive at present to take the necessary steps to ensure their long-term preservation but technical advances might make such steps cost-effective in the future.

The work under way at Oxford, in collaboration with UKOLN, will provide more information on the whole question of metadata and how metadata is attached to digital objects.

CEDARS is essentially a research project of international importance and dissemination of its results forms a major part of the project's work. As part of these activities, CEDARS is contributing to seminars such as this one, and is producing working papers and guides to good practice. Some of these are already available on the website. It is also a very dynamic project that has not yet reached its halfway point; the issues are still being identified and, though the overall direction of the project is clear, the detailed development is a fluid one and we welcome contributions to the discussions on the difficult technical and managerial issues being raised. The website also provides links to other similar projects, particularly the PADI (Preserving Access to Digital Information) site run by the National Library of Australia and the RLG Digital Initiatives web page. I would encourage anyone here who would like to become more involved to look at the CEDARS website and to contact the project team.


© Peter Fox, 1999


Last update: 2003-12-09
