Long Term Archiving of Distributed Documents

Recommendations of the European Physical Society EPS

Status: 19 March 2002; Version 0.3

Draft for the EPS Action Committee for Publication and Scientific Communication ACPusC
(23.3.2002, Berlin)

Motivation

Long term archiving of scientific documents is one of the key prerequisites of research and teaching: it secures professional access to the experimental and theoretical findings of the past.

This is a recommendation for a rather specific but essential challenge:

Scientific documents are the small subset of documents that are written by professional scientists and that have been refereed, or quality certified by scientists by some other means, as useful, necessary, or worthwhile to archive for the long term.

Long term archiving means archiving for at least the coming decades. This differs from the simpler task of storing documents, keeping them accessible on web servers, and retrieving them, which works only as long as the documents can still be read in the format they were written in, with the same browsers and operating systems as at the time they were created.

For short and medium time periods, a well organized and professional system for producing, managing, and distributing scientific documents has been developed over the last hundred years by commercial and learned society publishers, distributors, and public and university libraries. This system is driven by the needs of living scientists, who are willing to pay the publishers for their services through one direct or indirect business model or another.

Long term archiving means archiving now for later: for future generations, and for times when publishers might have gone out of business, or are no longer interested in or fit for this task.

Scientific documents are distributed: they are created at the physics institutions spread around the world (about 5,000) and stored at a multitude of libraries, professional institutes, and institutions. Distributing scientific documents is well organized and uses the same browsers as the retrieval of documents.

But the task of solving this challenge falls to each generation at the time its documents are created; that means: to us.

Memento: Past generations did think about the problem of long term archiving, but their solutions were not really optimal, as seen from today. We might equally fail, but we cannot refrain from trying.

Physics is a learned field with specific requirements, some of them quite different from those of other learned fields.

On the other hand, the challenge of long term archiving of physics documents has many similarities with that for documents of other fields: they should be technically available, retrievable, accessible, and readable. This means that if e-formats, operating systems, or browsers change, the content has to be physically translated to the e-formats, operating systems, and browsers then in use.

Thus not only an image of the original document is to be archived, but the full information necessary to restore the content in any future representation. That is, the whole information given by the creator/author is to be archived (XML, MathML, LaTeX, etc., not just the .pdf or .ps file).

In the long run considered here, this can only be guaranteed if the recovery process does not depend on proprietary information that might have been lost over time, e.g. because the company that kept it went out of business, or the like.

Even now, documents written in one version of Word, StarOffice, or PDF cannot always be read by other versions.
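
The requirement above (archive the author's source in open formats, not just a rendered image) can be illustrated with a small check on the files of a document package. This is only a sketch: the function names and the extension lists are assumptions made for illustration, not an agreed classification.

```python
from pathlib import PurePosixPath

# Illustrative extension lists (assumptions, not an official classification).
OPEN_SOURCE_FORMATS = {".xml", ".mml", ".tex"}   # XML, MathML, LaTeX
RENDERED_ONLY = {".pdf", ".ps"}                  # images of the document
PROPRIETARY = {".doc", ".sdw"}                   # e.g. Word, StarOffice Writer


def check_package(filenames):
    """Report which kinds of formats occur in a document package."""
    suffixes = {PurePosixPath(name).suffix.lower() for name in filenames}
    return {
        "has_open_source": bool(suffixes & OPEN_SOURCE_FORMATS),
        "rendered_only": sorted(suffixes & RENDERED_ONLY),
        "proprietary": sorted(suffixes & PROPRIETARY),
    }


def is_archivable(filenames):
    """True if an open source format is present and nothing is proprietary."""
    report = check_package(filenames)
    return report["has_open_source"] and not report["proprietary"]


if __name__ == "__main__":
    print(check_package(["paper.tex", "paper.pdf"]))
    print(is_archivable(["paper.pdf"]))
```

A package that contains only a .pdf or .ps file fails the check, since the rendered image alone does not carry the full information of the author.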

In physics, what primarily needs long term archiving is not necessarily the original documents, peer reviewed or not, but the full content of the actual program codes, the experimental data, and the physics content of documents, in as much as the authors, their institutions, or the physics community want it archived for the long term. This calls for a large effort of permanently reconditioning information: quality filtering it, e.g. by multiple levels of vetting, peer reviewing, commenting, and condensing, and especially rewriting the findings and results so that the content can be used effectively in current research and teaching, in the language, formalism, units, and understanding then in use.

Key points for Long Term Archiving in Physics as seen from the author/user

  1. Analysis:
    1. Physics is the science of the laws of nature, explored by quantitative experiments and mathematical modelling.
    2. Knowledge needs to be archived in the most applicable, retrievable, complete, understandable, effective, professional, and quality filtered form, not just as the authors originally handed it in.
    3. Research is served mostly by communication means: alerting, preprints, recently reviewed articles, professional search engines. Thus there is no apparent need for long term archiving in physics. But in the heat of communication we do not know which part will finally be picked as worthwhile to archive, so the creator of a document already has to meet the standards necessary for long term archiving of digital documents in physics (LTADDP).
    4. LTADDP is needed mostly for person-related reasons:
      1. history of science
      2. hall of fame
      3. examination works as the entry to professional life
  2. Actions for LTADDP:
    1. Physics societies urge their members to recondition their past contributions to physics into a condensed, compact, complete, most accurate, readable, and re-usable form.
    2. Publishers stimulate authors to write up present knowledge in a condensed form.
    3. Libraries teach and enforce the usage of metadata.
    4. Research institutions support the development of intelligent content search and retrieval.
    5. International agreements are needed to standardize the workflow, the multiple quality filter steps, and their metadata flow, for the scenario from "first publish, then evaluate" to the final condensed long term archiving.
    6. Enforce a copyright policy under which the first publication has to be open, either by the author on a preprint server or by his/her institution and its library.
    7. Ensure that only open, non-proprietary document formats are used for submitted documents.
  3. Some steps taken by EPS
    1. Metadata: EPS promotes the strict use of metadata and works on practical tools. In PhysNet of EPS, the MyMetaMaker web form helps the author/library add correct DC (Dublin Core) metadata to the document. RDF and XML encoding is in progress (CURD).
    2. OAi (Open Archives Initiative): PhysNet of EPS is a registered OAi data provider (as of 13.4.2001) and serves as an OAi service provider (since 12.2001). The concept is that the authors' institutions archive the documents themselves, with a connection to the repositories of the physics societies (IoPP) for the prime refereed papers as the long term archive.
    3. PhysDep of EPS provides a synoptic keyword matching for physics (PACS) from and to mathematics (MSC).
    4. Responsibility stays with the author and his/her institution: a distributed document provider, PhysDoc, has been operating for the EPS for some years, with surf-or-search access to about 100,000 physics documents residing on the websites of about 2,000 physics institutions worldwide. It is served by the EPS through an international consortium of technicians, led by ISN. PhysDoc by now comprises the distributed documents at physics institutions and ArXiv, and leads to all IoPP documents.
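
As an illustration of the metadata step in the list above, the following sketch turns a simple record into Dublin Core metadata embedded in an HTML header, using the common `DC.element` naming for `<meta>` tags. It is not the MyMetaMaker implementation; the function name `dc_meta_tags` and the sample record are assumptions made for this example.

```python
from html import escape

# The fifteen elements of the Dublin Core Metadata Element Set, version 1.1.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}
DC_SCHEMA = "http://purl.org/dc/elements/1.1/"


def dc_meta_tags(record):
    """Render a dict of Dublin Core elements as HTML <meta> tags.

    Values may be a single string or a list of strings (repeated element).
    """
    lines = [f'<link rel="schema.DC" href="{DC_SCHEMA}">']
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        if isinstance(values, str):
            values = [values]
        for value in values:
            content = escape(value, quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample record, used only for illustration.
    print(dc_meta_tags({
        "title": "Physics Archiving: Requirements, Perspectives, "
                 "and some Approaches in Germany",
        "creator": "Hilf, Eberhard R.",
        "date": "2001-11",
    }))
```

Rejecting unknown element names is one simple way to enforce the strict use of metadata at the point where the author or library enters it.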

Some additions:

      1. Physics research and teaching need information exchange that is full, instant, and reusable, with no financial barriers or proprietary gateways.
      2. Physics needs interactive, easy to use means of mathematical communication.
      3. Physics needs content search tools: a search for S and one for ∫L dt should yield the same documents, independent of which of the two is written there.
      4. Physics needs very professional short term archiving, with powerful full content search tools that understand the physics content (new intelligent search engines).
      5. Long term archiving needs secure retrieval without loss of information, even if the creator or his/her publisher is no longer there. That calls for non-proprietary formats and protocols.
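
The content search requirement in the list above (the action written as S or as the time integral of L should find the same documents) can be sketched as a mapping from notations to a shared concept. The synonym table, the document data, and all names below are hypothetical; a real system would need to parse the mathematical content (e.g. from MathML or LaTeX source) rather than compare short strings.

```python
import re

# Hypothetical synonym table: one concept, several written notations.
CONCEPTS = {
    "action": ["S", "∫L dt", r"\int L dt"],
}


def _norm(notation):
    """Remove whitespace; keep case, since S and s may differ in meaning."""
    return re.sub(r"\s+", "", notation)


NOTATION_TO_CONCEPT = {
    _norm(variant): concept
    for concept, variants in CONCEPTS.items()
    for variant in variants
}


def search(query, documents):
    """Return ids of documents containing any notation of the queried concept.

    `documents` maps a document id to the list of notations found in it.
    """
    concept = NOTATION_TO_CONCEPT.get(_norm(query))
    if concept is None:
        return []
    return sorted(
        doc_id
        for doc_id, notations in documents.items()
        if any(NOTATION_TO_CONCEPT.get(_norm(n)) == concept for n in notations)
    )


if __name__ == "__main__":
    docs = {
        "doc-a": ["S"],
        "doc-b": [r"\int L dt"],
        "doc-c": ["E = m c^2"],
    }
    print(search("S", docs))
    print(search("∫L dt", docs))
```

Both queries are resolved to the same concept before the documents are compared, so the result does not depend on which notation the author happened to use.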

Search needs to be independent of source and ownership. Three policy lines are emerging:

  1. Distributed queries, sent out by a search engine to all document providers (publishers).

The advantage of this approach is that it does not need any agreement or consent from the document providers.

The disadvantage is that providers do change their browsing pages and the way they present document information, so the wrappers have to be readjusted from time to time.

  2. The Open Archives Initiative (OAi) concept: OAi defines one unique set of rules for the gateway through which document information is exchanged between document providers and service providers.

The advantages are: all document providers use the same non-proprietary exchange protocol and the same defined set of metadata. They consent to any service provider pulling all sets of metadata. They consent to share all software developed. Thus the document databases complying with OAi are most easily and fully accessible worldwide by anyone, which is what researchers want; such documents are the most likely to be read.

The disadvantages are that the shared metadata set is rather basic, and that there are only a few OAi compliant service providers, since these compete through their add-on services and are not willing to give their software away to competitors. We assume that learned societies and university libraries are the most willing to comply; see the good example of the University of Tübingen, D3P.

  3. Bilateral negotiations between individual document providers and service providers on the amount of exchanged metadata and links.

The advantages are that individual, and thus optimally fitting, agreements can be found, and that the document provider (publisher) keeps all the rights and can define different agreements for different service providers, e.g. depending on whether they are commercial or not-for-profit.

The key part of any such agreement is that the service provider promises not to hand the sets of metadata transferred to it to any other service provider without the explicit consent of the document provider.
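
The OAi policy line described above (one shared gateway through which any service provider may pull metadata sets) can be sketched as follows. The example assumes the request verb `ListRecords`, the metadata prefix `oai_dc`, and the XML namespaces of version 2.0 of the OAI protocol for metadata harvesting; the base URL, the sample response, and the function names are hypothetical. No network access is made: the response is an embedded sample.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces as defined for OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample response of a data provider (shortened).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Physics Archiving</dc:title>
          <dc:creator>Hilf, Eberhard R.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request a service provider sends to a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract identifier, titles, and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [e.text for e in record.iterfind(".//dc:title", NS)],
            "creators": [e.text for e in record.iterfind(".//dc:creator", NS)],
        })
    return records


if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))  # hypothetical base URL
    print(parse_records(SAMPLE_RESPONSE))
```

Because every compliant provider answers the same request in the same format, a service provider needs only one such parser instead of one wrapper per provider, which is the contrast with the distributed query approach.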

References

  1. EPS acts as a document provider through PhysDoc, distributing and retrieving document information from about 2,000 documents and lists of documents, and thus about 100,000 documents with 40,000 entries, all located at the physics institution of the respective author.
  2. EPS also acts as an OAi compliant service provider, PhysDoc-OAD.
  3. Long Term Archiving of Digital Documents in Physics; IUPAP Workshop, Lyon, 4.-6. November 2001.
  4. Eberhard R. Hilf; Physics Archiving: Requirements, Perspectives, and some Approaches in Germany; Session 4, The View of End-User Physicists; IUPAP Workshop LTADP, Lyon, 11.2001; http://physnet.uni-oldenburg.de/~hilf/vortraege/lyon01/