Long Term Archiving of Distributed Documents
Recommendations of the European Physical Society EPS
Status: 19 March 2002; Version 0.3
Draft for the
EPS Action Committee for Publication and Scientific Communication ACPusC
(23.3.2002, Berlin)
Motivation
Long term archiving of scientific documents is one of the key
requisites of research and teaching: it ensures professional information
on the experimental and theoretical findings of the past.
This is a recommendation for a rather specific but essential challenge.
Scientific documents are the small subset of documents written by professional
scientists and certified by scientists, through refereeing or some other means
of quality control, as useful, necessary or worthwhile to be archived for the
long term.
Long term archiving means archiving for at least the coming decades.
This differs from the simpler task of storing documents, keeping them
accessible on web servers and retrieving them for as long as this is still
possible in the document format they were written in and with the same browsers
and operating systems as at the time the documents were created.
For short and medium time periods, a well organized and professional system
for casting, managing and distributing scientific documents has been
developed over the last hundred years by commercial and learned-society
publishers, distributors, and public and university libraries.
This system is driven by the needs of living scientists, who are willing to pay
the publishers for their services through one direct or indirect
business model or another.
Long term archiving means archiving now for later, for
future generations,
and for times when publishers might have gone out of business
or might no longer be interested in, or fit for, this task.
Scientific documents are distributed: they are created at physics
institutions spread worldwide (about 5,000) and stored at a multitude of
libraries, professional institutes and institutions. Distributing scientific
documents is well organized and uses the same browsers as retrieving them.
But the task of solving this challenge falls to each generation at the time its
documents are created, that means: to us.
Memento: past generations did think about the problem of long term archiving,
but their solutions were not really optimal, as seen from today.
We might equally fail, but we cannot refrain from trying.
Physics is a learned field with specific requirements, partially quite different
from other learned fields:
- Physics findings are eternal: being either
experimental findings or experimentally proven theoretical structures and
relations, they map nature quantitatively, and this will not change or become
outdated with time.
- Physics findings do not depend on the representation chosen for
transmitting them: they are either mathematical structures, which are
coordinate independent, or experimental number relations. They do not depend
on the phrasing or explanation chosen by their authors.
Thus long term archiving of physics documents does not require
the preservation of the original form, phrasing, format, language,
letter type, etc. (in contrast to some other sciences, where the exact
phrasing and even the graphical presentation have to be
an integral part of the document).
- Conserving physics findings so that they remain
understandable and useful for research in the future does not require
conservation and archiving of all original publications, although these may
be useful information for metaphysics questions such as
'who did what when' (priority), that is, for fields other than physics
such as history of physics or
hall of fame.
For long term archiving in physics, archiving all refereed published
articles is too much and not necessary.
- Long term archived physics information to be useful then
however needs a selfcontained description, a full presentation
of all experimental data to redo the experiment, or the full detail of
the theoretical calculation, - in sharp contrast to the usual
refereed journal publications which are sketchy and confined to a given
size (pages) of presentation.
For long term archiving of physics to archive journal articles is
not sufficient.
- To be understandable after decades, the original presentation of the material
may have to be rewritten and cast into the mathematical
description then in use, or in terms of the experimental tools then available.
Think of the Maxwell equations: for research in the history of physics
it is important that the original papers of Maxwell are accessible, but for
today's physicists they are hardly readable because of the underlying
intention to cast them as equations of 'mechanical understanding'.
Then came the time of writing them in Cartesian components in three
dimensions, then the time of the covariant formalism in the four
space-time dimensions. At present, the coordinate-independent
representation in terms of Cartan differential forms is seen as the one that
best expresses their covariant, coordinate-independent character and is
most suited as a starting point for nonabelian relativistic classical
field theories.
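As an illustration of this sequence of formalisms, the same Maxwell equations can be written in the notations mentioned above. This block is an added sketch, not part of the original recommendation. The choice of Heaviside-Lorentz units with c = 1 and the sign conventions are assumptions made here for compactness; other conventions differ by constants and signs.

```latex
% Cartesian components in three dimensions (Gauss's law shown as one example)
\begin{equation}
\partial_x E_x + \partial_y E_y + \partial_z E_z = \rho
\end{equation}
% Three-dimensional vector notation
\begin{align}
\nabla\cdot\mathbf{E} &= \rho, &
\nabla\times\mathbf{B} - \partial_t\mathbf{E} &= \mathbf{J},\\
\nabla\cdot\mathbf{B} &= 0, &
\nabla\times\mathbf{E} + \partial_t\mathbf{B} &= 0.
\end{align}
% Covariant notation in four space-time dimensions
\begin{align}
\partial_\mu F^{\mu\nu} &= J^\nu, &
\partial_{[\lambda}F_{\mu\nu]} &= 0.
\end{align}
% Coordinate-independent notation with differential forms
\begin{align}
d{\star}F &= {\star}J, &
dF &= 0.
\end{align}
```

The physical content is identical in every block; only the representation changes, which is the point made for long term archiving.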
The challenge of long term archiving of physics documents has,
on the other hand, many similarities with that for documents of other fields:
they should be technically available, retrievable,
accessible and readable.
This means that if e-formats, operating systems or browsers change,
the content has to be physically translated to the
e-formats, operating systems and browsers then in use.
Thus not only an image of the original document is to be archived but
the full information necessary to restore the content in any future
representation. The whole information supplied by the creator/author is to be
archived (XML, MathML, LaTeX, etc., not just the .pdf, .ps, etc. file).
Over the long run in question here, this can only be guaranteed if the
recovery process does not depend on proprietary information, which might
have been lost over time, e.g. because the company that kept it went out of
business.
Even now, documents written in one version of Word, StarOffice or
PDF cannot always be read by other versions.
In physics, what is primarily needed for long term archiving is not necessarily
the original documents, peer reviewed or not, but the full content of actual
program codes, experimental data, and the physics content
of documents, in as much as the
authors, their institutions, or the physics community want it archived for the
long term. This
calls for a large effort of permanent
reconditioning of information: quality filtering,
e.g. by multiple levels of vetting, peer reviewing, commenting and condensing,
and especially rewriting the findings and results so that the content can be
used effectively in actual research and teaching,
in the language, formalism, units and understanding then in use.
Key points for Long Term Archiving in Physics as seen from author/user
- Analysis:
  - Physics is the science of the laws of nature,
    explored by quantitative experiments and mathematical modelling.
  - Knowledge needs to be archived in the
    best applicable, retrievable, complete, understandable, effective,
    professional and quality-filtered form, not just as the authors
    originally handed it in.
  - Research is served mostly by
    communication means: alerting, preprints, recently reviewed articles,
    professional search engines. Thus there is no apparent need for long term
    archiving in physics. But
    in the heat of communication we do not know which part will finally be
    judged worth archiving, so the creator of a document already
    has to meet the standards necessary for Long Term Archiving of Digital
    Documents in Physics (LTADDP).
  - LTADDP is needed mostly for
    person-related reasons only:
    - history of science
    - hall of fame
    - examination works as entry into
      professional life
- Actions for LTADDP:
  - Physics societies urge their members to
    recondition their past contributions to physics into a condensed,
    compact, complete, accurate, readable and re-usable form.
  - Publishers stimulate authors to write up
    present knowledge in a condensed form.
  - Libraries teach and enforce the usage of
    metadata.
  - Research institutions support the development
    of intelligent content search and retrieval.
  - International agreements are needed for the
    standardization of the workflow, its multiple quality filter steps and their
    metadata flow, for the scenario from 'first publish, then evaluate' to the
    final condensed long term archiving.
  - Enforce a copyright policy under which the first
    publication has to be open, either by the
    author on a preprint server or by his/her institution and its library.
  - Assure that only open, non-proprietary document
    formats are used for documents handed in.
- Some steps taken by EPS:
  - Metadata:
    EPS promotes the strict use of metadata
    and works on practical tools.
    In PhysNet of EPS
    the MyMetaMaker web form
    helps the author/library to add correct DC (Dublin Core) metadata
    to the document. Encoding in RDF and XML is in progress (CURD).
  - OAi (Open Archives Initiative):
    PhysNet of EPS is a registered
    OAi data provider (since 13.4.2001) and serves as OAi service provider
    (since 12.2001). The concept is that the
    authors' institutions archive the documents themselves, with a connection to
    the physics societies' repositories (IoPP) for the prime refereed papers as
    long term archive.
  - PhysDep of EPS provides a synoptic keyword mapping between
    Physics (PACS) and Mathematics (MSC).
  - Responsibility stays with the author and
    his/her institution: a distributed
    document provider, PhysDoc,
    has been operating for the EPS for some years, with surf-or-search access to
    about 100,000 documents in physics residing on the websites of about 2,000
    physics institutions worldwide, served by the EPS through an
    international consortium of technicians led by ISN.
    PhysDoc by now comprises the distributed documents at physics
    institutions and ArXiv, and leads to all IoPP documents.
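The Dublin Core metadata mentioned under "Metadata" above can be illustrated with a minimal Python sketch. It is not the MyMetaMaker implementation, whose output format is not described here; it only shows the widely used convention of embedding DC elements as HTML `<meta>` tags. The function name `dc_meta_tags` and the example record are assumptions made for illustration.

```python
from html import escape

# The 15 elements of the Dublin Core Metadata Element Set, version 1.1.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dc_meta_tags(record):
    """Return HTML <meta> lines for the Dublin Core fields in `record`.

    `record` maps element names to a string or a list of strings.
    Unknown element names raise ValueError so that typos are caught early.
    """
    # Declare the DC namespace once, then emit one tag per value.
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        if isinstance(values, str):
            values = [values]
        for value in values:
            content = escape(value, quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Illustrative record only; the field values are not official metadata.
example = dc_meta_tags({
    "title": "Long Term Archiving of Distributed Documents",
    "creator": ["European Physical Society"],
    "date": "2002-03-19",
    "format": "text/html",
})
print(example)
```

Restricting the keys to the fixed element set is a design choice of this sketch: it mirrors the idea of "strict use of metadata" by rejecting misspelled field names instead of silently emitting them.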
Some additions:
- Physics research and teaching need
information exchange that is full, instant and reusable, with no financial
barriers and no proprietary gateways.
- Physics needs interactive, easy-to-use
means of mathematical communication.
- Physics needs content search tools:
a search for the action S and a search for ∫ L dt should yield the same
documents, independent of which notation is written there.
- Physics needs very professional short
term archiving with powerful search tools that understand the full
physics content (new intelligent search engines).
- Long term archiving needs secure
retrieval without loss of information, even if the creator or the publisher is
no longer there. That calls for nonproprietary formats and protocols.
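The content search requirement in the list above (a query for S and a query for ∫ L dt returning the same documents) can be sketched as a lookup that maps equivalent notations to one concept before indexing and before querying. This is a toy illustration under strong assumptions: the table `EQUIVALENT_NOTATIONS`, the document identifiers and all function names are hypothetical, and a real system would need a curated vocabulary or parsed mathematical markup rather than plain strings.

```python
# Hypothetical table of equivalent notations for one concept. Context is
# ignored here: in a real system "S" could also denote, e.g., entropy.
EQUIVALENT_NOTATIONS = {
    "action": ["S", "∫ L dt", "action integral"],
}


def normalize(term):
    """Collapse runs of whitespace; case is kept because symbols are case-sensitive."""
    return " ".join(term.split())


# Reverse lookup: surface form -> concept identifier.
CONCEPT_OF = {
    normalize(form): concept
    for concept, forms in EQUIVALENT_NOTATIONS.items()
    for form in forms
}


def concept_of(term):
    """Map a term to its concept, or to itself if no equivalence is known."""
    key = normalize(term)
    return CONCEPT_OF.get(key, key)


def build_index(documents):
    """Index documents by concept instead of by the notation they use."""
    index = {}
    for doc_id, terms in documents.items():
        for term in terms:
            index.setdefault(concept_of(term), set()).add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of all documents that mention the queried concept."""
    return sorted(index.get(concept_of(query), set()))


# Made-up documents, each listed with the terms it contains.
documents = {
    "doc-a": ["S", "Lagrangian"],
    "doc-b": ["∫ L dt"],
    "doc-c": ["entropy"],
}
index = build_index(documents)
print(search(index, "S"))
print(search(index, "∫ L dt"))
```

Both queries resolve to the same concept and therefore to the same result list, independent of which notation each document uses.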
Search independent of source and ownership is needed. Three policy lines are
emerging:
- Distributed queries sent out by a search
engine to all document providers (publishers).
The advantage of this approach is that it does not need any agreement or
consent by the document providers.
The disadvantage is that providers change their browsing pages and
their ways of presenting document information, so the wrappers have to
be readjusted from time to time.
- The Open Archives Initiative (OAi)
concept: OAi defines one unique set of rules for the gateway that exchanges
document information from document providers to service providers.
The advantages are: all document providers have the same nonproprietary
exchange protocol and set of metadata defined. They give consent for any
service provider to pull all sets of metadata. They consent to share all
software developed.
Thus the document databases complying with OAi are most easily and fully
accessible worldwide by anyone, which is what researchers want: their documents
are then most likely to be read.
The disadvantages are that the shared metadata set is pretty basic; also, there
are only a few OAi compliant service providers, since these compete through
their add-on services and are not willing to give their software away to
competitors. We assume that learned societies and university libraries are the
most willing to comply; see a good example by the University of Tübingen,
D3P.
- Bilateral negotiations of individual
document providers with service providers on the amount of exchanged
metadata and links.
The advantages are that individual and thus optimally fitting agreements can be
found, that the document provider (publisher) keeps all the rights, and that it
can define different agreements for different service providers, e.g. depending
on whether they are commercial or not-for-profit.
The key part of any such agreement is that the service provider promises not to
hand the sets of metadata transferred to it to any other service provider
without the explicit consent of the document provider.
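The second policy line above, in which a service provider pulls metadata from a data provider over one common protocol, can be sketched in Python. The sketch assumes the conventions of the OAI Protocol for Metadata Harvesting, version 2.0 (the `ListRecords` verb, the `oai_dc` metadata prefix and the XML namespaces), which may differ from the protocol version in use when this draft was written. The base URL, the sample response and the function names are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a service provider would request to pull all records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract identifier, titles and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [e.text for e in record.iterfind(".//dc:title", NS)],
            "creators": [e.text for e in record.iterfind(".//dc:creator", NS)],
        })
    return records


# Made-up response from a hypothetical data provider, for illustration only.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:archive.example.org:1</identifier>
        <datestamp>2002-03-19</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
          <dc:creator>Example, Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = list_records_url("https://archive.example.org/oai")
records = parse_records(SAMPLE_RESPONSE)
print(url)
print(records)
```

Because every compliant data provider answers the same request in the same format, one parser serves all of them. This is the contrast with the first policy line, where each provider needs its own wrapper that must be readjusted whenever the provider changes its presentation.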
References
- EPS acts as document provider through
PhysDoc, distributing and retrieving
document information from the documents and lists of documents of about
2,000 physics institutions, thus about 100,000 documents, with 40,000 entries,
all located at the physics institution of the respective author.
- EPS also acts as OAi compliant service provider through
PhysDoc-OAD.
- Long Term Archiving of Digital Documents in Physics;
IUPAP Workshop, Lyon, 4.-6. November 2001.
- Eberhard R. Hilf;
Physics Archiving:
Requirements, Perspectives, and some Approaches in Germany;
Session 4, The View of End-User Physicists;
IUPAP workshop LTADP, Lyon, 11.2001;
http://physnet.uni-oldenburg.de/~hilf/vortraege/lyon01/