Long Term Archiving of Digital Documents in
Physics
IUPAP Workshop at Lyon, 4.-6. November 2001
Session 4: The View of End-User Physicists
Physics Archiving:
Requirements, Perspectives, and some Approaches in Germany
Eberhard R. Hilf, Institute for Science Networking ISN;
Oldenburg, Germany
The paper will
be out by 15th November 2001.
http://physnet.uni-oldenburg.de/~hilf/vortraege/lyon01/
Abstract
It is
argued that what Physics primarily needs for long term archiving is not the
original documents, peer reviewed or not, but on the one side the full content of actual
program codes and experimental data, and on the other side the physics content of documents, in as much as the
author, his/her institution, or the physics community wants it archived. This
calls for a large effort in reconditioning information by quality filtering,
such as multiple levels of vetting, peer reviewing, commenting, condensing, and
especially rewriting the findings and results in a way that the content can be
effectively used in actual research and teaching.
Content
- Keypoints for LTADDP as seen from the
author/user
- Requirements as seen from the user and the
author
- Some approaches for long term archiving in
Germany
- What we do at the Institute for Science
Networking
Keypoints for Long Term
Archiving in Physics as seen from author/user
- Analysis:
- Physics is the science of the laws of nature,
explored by quantitative experiments and mathematical modelling.
- Knowledge needs to be archived in the
best applicable, retrievable, complete, understandable, effective,
professional, and quality filtered forms, not just as the authors
handed it in originally.
- Research is served mostly by
communication means: alerting, preprints, recent reviewed articles, and
professional search engines. Thus there is no apparent need for long term archiving in physics, but
in the heat of communication we do not know which part will finally be
picked as worth archiving, so the creator of a document already
has to assure the standards necessary for LTADDP.
- LTADDP is needed mostly for
person-related reasons:
- history of science
- hall of fame
- examination works as entry of
professional life
- Actions for LTADDP:
- Physics societies urge their members to
recondition their past contributions to physics into a condensed,
compact, complete, most accurate, readable, and re-usable form.
- Publishers stimulate authors to write
present knowledge in a condensed, …… form.
- Libraries teach, and enforce the usage of
metadata.
- Research institutions support development
of intelligent content search and retrieval.
- International agreements are needed to
standardize the workflow, the multiple quality filter steps, and their
metadata flow, for the scheme running from "first publish, then evaluate" to the
final condensed long term archiving.
- Enforce a copyright policy that the first
publication has to be open, either by the
author through a preprint server or by his/her institution and its library.
- Assure only open non-proprietary document
formats for handed in documents.
- Some steps in Germany and Europe
- Metadata: The IuK initiative (information and
communication) of the learned societies promotes strict use of metadata
and works on practical tools:
the MyMetaMaker web form helps the author or library to add correct DC metadata
to the document, using RDF and XML encoding.
- OAi
(Open Archives Initiative): DINI, the
German Initiative for Network-Information, appealed to the university
libraries to serve as OAi document providers and organized classes, where we
taught how to install
OAi compliant data and service providers. Our ISN OAi compliant
document provider was approved on 13th April 2001. Our OAi
service provider is online. The best OAi service provider is the one run by the
University Library of Tübingen.
- A successful keyword-matching tool between
Physics (PACS) and Mathematics (MSC) is now online (Project: Carmen).
- Multilevel quality filters: GAP (German Academic Publishers) is to
have a fine grid of quality filters (from author, group, institute,
university, peer reviewing).
- Long Term Archiving for Theses
(Dissertations): In Germany a full scheme has been set up for all learned
fields, covering the complete workflow from candidate through faculty to
university library to LTA by the National Library (DDB), with a mutually
agreed set of rules for acceptable formats (with the original file kept)
and for metadata (who is allowed to enter which metadata, and how).
- Portals
to Physics are set up by the DPG in collaboration with FIZ Karlsruhe,
TIB, and ISN. Responsibility is with the new DPG workgroup Information.
- Responsibility stays with the author and
his/her institution: A distributed
document provider, PhysDoc,
has been operating for the EPS for some years, with surf-or-search access to about
100,000 documents in physics residing at the websites of about 2,000
physics institutions worldwide, served by the EPS through an
international consortium of technicians, led by ISN.
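The metadata item in the list above (a web form that attaches DC metadata to a document, encoded in RDF and XML) can be illustrated with a minimal sketch. This is not the MyMetaMaker code, which is not shown here. The function name `dc_record` and the choice of fields are assumptions made for illustration; only the two namespace URIs are the standard RDF and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and the Dublin Core element set.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def dc_record(about, fields):
    """Return an RDF/XML string describing one document with DC elements.

    `about` is the document URL; `fields` maps DC element names
    (e.g. "title") to a string or a list of strings.
    Hypothetical helper, not part of any existing tool.
    """
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": about}
    )
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            ET.SubElement(desc, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Example record built from the citation data of this document.
record = dc_record(
    "http://physnet.uni-oldenburg.de/~hilf/vortraege/lyon01/",
    {
        "title": "Physics Archiving: Requirements, Perspectives, "
                 "and some Approaches in Germany",
        "creator": "Hilf, Eberhard R.",
        "date": "2001-11-05",
    },
)
print(record)
```

Because the record is built through an XML library rather than by string concatenation, special characters in titles are escaped automatically.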
Some additions:
- Physics research and teaching need
information exchange that is full, instant, and reusable, without financial barriers
or proprietary gateways.
- Physics needs interactive, easy-to-use
means of mathematical communication.
- Physics needs content search tools:
a search for S and a search for ∫L dt should yield the same
documents, independent of which of the two is written there.
- Physics needs very professional short
term archiving with powerful search tools that understand the full
physics content (new intelligent search engines).
- Long term archiving needs secure
retrieval without loss of information, even if the creator or his/her publisher is no
longer there. That calls for nonproprietary formats and protocols.
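The content search requirement above (two notations for the same quantity should retrieve the same documents) can be sketched with a small concept table. This is a toy illustration, assuming that S and ∫L dt both denote the action; the table `CONCEPTS`, the sample texts, and the function names are all made up for the example.

```python
import re

# Hypothetical concept table: every notation or phrase for the same physics
# concept maps to one canonical key. The entries are illustrative assumptions.
CONCEPTS = {
    "action": ["S", "∫L dt", "action integral"],
}

# Made-up sample documents.
DOCUMENTS = {
    "doc1": "The variation of S vanishes on the classical path.",
    "doc2": "We require that ∫L dt be stationary.",
    "doc3": "Heat conduction in thin films.",
}


def concepts_in(text):
    """Return the canonical concepts whose notation occurs in `text`."""
    found = set()
    for concept, notations in CONCEPTS.items():
        for notation in notations:
            # Match the notation only as a standalone token, so that the
            # letter S inside an ordinary word does not count.
            pattern = r"(?<!\w)" + re.escape(notation) + r"(?!\w)"
            if re.search(pattern, text):
                found.add(concept)
                break
    return found


def search(query, documents=DOCUMENTS):
    """Return ids of documents that share a concept with the query."""
    wanted = concepts_in(query)
    return sorted(
        doc_id for doc_id, text in documents.items()
        if wanted & concepts_in(text)
    )


print(search("S"))
print(search("∫L dt"))
```

Both queries are mapped to the same concept before matching, so they return the same documents. A real engine would also need context to tell the action S from, for example, the entropy S.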
OAD: Progress in
OpenArchive Approach and Status
PhysDoc is
by now an OAi compliant data and service provider, based on the concept that the
authors' institutions archive the documents themselves, with a connection to
the physics societies' repositories (IoPP) for the prime refereed papers as long
term archive.
What we do at the Institute
for Science Networking
Search
independent of source and ownership is needed. Three policy lines are emerging:
- Distributed queries, sent out by the search
engine to all document providers (publishers).
Technically,
a wrapper has to be written for each document provider. The provider's output is
then parsed for the retrieval.
MetaPhys by P. Borrmann has been a good example,
collecting from all major physics publishers.
The
advantage of this way is that it does not need any agreement or consent by the
document providers.
The
disadvantage of this way is that providers change the appearance of their web pages
and their ways of presenting document information, so the wrappers have to
be readjusted from time to time.
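The wrapper approach described above can be sketched as follows. This is not the MetaPhys code: the two providers, their page layouts, and all function names are hypothetical, and canned HTML strings stand in for the pages a real wrapper would fetch over HTTP.

```python
import re

# Canned result pages standing in for what two hypothetical providers return.
PAGE_A = (
    '<ul><li><a href="/a/1">Spin waves in thin films</a></li>'
    '<li><a href="/a/2">Spin glasses</a></li></ul>'
)
PAGE_B = (
    '<table><tr><td class="t">Spin chains</td>'
    '<td class="u">/b/7</td></tr></table>'
)


def wrap_provider_a(html):
    """Wrapper for provider A: each hit is a link inside a list item."""
    return [
        {"title": title, "url": url, "source": "A"}
        for url, title in re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)
    ]


def wrap_provider_b(html):
    """Wrapper for provider B: title and URL sit in two table cells."""
    pattern = r'<td class="t">([^<]+)</td><td class="u">([^<]+)</td>'
    return [
        {"title": title, "url": url, "source": "B"}
        for title, url in re.findall(pattern, html)
    ]


# One (fetch, wrap) pair per provider. The fetch functions here ignore the
# query and return the canned pages; a real one would send the query out.
WRAPPERS = [
    (lambda query: PAGE_A, wrap_provider_a),
    (lambda query: PAGE_B, wrap_provider_b),
]


def distributed_search(query):
    """Send the query to every provider and merge the parsed hits."""
    hits = []
    for fetch, wrap in WRAPPERS:
        hits.extend(wrap(fetch(query)))
    return hits


for hit in distributed_search("spin"):
    print(hit["source"], hit["title"], hit["url"])
```

The sketch also shows the stated disadvantage: each wrapper depends on the exact markup of one provider, so any change of layout on the provider's side breaks its pattern.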
- The Open Archives Initiative (OAi)
concept: OAi defines one unique set of rules for the gateway through which
document providers exchange document information with service providers.
The
advantages are: all document providers have the same nonproprietary exchange
protocol and set of metadata defined. They give consent to pull all sets of
metadata by any service provider. They consent to share all software developed.
Thus the document databases complying with OAi are most easily and fully
accessible worldwide by anyone, which is what researchers want, and their documents are the most likely to be read.
The
disadvantages are that the shared metadata set is pretty basic; also, there are
only a few OAi compliant service providers, since these compete through their add-on
services and are not willing to give their software away to competitors. We
assume that learned societies and university libraries are the most willing
to comply; see the good example of the University of Tübingen, D3P.
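The exchange between a document provider and a service provider can be sketched as a harvesting step. The sketch assumes the request form and XML namespaces of the later OAI-PMH 2.0 specification, which differ from the protocol version in use at the time of this talk. The base URL, the sample response, and the function names are hypothetical, and a canned response replaces the network call.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces as defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Canned response in the shape of a ListRecords answer (hypothetical content).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample physics preprint</dc:title>
          <dc:creator>Doe, J.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request a service provider sends to pull all records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Extract identifier, title, and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "creators": [e.text for e in rec.iterfind(".//dc:creator", NS)],
        })
    return records


print(list_records_url("http://example.org/oai"))
print(parse_records(SAMPLE_RESPONSE))
```

Since every compliant provider answers the same request in the same format, one parser serves all of them, in contrast to one wrapper per provider in the distributed query approach.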
- Bilateral negotiations of individual
document providers with service providers on the amount of exchanged
metadata and links.
The
advantages are that individual and thus optimally fitting agreements can be
found, that the document provider (publisher) keeps all the rights, and that it can
define different agreements for different service providers, e.g. depending on whether they
are commercial or not-for-profit.
The
key part of any such agreement is that the service provider promises not to
hand the sets of metadata transferred to it on to any other service provider
without the explicit consent of the document provider.
- We act as document provider through PhysDoc, distributing and retrieving
document information from the multitude of about 2,000 documents and
lists of documents, thus about 100,000 documents, with 40,000 entries, each
located at the individual physics institution of its author.
We
act as service provider, via MetaPhys, using servlets.
- For OAi we act as document provider,
extracting from the distributed PhysDoc system those about 1,000 documents
which have correct metadata. We were officially registered as OAi
compliant on 13th April 2001.
We
also act as OAi compliant service provider PhysDoc-OAD, to be officially registered next month. Here,
at PhysDoc-OAD we serve the retrieval of
- about 1,000 documents of PhysDoc
- about 50,000 documents of arXiv (2001,
2000 and some more),
- about 90,000 documents of IoPP
In total,
at present we serve about 120,000 documents, and increasing. Of these, we serve the
document information of virtually all IoPP journals. The retrieval leads to the
repository of IoPP, and it then depends on the arrangement between the user and the
respective IoPP journal whether the full document is retrieved or not.
Cooperation
with Publishers
We praise
the smooth, competent and fruitful cooperation with the Institute of Physics
Publishing, UK, which is useful for both sides:
- PhysDoc II increases the reading of documents from the IoPP journal stack.
- It serves these in the context of related documents of other
providers, thereby focussing on the competition in quality and add-on services
of providers.
- During the transfer of the 90,000 metadata files, only about 100
were found to have incorrect XML encoding (mostly a misunderstanding of
mathematical symbols which have a different meaning in XML). These were corrected at
ISN (one week of work) and transferred to IoPP for future usage.
- The concept of bilateral agreements on
metadata stack exchange allows a professional adaptation to the individual
needs of the partners.
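The XML encoding problem mentioned above (mathematical symbols that have a different meaning in XML) can be illustrated with a short sketch. It is an assumption that characters such as `<` and `&` were the cause; the sample title and the helper name `is_well_formed` are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if `xml_text` parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A title with symbols that XML reserves for markup (made-up example).
title = "Scaling for T < T_c in E & B fields"

raw = "<title>" + title + "</title>"            # symbols inserted verbatim
fixed = "<title>" + escape(title) + "</title>"  # symbols escaped first

print(is_well_formed(raw))
print(is_well_formed(fixed))
print(fixed)
```

Running such a check over every metadata file before transfer would flag the affected records automatically, and `escape` shows the kind of correction needed.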
Keyword
mapping
The theory of
one field calls for tools to find the matching keywords of the other, and vice versa.
Here are some examples which our code gave and which best satisfied the users, though you might not have expected them.
- MSC 78A60 Lasers, masers, optical bistability, nonlinear optics <=> PACS 03.30.+p Special relativity
- PACS 62.30.+d Mechanical and elastic waves; vibrations <=> MSC 74S15 Boundary element methods
- PACS 03.65.-w Quantum mechanics <=> MSC 47-XX Operator theory
- MSC 76M10 Finite element methods <=> PACS 44.10.+i Heat conduction
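How a service could use such pairs once they are known can be sketched as a two-way lookup. The table below holds only the four pairs quoted above; it is not the matching algorithm of the Carmen project, which is not described here, and the function names are hypothetical.

```python
from collections import defaultdict

# The four pairs quoted in the text, as (PACS code, MSC code).
PAIRS = [
    ("03.30.+p", "78A60"),
    ("62.30.+d", "74S15"),
    ("03.65.-w", "47-XX"),
    ("44.10.+i", "76M10"),
]


def build_index(pairs):
    """Build lookup tables for both directions of the mapping."""
    pacs_to_msc = defaultdict(list)
    msc_to_pacs = defaultdict(list)
    for pacs, msc in pairs:
        pacs_to_msc[pacs].append(msc)
        msc_to_pacs[msc].append(pacs)
    return pacs_to_msc, msc_to_pacs


PACS_TO_MSC, MSC_TO_PACS = build_index(PAIRS)


def map_keyword(code):
    """Return the matching codes of the other scheme, or [] if unknown."""
    if code in PACS_TO_MSC:
        return list(PACS_TO_MSC[code])
    return list(MSC_TO_PACS.get(code, []))


print(map_keyword("03.65.-w"))
print(map_keyword("76M10"))
```

Lists are used as values because one code of either scheme may map to several codes of the other.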
Citation, copyright and link of
this document
This document may be copied, distributed,
downloaded in any way, even for talks and slides, or quoted, as long as its
content is not changed in any way and the source is correctly cited as
E. R. Hilf; Physics Archiving: Requirements, Perspectives, and some
Approaches in Germany;
http://physnet.uni-oldenburg.de/~hilf/vortraege/lyon01/;
at Long Term Archiving of Digital Documents in Physics, IUPAP Workshop
5.Nov.2001.