Semantic interpretation of files
Background
There are thousands of different file formats and, to make matters worse, they can be nested inside each other. For example, a Debian GNU/Linux package is an ar archive that contains, among other things, a gzip-compressed tar archive holding the packaged files. One of these files might be an OpenOffice.org document, which is simply a zip archive with several XML text files and other embedded files such as PNG images or films. Such a film may in turn consist of an MPEG-2 video and several Ogg Vorbis audio tracks in a QuickTime (.mov) container. The package file itself may be located on an ISO 9660 file system, which is provided via a loopback device from a file on an ext3 file system, which in turn sits on a RAID 5 array spanning several SATA hard drives.
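The nesting described above can be illustrated with a minimal Python sketch that recognises a few of the named formats by their leading signature bytes and unwraps one gzip layer. The function names `identify` and `identify_nested` are hypothetical and chosen only for this illustration; real detection needs far more than a prefix check (tar and ISO 9660, for instance, keep their signatures at a non-zero offset).

```python
import gzip

# Well-known leading signatures ("magic numbers") of a few formats named above.
SIGNATURES = [
    (b"!<arch>\n", "ar archive"),
    (b"\x1f\x8b", "gzip stream"),
    (b"PK\x03\x04", "zip archive"),
    (b"\x89PNG\r\n\x1a\n", "png image"),
    (b"OggS", "ogg container"),
]


def identify(data: bytes) -> str:
    """Return a format name based only on the first bytes of the data."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def identify_nested(data: bytes) -> list[str]:
    """Identify the outer format and look inside gzip layers (hypothetical helper)."""
    layers = []
    while True:
        name = identify(data)
        layers.append(name)
        if name != "gzip stream":
            return layers
        # A gzip stream wraps exactly one payload, so unwrap it and continue.
        data = gzip.decompress(data)


if __name__ == "__main__":
    inner = b"\x89PNG\r\n\x1a\n" + b"payload"
    print(identify_nested(gzip.compress(inner)))  # ['gzip stream', 'png image']
```

Each additional container format (ar, tar, zip, file systems) would need its own unpacking step, which is why the number of format combinations grows so quickly.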
At the lowest level, all of this data is stored as bits, zeros and ones, which can only be interpreted correctly with the necessary contextual knowledge. Troubleshooting is therefore particularly time-consuming: the formats are often described only in prose, and whoever is hunting for an error has to work through that description painstakingly by hand.
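To illustrate why context matters, the following sketch decodes the same four bytes under three different assumptions about type and byte order, using Python's standard `struct` module. The helper name `interpretations` is made up for this example.

```python
import struct


def interpretations(raw: bytes) -> dict:
    """Decode the same 4 bytes under different assumptions (illustrative helper)."""
    return {
        "uint32 little-endian": struct.unpack("<I", raw)[0],
        "uint32 big-endian": struct.unpack(">I", raw)[0],
        "float32 little-endian": struct.unpack("<f", raw)[0],
    }


if __name__ == "__main__":
    raw = b"\x00\x00\x80\x3f"
    for label, value in interpretations(raw).items():
        print(f"{label}: {value}")
    # uint32 little-endian: 1065353216
    # uint32 big-endian: 32831
    # float32 little-endian: 1.0
```

Nothing in the bytes themselves says which reading is the right one; that knowledge has to come from the format description.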
Task description
This work is to build on the results of a previous project. In addition to extending the programme to cover further file formats, it needs to be revised so that it better supports certain data types, such as enumerations and character strings. Integration or co-operation with Strigi and Nepomuk is also possible.
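As an illustration of what enumeration and string fields look like in practice, the sketch below decodes two such fields from a gzip member header as laid out in RFC 1952: the one-byte operating system code (an enumeration) and the optional zero-terminated original file name (a string). This is not the design of the existing programme; `GzipOS` and `parse_gzip_header` are hypothetical names, and only a subset of the OS codes is listed.

```python
import enum
import struct


class GzipOS(enum.IntEnum):
    """Subset of the operating system codes defined in RFC 1952."""
    FAT = 0
    UNIX = 3
    MACINTOSH = 7
    NTFS = 11
    UNKNOWN = 255


FEXTRA = 0x04  # flag bit: an extra field follows the fixed header
FNAME = 0x08   # flag bit: a zero-terminated original file name is present


def parse_gzip_header(data: bytes):
    """Return (os, name) decoded from a gzip header (illustrative helper)."""
    # Fixed 10-byte header: magic, method, flags, mtime, extra flags, OS code.
    magic, _method, flags, _mtime, _xfl, os_code = struct.unpack_from("<2sBBIBB", data, 0)
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    try:
        os_value = GzipOS(os_code)
    except ValueError:
        os_value = os_code  # keep the raw number for codes not listed above
    offset = 10
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, offset)
        offset += 2 + xlen  # skip the extra field
    name = None
    if flags & FNAME:
        end = data.index(b"\x00", offset)
        name = data[offset:end].decode("latin-1")  # RFC 1952 specifies ISO 8859-1
    return os_value, name


if __name__ == "__main__":
    header = b"\x1f\x8b\x08\x08" + struct.pack("<I", 0) + b"\x00\x03" + b"hello.txt\x00"
    os_value, name = parse_gzip_header(header)
    print(os_value.name, name)  # UNIX hello.txt
```

Showing `UNIX` instead of `3` and `hello.txt` instead of a run of raw bytes is the kind of readability that better enumeration and string support is meant to provide.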