  • Oldenburg scientist Prof Dr Jörg Lücke and his colleagues have developed an automatic method for text cleaning. Photo: iStock/popovaphoto, Editing: Per Ruppel

Making the illegible legible - even in Klingon

Blotted, scribbled on or mouldy in the archive: Oldenburg scientist Prof Dr Jörg Lücke is researching how soiled and therefore unreadable texts can be cleaned automatically. In future, the newly developed text cleaning software could do a lot more.

In the October issue of the renowned journal IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Lücke and his Sheffield colleague Dr Zhenwen Dai have published their findings on how a computer, scanner and printer can be used as a "washing machine" for texts. Statistics are the key to successful cleaning: letters, for example in a newspaper article, are regular, repeating patterns, whereas dirt patterns such as coffee or ink stains very rarely look the same twice. The newly developed computer programme first looks at a soiled text many times and learns which regularly repeating patterns (i.e. letters) it consists of. It then memorises the "cleanest" examples of each letter and uses them to replace every individual letter in the text step by step. The result is a clean text.
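The statistical idea described above can be illustrated with a small toy sketch. This is not the probabilistic model published by Lücke and Dai; it is a much simpler stand-in (nearest-prototype clustering with a pixel-wise majority vote) that only shows why repeating patterns survive while non-repeating dirt is voted away. It assumes that the text has already been cut into equally sized binary patches and that the number of distinct letters `k` is known. All names (`clean_patches`, `BAR`, `DASH`, `add_dirt`) are hypothetical.

```python
def hamming(a, b):
    """Number of pixels at which two equally sized binary patches differ."""
    return sum(x != y for x, y in zip(a, b))


def majority_prototype(group):
    """Pixel-wise majority vote: dirt rarely repeats, so it gets outvoted."""
    n = len(group)
    return tuple(1 if 2 * sum(column) > n else 0 for column in zip(*group))


def clean_patches(patches, k, iterations=5):
    """Toy cleaner (an illustration, not the published method): group the
    patches into k 'letters' and replace each patch by its group's prototype."""
    # Farthest-point initialisation: start with the first patch, then keep
    # adding the patch that is farthest from every centre chosen so far.
    centres = [patches[0]]
    while len(centres) < k:
        centres.append(
            max(patches, key=lambda p: min(hamming(p, c) for c in centres))
        )

    labels = []
    for _ in range(iterations):
        # Step 1: assign every patch to the most similar prototype.
        labels = [
            min(range(k), key=lambda j: hamming(p, centres[j])) for p in patches
        ]
        # Step 2: re-estimate each prototype as the "cleanest" version of
        # its group, here a majority vote over all members.
        for j in range(k):
            members = [p for p, lab in zip(patches, labels) if lab == j]
            if members:
                centres[j] = majority_prototype(members)

    # Step 3: replace every soiled patch by its learned prototype.
    return [centres[lab] for lab in labels]


# Two hypothetical 5x5 "letters", flattened row by row.
BAR = (0, 0, 1, 0, 0) * 5                    # vertical stroke
DASH = (0,) * 10 + (1,) * 5 + (0,) * 10      # horizontal stroke


def add_dirt(glyph, position):
    """Flip a single pixel to imitate a stain at the given position."""
    dirty = list(glyph)
    dirty[position] ^= 1
    return tuple(dirty)


# Every copy of a letter is soiled in a different place.
dirty_text = [add_dirt(BAR, (7 * i) % 25) for i in range(10)] + [
    add_dirt(DASH, (7 * i + 3) % 25) for i in range(10)
]
cleaned = clean_patches(dirty_text, k=2)
```

In this toy setting no letter alphabet is given in advance: the two prototypes are learned from the soiled patches alone, which mirrors the language independence described in the article. Real scans would additionally require locating the letters and handling far heavier soiling, which this sketch does not attempt.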

The newly developed text cleaning software is the result of a project funded by the German Research Foundation (DFG), in which researchers from the Universities of Oldenburg, Frankfurt am Main, Sheffield (UK) and the Technical University of Berlin are involved. Around half a million euros in funding has so far been pledged for the work entitled "Non-linear probabilistic models for representation-based recognition and unsupervised learning on visual data".

A particular highlight is that the method does not depend on the language or alphabet of the text: because the programme first learns the letters itself, it also works with text in, for example, the fictional language Klingon (from the TV series "Star Trek"). Another difference from commercially available text recognition programmes is its ability to deal with particularly heavy soiling.

One remaining challenge is the large amount of computing capacity required, as project leader Lücke reports: "Due to the enormous computing effort, we can currently only handle very small alphabets, and even then we need a computing cluster with 15 graphics processors to achieve the results presented." However, the primary goal of the research was not direct applicability but a fundamental test of the new method. In future, automatic text recognition programmes or software for restoring old journal texts could benefit from it.

Lücke also sees potential benefits of the results for the recognition of spoken language and the analysis of medical image data. "In both cases, severe 'contamination' in the form of noise and signal distortions currently poses the greatest challenge." One example of this is the often poor performance of current speech recognition programmes in the presence of background noise. "With our new method, we now have a tool to tackle these challenges."

(Changed: 21 May 2026)  Shortlink: https://uol.de/p82n781en

This page contains automatically translated content.