OCR for 17th- and 18th-century printed works

Is it possible to train an AI model, or build a tool, to recognize printed text from the 17th and 18th centuries?

I’m a librarian for an orchestra that plays mostly early music. Part of my job is to prepare scores by copying (and modernizing) prints from the 17th and 18th centuries. When working on an opera I often need its text, and frequently no modern edition of the libretto is available, only scans of the original prints (like this one: https://www.loc.gov/resource/musschatz.19874.0?st=gallery ).

I tried "classic" OCR tools: the built-in feature in PDF Expert, the one built into Preview on the Mac (which was better), and even Rescribe, which is supposedly specialized for this kind of document.

None of them gave good, or even passable, results: there were errors in almost every word.

My question is: is it possible to train a model on the typefaces used in these documents, and have it correct the output based not on the modern language but on the historical spelling and wording? And could it correct a word based on its context in the sentence (or the story)?

For instance, it was very common to use the long s (ſ), an elongated form of the letter, instead of our modern "s", and the OCR tools then recognize it as an "f" or a "/".
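To make the kind of correction I have in mind concrete, here is a minimal Python sketch of the long-s case only. It assumes a word list with period spellings is available; the function names and the tiny lexicon are hypothetical, made up for illustration. It reads every "f" as either "f" or "s", reads every "/" as "s", and accepts a correction only when exactly one candidate is in the lexicon.

```python
from itertools import product


def long_s_candidates(token: str) -> list[str]:
    """List every spelling obtained by reading each 'f' as 'f' or 's'
    and each '/' as 's' (a slash is never a valid letter in a word)."""
    options = []
    for ch in token:
        if ch == "f":
            options.append(("f", "s"))
        elif ch == "/":
            options.append(("s",))
        else:
            options.append((ch,))
    return ["".join(combo) for combo in product(*options)]


def correct_token(token: str, lexicon: set[str]) -> str:
    """Return the single lexicon match if there is exactly one.

    Unknown or ambiguous tokens are returned unchanged, so they can be
    left for a human or a context-aware step to resolve.
    """
    matches = [c for c in long_s_candidates(token) if c.lower() in lexicon]
    return matches[0] if len(matches) == 1 else token


# Hypothetical lexicon; a real one would hold period spellings.
lexicon = {"passion", "suffer", "esprit", "fame", "same"}

for word in ["paffion", "fuffer", "e/prit", "fame"]:
    print(word, "->", correct_token(word, lexicon))
```

A word such as "fame" stays untouched here because "same" is also in the lexicon. Those ambiguous cases are exactly where correction based on the surrounding sentence would be needed.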

Could you point me toward an existing tool of this kind, or toward ways to build one myself?

submitted by /u/Envelki