Optical Character Recognition

Revision as of 15:28, 24 June 2013

Transforming printed pages of Ancient Greek into fully searchable and editable Unicode text

We have developed Optical Character Recognition (OCR) software to turn printed pages of Ancient Greek into searchable, editable Unicode texts. This was challenging because of the large number of different characters, the variety in printing practices, and the small size of diacritics.

We built upon the excellent open source Tesseract OCR engine, “training” it on different Ancient Greek character shapes, wordlists, and some basic grammar. Along the way, we found and fixed several bugs in Tesseract and significantly improved the project's documentation. We also developed a suite of training and OCR testing tools, which have been released under an open source license and used by several other people working to improve OCR in different languages.

The end result is a high-quality OCR engine for Ancient Greek, with accuracy generally between 90% and 96% for average-quality page scans of old printed volumes that are now out of copyright. Because it builds on the Tesseract OCR code, our work can be used in a wide variety of settings, from server clusters, as Bruce Robertson is doing with his Heml Text Mining project (http://heml.mta.ca/), to “apps” on mobile phones, such as the Text Fairy Android app (https://play.google.com/store/apps/details?id=com.renard.ocr).
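The text above does not say how the accuracy figures were measured. As an illustration only, one common measure is character accuracy: one minus the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. The sketch below assumes that definition; the function names are hypothetical and are not part of our tools. Both strings are normalised to Unicode NFC first, so that a letter with its diacritics counts as the same character however it was encoded.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Assumed definition: 1 - edit_distance / len(reference), floored at 0."""
    ref = unicodedata.normalize("NFC", reference)
    out = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1.0 - edit_distance(ref, out) / len(ref))


if __name__ == "__main__":
    # Hypothetical example: the OCR output has lost one circumflex accent.
    reference = "μῆνιν ἄειδε θεά"
    ocr_output = "μηνιν ἄειδε θεά"
    print(f"{character_accuracy(reference, ocr_output):.1%}")
```

Under this definition a single dropped accent counts as one full character error, which is one reason small diacritics have a visible effect on Ancient Greek accuracy scores.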

There are several graphical applications, such as gImageReader (http://sourceforge.net/projects/gimagereader/) and Lector (http://code.google.com/p/lector/), that can be set up to use the OCR software on a desktop PC. These are quite straightforward to use once installed and correctly configured. We hope to make Ancient Greek OCR on the desktop even more straightforward in the future, by automating the tricky installation and configuration steps and providing simple packages for Windows and Mac OS X.
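For those who prefer to script the process rather than use a graphical application, the following is a minimal sketch of calling the Tesseract command-line program from Python. It assumes that the tesseract executable is on the PATH and that Ancient Greek training data is installed under the language code grc. The file names and function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="grc"):
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, output_base, lang="grc"):
    """Run Tesseract on one page image and return the path of the text file.

    Assumes the training data for `lang` is already installed; Tesseract
    itself appends ".txt" to the output base name.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang),
                   check=True)
    return Path(f"{output_base}.txt")


if __name__ == "__main__":
    # Hypothetical file names; show the command without running it.
    print(" ".join(build_tesseract_command("page-001.png", "page-001")))
```

Calling ocr_page("page-001.png", "page-001") would then leave the recognised Unicode text in page-001.txt, provided the assumptions above hold.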