Optical Character Recognition
Revision as of 19:12, 5 May 2014
Transforming printed pages of Ancient Greek into fully searchable and editable Unicode text
We built upon the excellent open-source Tesseract OCR engine (https://code.google.com/p/tesseract-ocr), “training” it on different Ancient Greek character shapes, wordlists, and some basic grammar. Along the way, we found and fixed several bugs in Tesseract and significantly improved the project's documentation. We also developed a suite of training and OCR testing tools, which have been released under an open-source license and have been used by several other people working to improve OCR in different languages.
The end result is a high-quality OCR engine for Ancient Greek, with accuracy generally between 90% and 96% for average-quality page scans of old printed volumes. Because it leverages the Tesseract OCR code, our work can be used in a wide variety of settings, from server clusters (as Bruce Robertson has done with his Heml Text Mining project, http://heml.mta.ca/) to apps on smartphones (such as the Text Fairy Android app, https://play.google.com/store/apps/details?id=com.renard.ocr).
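To make an accuracy figure like the one above concrete, here is a minimal sketch of one common way to score OCR output against a hand-corrected reference: character accuracy based on edit distance. This is an illustration only and not necessarily how the figures above were measured. The function names `levenshtein` and `character_accuracy` are hypothetical, and normalising both texts to Unicode NFC is an assumption made here so that precomposed and decomposed Greek diacritics compare as equal.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Return 1 - (edit distance / reference length), clamped at 0.

    Both strings are normalised to NFC first (an assumption of this
    sketch) so that equivalent encodings of accented letters match.
    """
    ref = unicodedata.normalize("NFC", reference)
    out = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1.0 - levenshtein(ref, out) / len(ref))
```

With this measure, a single wrong character in a ten-character reference gives an accuracy of 0.9. Word-level accuracy is usually lower than character-level accuracy for the same page, so it matters which of the two a reported figure refers to.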
Downloads, usage instructions, and more information can be found at the Ancient Greek OCR website (http://ancientgreekocr.org/).