Recently, ODW’s Art Rhyno consulted on the newspaper digitization pilot project for the National Heritage Digitization Strategy. Our years of work on newspaper digitization have included specialized work with open-source tools to perform text recognition on underrepresented languages. Non-Latin scripts can be difficult to digitize with commercial tools, but open-source software can be customized to handle almost any syllabic characters and vocabulary.
Art worked on Inuktitut syllabic recognition for the Multicultural History Society of Ontario a few years ago, and has been sharing the language files and modifying code for use with the open-source text recognizer Tesseract ever since.
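For readers curious what using such a language file looks like in practice, here is a minimal sketch in Python that calls the Tesseract command line on a scanned page. It is only an illustration, not Art's actual setup. The language code `iku`, the file names, and the helper functions `build_tesseract_command` and `ocr_page` are assumptions made for this example, and the shared language files may be named differently.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_tesseract_command(image: Path, output_base: Path, lang: str = "iku",
                            tessdata_dir: Optional[Path] = None) -> List[str]:
    """Build the command line for one page (hypothetical helper).

    "iku" is assumed to match a file named iku.traineddata; change it to
    whatever name your language file actually uses.
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Point Tesseract at a folder holding custom .traineddata files.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def ocr_page(image: Path, output_base: Path, lang: str = "iku",
             tessdata_dir: Optional[Path] = None) -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    cmd = build_tesseract_command(image, output_base, lang, tessdata_dir)
    subprocess.run(cmd, check=True)
    # Tesseract writes plain-text output to <output_base>.txt
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_page(Path("page.tif"), Path("page_out"))` would write `page_out.txt` and return its contents, provided Tesseract and a matching language file are installed.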
A few other languages digitized during the process:
Art also has suggestions about how to work Tesseract into a whole digitization workflow with other open-source tools, and we’re always available to answer questions if you have similar projects.
Thanks to Art Rhyno (OCR guru from the U.ofWindsor) we can #OCR Inuktitut syllabics using #tesseract – ᖁᔭᓐᓇᒦᒃ Art! pic.twitter.com/qUfHv0yhTh
— RG (@VirtualRielity) July 2, 2015