Recently, ODW’s Art Rhyno consulted on the newspaper digitization pilot project for the National Heritage Digitization Strategy. Our years of work on newspaper digitization have included specialized work with open-source tools to perform text recognition on underrepresented languages. Digitizing non-Latin scripts can be difficult with commercial tools, but open-source software can be customized for almost any syllabic script and vocabulary.
Art worked on Inuktitut syllabic recognition for the Multicultural History Society of Ontario a few years ago and has been sharing the language files and modifying code for use with the open-source text recognizer Tesseract ever since.
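For readers curious what using custom language files with Tesseract can look like in practice, here is a minimal sketch in Python. It assumes the `tesseract` command-line tool is installed and that a trained language file sits in a local folder. The language code `iku`, the folder name, and both helper functions are illustrative assumptions, not Art's actual setup.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="iku", tessdata_dir=None):
    """Build the argument list for a Tesseract OCR run.

    `lang` names a traineddata file (for example iku.traineddata). The
    default "iku" is an assumed example, not necessarily the language
    files described in this post.
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Point Tesseract at a folder holding custom language files.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(image, output_base, **kwargs):
    """Run Tesseract, which writes recognized text to <output_base>.txt."""
    cmd = build_tesseract_command(image, output_base, **kwargs)
    subprocess.run(cmd, check=True)  # raises if Tesseract reports an error
    return Path(str(output_base) + ".txt")
```

Calling `run_ocr("page_001.tif", "page_001", tessdata_dir="custom_tessdata")` would, under these assumptions, write the recognized text to `page_001.txt`. Keeping the command construction separate from the call makes it easy to check the arguments before processing a large batch of newspaper pages.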
A few other languages digitized during the process:
Art also has suggestions about how to work Tesseract into a whole digitization workflow with other open-source tools, and we’re always available to answer questions if you have similar projects.