Recently, ODW’s Art Rhyno consulted on the newspaper digitization pilot project for the National Heritage Digitization Strategy. Our years of work on newspaper digitization have included specialized work with open-source tools to perform text recognition on underrepresented languages. Non-Latin scripts can be difficult to digitize with commercial tools, but open-source software can be customized to handle almost any set of syllabics and words.

Art worked on Inuktitut syllabic recognition for the Multicultural History Society of Ontario a few years ago, and has been sharing the language files and modifying code for use with the open-source text recognizer Tesseract ever since.
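For readers curious what using a shared language file looks like in practice, here is a minimal sketch that calls the Tesseract command-line tool from Python. It is an illustration only, not Art's actual code: the function names, the file paths, and the `iku` language code are assumptions, and the code to use depends on the name of the language file you have installed.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="iku", tessdata_dir=None):
    """Return the argument list for one Tesseract command-line run.

    `lang` must match the name of an installed .traineddata file;
    "iku" is an assumed name for an Inuktitut language file.
    """
    command = ["tesseract", str(image_path), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Point Tesseract at a folder holding custom language files.
        command += ["--tessdata-dir", str(tessdata_dir)]
    return command


def recognize_page(image_path, output_base, lang="iku", tessdata_dir=None):
    """Run Tesseract on one page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, lang, tessdata_dir)
    subprocess.run(command, check=True)
    # By default Tesseract writes plain text to <output_base>.txt.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `recognize_page("page_001.png", "page_001", tessdata_dir="my_tessdata")` with hypothetical file names would run Tesseract on one scanned page using the language files in that folder.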


A few other languages digitized during the process:


Art also has suggestions about how to fit Tesseract into a complete digitization workflow alongside other open-source tools, and we’re always available to answer questions if you have similar projects.