Using open-source OCR tools to digitize Indigenous print


Recently, ODW’s Art Rhyno consulted on the newspaper digitization pilot project for the National Heritage Digitization Strategy. Our years of newspaper digitization work have included specialized use of open-source tools to perform text recognition on underrepresented languages. Non-Latin scripts can be difficult to digitize with commercial tools, but open-source software can be customized to handle almost any syllabic characters and vocabulary.

Art worked on Inuktitut syllabic recognition for the Multicultural History Society of Ontario a few years ago, and has since been sharing the language files and modifying code for use with the open-source text recognizer Tesseract.
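
For readers curious what pointing Tesseract at a shared language file can look like, here is a minimal sketch in Python. It is not Art's actual code. It assumes the `tesseract` command-line program is installed, and the function names, file paths, and the `models` folder are hypothetical. The language code `iku` is the one used by Tesseract's stock Inuktitut data; a custom-trained file may carry a different name.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="iku", tessdata_dir=None):
    """Build the argument list for one Tesseract run.

    lang is the name of the .traineddata file without its extension.
    tessdata_dir (optional) is the folder holding that file, for example
    a folder of custom language files (hypothetical here).
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def ocr_page(image, output_base, lang="iku", tessdata_dir=None):
    """Run Tesseract on a single page image and return the text file path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image, output_base, lang, tessdata_dir)
    subprocess.run(cmd, check=True)
    # By default Tesseract writes plain text to <output_base>.txt
    return Path(str(output_base) + ".txt")
```

A call such as `ocr_page("page-001.tif", "out/page-001", tessdata_dir="models")` would then write the recognized text to `out/page-001.txt`, provided the `out` folder exists and `models` contains `iku.traineddata`.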

A few other languages digitized during the process:

Art also has suggestions for fitting Tesseract into a complete digitization workflow alongside other open-source tools, and we’re always available to answer questions if you have a similar project.

