Our team of developers has decades of collective experience working in collections management, digitization, and other heritage technologies. We’re always looking for like-minded, passionate developers to work with us on special projects and useful tools.

We have big plans to move our tools and infrastructure to an open source, community-based model. Part of that is making our work available in our GitHub repository. Follow us there or get in touch if you want to be involved!

Open data

Any search you perform on OurOntario.ca can be pulled in various metadata formats: Dublin Core, MODS, Solr, RSS, Atom, and now RDF! That’s data from over 200 organizations in a usable Linked Open Data format. We continue to add ways to make data accessible, whether it’s hosted in VITA or indexed in our search portals.
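As a rough illustration of what a developer might do with one of these feeds, here is a minimal Python sketch that reads the entries out of an Atom document using only the standard library. The sample feed, its titles and the example.org links are made up for this example; the actual fields and URLs returned by the portal are not described here and may differ.

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace (RFC 4287).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed; a real search feed may carry different fields.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example search results</title>
  <entry>
    <title>Example newspaper page</title>
    <link href="https://example.org/record/1"/>
    <updated>2022-01-15T00:00:00Z</updated>
  </entry>
  <entry>
    <title>Example photograph</title>
    <link href="https://example.org/record/2"/>
    <updated>2022-02-01T00:00:00Z</updated>
  </entry>
</feed>
"""


def parse_atom_entries(xml_text):
    """Return a list of dicts (title, link, updated) for each Atom entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link_el = entry.find("atom:link", ATOM_NS)
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "link": link_el.get("href") if link_el is not None else None,
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
        })
    return entries


if __name__ == "__main__":
    for item in parse_atom_entries(SAMPLE_FEED):
        print(item["updated"], item["title"], item["link"])
```

The same approach should carry over to the RSS output with different element names; the RDF and MODS outputs would likely be better served by a dedicated library.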

OCR

We’re pioneering open-source Optical Character Recognition (OCR) development and deployment, with a focus on newspaper digitization and multilingual support. In 2013, ODW trained Tesseract to recognize Inuktitut for a publication called Inuit Today, digitized by the Multicultural Historical Society of Ontario. We presented on this work at the 2014 OLA SuperConference; here are our slides (PDF). By sharing these resources online, we were able to assist Nunavut Tunngavik Inc. with an expanded training set.

In 2021/2022, as part of our digitization post-production services, we have achieved excellent results processing handwritten materials with Google’s Optical Character Recognition software.

In 2022, we introduced automatic OCR processing to our VITA Digital Collections Toolkit, allowing any typed page to be processed during upload for full-text search and hit-highlighted results.

Data Normalization

We’ve done lots of work using OpenRefine to normalize newspaper indexes prior to ingestion into the VITA Toolkit. Legacy indexes are often captured in unwieldy systems or documents. We generate clean copies for ingest with OpenRefine by: reformatting fields into their intended formats (such as machine-readable dates, subjects, titles, personal or corporate names); removing duplicate records; fixing inconsistencies in spelling and capitalization to create a more controlled vocabulary; and adding authorized Library of Congress Subject Headings and geolocations to provide dynamic links in the final display in VITA. See our YouTube uploads for a series of videos on OpenRefine for data normalization.
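For developers who want to see the shape of these cleanup steps outside OpenRefine, here is a minimal sketch in plain Python. It is not how OpenRefine itself works, nor our actual workflow: the field names (title, date, subject), the list of legacy date layouts, the sample records and the small vocabulary lookup are all assumptions made up for illustration. It covers date reformatting, whitespace and capitalization cleanup against a preferred vocabulary, and duplicate removal; adding subject headings and geolocations is left out.

```python
from datetime import datetime

# Assumed legacy date layouts; a real index may need more or different ones.
DATE_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y", "%m/%d/%Y")

# Hypothetical preferred spellings, keyed by lowercase form.
SAMPLE_VOCABULARY = {
    "obituaries": "Obituaries",
    "municipal government": "Municipal government",
}

# Hypothetical legacy index rows.
SAMPLE_RECORDS = [
    {"title": "  Obituary of  John Smith", "date": "March 3, 1921", "subject": "obituaries"},
    {"title": "Obituary of John Smith", "date": "1921-03-03", "subject": "Obituaries"},
    {"title": "Town council meets", "date": "4 March 1921", "subject": "municipal  government"},
]


def normalize_text(raw):
    """Trim the value and collapse runs of whitespace to single spaces."""
    return " ".join(raw.split())


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no assumed layout matches."""
    text = normalize_text(raw)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_subject(raw, vocabulary):
    """Map a subject to its preferred spelling via case-insensitive lookup."""
    cleaned = normalize_text(raw)
    return vocabulary.get(cleaned.lower(), cleaned)


def normalize_records(records, vocabulary):
    """Clean each record and drop rows that duplicate an earlier cleaned row."""
    seen = set()
    cleaned = []
    for rec in records:
        row = {
            "title": normalize_text(rec["title"]),
            "date": normalize_date(rec["date"]),
            "subject": normalize_subject(rec["subject"], vocabulary),
        }
        key = (row["title"].lower(), row["date"], row["subject"].lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    for row in normalize_records(SAMPLE_RECORDS, SAMPLE_VOCABULARY):
        print(row)
```

Duplicates are detected only after normalization, so two rows that differ just in spacing, capitalization or date layout collapse into one. Dates that match none of the assumed layouts come back as None so they can be reviewed by hand rather than guessed.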