We have big plans to move our tools and infrastructure to an open-source, community-based model. Part of that is making our work available in our GitHub repository. Follow us there or get in touch if you want to be involved!
Any search you perform on OurOntario.ca can be pulled in various metadata formats: Dublin Core, MODS, Solr, RSS, Atom, and now RDF! That’s data from over 200 organizations in a usable Linked Open Data format. We continue to add ways to make data accessible, whether it’s hosted in VITA or indexed in our search portals.
We’re pioneering open-source Optical Character Recognition (OCR) development and deployment, with a focus on newspaper digitization and multilingual support. In 2013, ODW trained Tesseract to recognize Inuktitut for a publication called Inuit Today, digitized and shared by the Multicultural History Society of Ontario. We presented this work at the 2014 OLA SuperConference; our slides are available as a PDF. By sharing these resources online, we were able to assist Nunavut Tunngavik Inc. with an expanded training set.
Digital Public Library Pilot Project
Since 2015, ODW has been working on a pilot project in collaboration with Ryerson University Library and Archives, the British Columbia Provincial Digital Library group, and the Digital Public Library of America community to build a digital public library that can ingest and display institutions’ digital cultural heritage. We’re committed to an open-source, community-based solution for aggregating Canadian content; right now we’re working with Supplejack, an ingest tool created by Digital New Zealand. So far we’ve been lucky enough to offer two paid positions on this project: Andrew Park in 2015, and Matt Barry in 2016-2017. Our work will be available on GitHub.
We’ve done a lot of work using OpenRefine to normalize newspaper indexes before ingesting them into the VITA Toolkit. Legacy indexes are often captured in unwieldy systems or documents. We generate clean copies for ingest with OpenRefine by: converting fields to their intended format (such as machine-readable dates, subjects, titles, and personal or corporate names); removing duplicate records; fixing inconsistencies in spelling and capitalization to create a more controlled vocabulary; and adding authorized Library of Congress Subject Headings and geolocations to provide dynamic links in the final display in VITA. See our YouTube uploads for a series of videos on OpenRefine for data normalization.
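To give a feel for the kind of cleanup described above, here is a minimal Python sketch of three of those steps: converting dates to a machine-readable form, folding spelling and capitalization variants onto one preferred term, and removing duplicate records. It is not our OpenRefine workflow and does not use OpenRefine itself. The field names (title, date, subject), the list of date layouts, and the sample records are all assumptions made up for illustration.

```python
from datetime import datetime

# Hypothetical date layouts a legacy index might contain (assumption).
DATE_FORMATS = ("%B %d, %Y", "%d %b %Y", "%Y-%m-%d", "%m/%d/%Y")


def normalize_date(raw):
    """Return an ISO 8601 date string, or the cleaned input if no layout matches."""
    text = " ".join(raw.split())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return text


def normalize_subject(raw, vocabulary):
    """Collapse whitespace and map known variants onto one preferred term."""
    text = " ".join(raw.split())
    return vocabulary.get(text.lower(), text)


def clean_index(records, vocabulary):
    """Normalize each record, then drop records that became exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        row = {
            "title": " ".join(record["title"].split()),
            "date": normalize_date(record["date"]),
            "subject": normalize_subject(record["subject"], vocabulary),
        }
        key = (row["title"].lower(), row["date"], row["subject"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


# Illustrative data only; not taken from a real index.
VOCABULARY = {"fire": "Fires", "fires": "Fires"}
SAMPLE = [
    {"title": "Barn destroyed  by fire", "date": "March 3, 1921", "subject": "fire"},
    {"title": "Barn destroyed by fire", "date": "1921-03-03", "subject": "FIRES"},
    {"title": "Council meets", "date": "03/04/1921", "subject": "Municipal government"},
]

CLEANED = clean_index(SAMPLE, VOCABULARY)
```

In this sketch the first two sample records collapse into one once their dates and subjects are normalized, which is why deduplication runs after the field cleanup rather than before. Dates that match none of the assumed layouts are passed through unchanged so they can be reviewed by hand.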