Pilot: Handwritten Character Recognition (HCR)

As part of our digitization post-production services, ODW has been achieving excellent results processing handwritten materials with Google’s Optical Character Recognition (OCR) software. For a pilot project, we processed approximately 1,120 duplex pages of pre-1910 handwritten parish registers (births, marriages and deaths, mainly baptisms) digitized from public-use microfilm. Despite the poor quality of the images (scratched film and high-contrast photography), the page images were split, deskewed, cropped and run through the OCR software with very rewarding results.

Applying this to our ongoing work with the Federated Women’s Institutes of Ontario (FWIO), we processed a recent batch of scrapbooks from the Grace Patterson Branch to provide full-text search of the entire contents, whether handwritten or typed. For all-in-one projects, we will continue to apply the HCR software.

Moving forward, we intend to experiment with Microsoft’s Azure HCR support, which may be surpassing Google’s. It is definitely worth running some pages through both to compare! The development of HCR is burgeoning at companies like Google and Microsoft, so we can expect progressively better results over time.

What’s new with VITA 6.3

In April 2022, the VITA Digital Collections Toolkit was upgraded to version 6.3. This release includes a balance of public engagement features and back-end management options. Inspired by feedback from the user community on both sides of the collections, VITA 6.3 focuses on: increasing linked discovery (like indexing non-Newspaper volumes); better search options (like search within Publication and on/off filters for results sets); expanding and scoping VITA collection audiences (with OAI-PMH integrations and IP restricted sites, respectively); and some fun stuff like interactive jigsaw puzzles and enhanced pan-zoom viewing. We hope you will explore the collections to see some of the changes!

Improved & Engaging Public Site options

  • Contribute Audio/Video/Document files to eligible accounts
  • Search within a Publication (e.g. Home & Country Newsletters) allows your results to stay focused on a single volume or newspaper publication
  • Jigsaw puzzles for a different way to interact with historical images
  • Optional indexing for non-newspaper volumes like church or cemetery records
  • Results filters for instant scoping and backing up through results sets
  • Browser-activated audio/video player
  • IIIF viewer for pan-zoom-rotate view of all Full, Detail and Reverse images (e.g. this postcard)
Jigsaw Puzzles

New Audience Options: Integrations and Scoped sites

  • OAI-PMH feature for extending discovery in other spaces like the Digital Public Library of America (DPLA)
  • IP limited sites for collections with access restrictions (talk to us about this option)
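
For readers curious about what an OAI-PMH integration looks like from the harvester's side, here is a minimal sketch of how a request for records is formed. The verb `ListRecords` and the `oai_dc` metadata prefix come from the OAI-PMH protocol itself; the base URL and the function name are placeholders for illustration and do not represent an actual VITA endpoint.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL.

    base_url is a placeholder: substitute the endpoint a repository publishes.
    set_spec optionally limits the harvest to one set (e.g. one collection).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    print(list_records_url("https://example.org/oai"))
```

An aggregator such as DPLA sends requests of this shape on a schedule and indexes the Dublin Core records that come back.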

For a full description of all VITA 6.3 upgrades for the public and VITA users, see our latest VITA Partner newsletter.

Broken tiles: A retro-conversion project

Over time, certain file formats become obsolete. When ODW implemented the first pan-zoom viewer in the VITA Toolkit in the 2010s, it was based on uploading large files made up of hundreds of little tiles all zipped into a folder. The once-free tool is called Zoomify. Over the years, we encouraged our users to “Zoomify” their full images and any pages of multipage items so that those items could be zoomed into and rotated for a dynamic user experience. This was particularly useful for scrapbooks, where pasted items were sometimes in different orientations within a single page. Detailed items like the Welland Canal Records also benefitted from this “Zoomification”. However, these folders of tiles were quite “heavy” (i.e. they required more storage), and some eventually became corrupt.

Zoomify Tool “tiling” an image
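
To give a sense of why a single tiled image turns into hundreds of files, here is a rough sketch of the arithmetic. It assumes 256-pixel square tiles and a pyramid in which each zoom level halves the previous one until the whole image fits in one tile. Those two assumptions, and the function names, are ours for illustration; this is not Zoomify's own code.

```python
import math

TILE_SIZE = 256  # assumed tile edge, in pixels


def pyramid_levels(width: int, height: int, tile: int = TILE_SIZE):
    """Return (width, height) for each zoom level, largest first.

    Each level is half the size of the one before it (rounded up),
    stopping once the image fits inside a single tile.
    """
    levels = [(width, height)]
    while width > tile or height > tile:
        width = math.ceil(width / 2)
        height = math.ceil(height / 2)
        levels.append((width, height))
    return levels


def tile_count(width: int, height: int, tile: int = TILE_SIZE) -> int:
    """Total number of tile files across every zoom level."""
    return sum(
        math.ceil(w / tile) * math.ceil(h / tile)
        for w, h in pyramid_levels(width, height, tile)
    )


if __name__ == "__main__":
    print(tile_count(6000, 4000))
```

Under these assumptions, a 6000 x 4000 pixel scan works out to 513 separate tile files, compared with one file for the same image stored as a JP2.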

Luckily, as technology has advanced and streamlined, the standard is now to use JPEG 2000 (JP2) files that automatically trigger the open-source IIIF (International Image Interoperability Framework) viewer in VITA. So, any user uploading full images, details, or pages can upload the considerably lighter and mobile-friendly JP2 file, and it displays with all the pan, zoom and rotate options people expect for viewing this kind of material online. The trick was that we needed to go through our system and replace the old Zoomify folders with JP2 files. We were able to do this systematically for the most part, but some stubborn items required manual intervention and conversion. We were lucky to have Christine Anderson, a Mohawk College Library Technician student, who was willing and able to take on the task. Here’s Christine’s take on the project:

In my time at ODW, I have worked on (and completed) the Dezoomify project, which primarily involved using the VITA Toolkit to access and replace collection images, along with other software for the conversion process. ODW provided me with a list that identified records with broken Zoomify files, and I got started on the clean-up work!

My primary task was to open and convert the broken Zoomify files and then replace them with JP2 files. This was done for Full images, some Details and Reverse images, as well as for many book and scrapbook pages. Using a RecordID list that was organized by Agency, I could identify all of the records with images that needed to be replaced and re-loaded. 

This work was accomplished by:

  • Using the Dezoomify tool which works by copying and pasting the item’s public URL into the tool
  • “Dezoomify” merges the tiles that make up a Zoomify file and that merged image can then be saved as a JPG
  • I used IrfanView software to convert JPG files to JP2 files, and I assigned their original file names so that the agencies could trace the display files back to their master copies
  • In the data management side of the VITA toolkit, I then activated a task-specific button to replace the broken Zoomify files with newer (and unbroken) JP2 image files
  • When certain Zoomify files were too corrupt for this simpler workflow, a workaround was created:
    • In some cases, I could open the PDF file associated with pages and save them as JP2, although these tend to be quite large, so we adjusted the quality during the conversion process to reduce the storage overhead
    • In other cases, where there was no PDF, I would open an alternate JPG file for Full and Detail images and simply use the standard “Replace” button for the Full or Details file
  • The new files then automatically populate along with their records and remain either public or non-public according to their original setting.
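
The decision steps above can be sketched in a few lines of code. This is only an illustration of the logic: the work itself was done by hand with Dezoomify, IrfanView and the VITA Toolkit, and the record fields and function names below are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class BrokenRecord:
    """One row of a RecordID list (field names are assumptions)."""
    record_id: str
    agency: str
    dezoomify_ok: bool                   # did merging the tiles yield a usable JPG?
    pdf_path: Optional[str] = None       # PDF associated with the page, if any
    alternate_jpg: Optional[str] = None  # alternate JPG on file, if any


def choose_source(record: BrokenRecord) -> str:
    """Pick a route in the order described in the list above."""
    if record.dezoomify_ok:
        return "dezoomify"       # merged tiles -> JPG -> JP2
    if record.pdf_path:
        return "pdf"             # save the PDF page as JP2 at reduced quality
    if record.alternate_jpg:
        return "alternate_jpg"   # load the JPG with the standard Replace button
    return "manual_review"       # nothing usable on file


def jp2_name(original_name: str) -> str:
    """Keep the original file name and change only the extension,
    so the display file can be traced back to its master copy."""
    return Path(original_name).with_suffix(".jp2").name


if __name__ == "__main__":
    rec = BrokenRecord("12345", "Example Library", dezoomify_ok=False,
                       pdf_path="scrapbook_p012.pdf")
    print(choose_source(rec), jp2_name("scrapbook_p012.jpg"))
```

Keeping the file stem unchanged is the important design choice here: it is what lets an agency match each display file to its master copy.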

The JP2 files open in a IIIF viewer and provide excellent pan-zoom capabilities, as the slideshow below illustrates.

The Dezoomify project concentrated mostly on file creation and replacement (for example, in digital collections from libraries’ local history and genealogy departments) and, to an extent, included working on the metadata for the files submitted. The project consisted of many repetitive tasks that could not be automated and had to be completed manually. This was important database work that will ensure the integrity and currency of the files uploaded to the clients’ digital collections and sites going forward.

There will always be advancements in technology standards, and these inevitably require adjustment and retro-conversion activities. With Christine’s work complete, the ODW team was able to purge a considerable overhead of corrupt and cumbersome Zoomify folders from the database. The positive outcomes of this work are a reduction in the affected agencies’ storage and in the cumulative burden of these obsolete files on the servers, plus new technological skills that Christine can carry forward in her career as a Library Technician. It’s a win-win!