Expanding opportunities: Text Data Mining with Newspapers

In late February 2023, the Leddy Library’s Academic Data Centre at the University of Windsor hosted a workshop series called RDM & TDM in JupyterHub with Newspapers. TDM stands for Text Data Mining, and one increasingly common approach highlighted in the workshop is to use text produced by Optical Character Recognition (OCR) of newspapers for text processing. The page images for several of the newspaper titles used in the workshop were enhanced to raise OCR accuracy and better serve TDM technologies.

First observed with the Feb. 4, 1892 edition of the Comber Herald, initial tests with Topaz suggested a 20% improvement in OCR accuracy. This past year has seen the entire collection reprocessed, which allowed the Herald to be included in the corpus for a TDM workshop held at the Leddy Library.
Another sample from 2021, this time the Sept. 22, 1971 edition of the Essex Times. Like the Herald, the Times was completely processed this past year with Topaz.

The series was funded with a grant from Compute Ontario and showcased OurDigitalWorld’s extensive history with newspaper digitization, as well as its long-standing partnership with the University of Windsor. The growing interest in TDM among Ontario libraries was further confirmed by a Colloquium on TDM in Libraries held at the University of Toronto in early May 2023. The use of newspapers for TDM was a major theme of the colloquium, and a common strategy was identified in which newspaper collections become substantial data assets for text processing.

ODW collections not only formed the basis of the Windsor workshop series; a subsequent data challenge using the newspaper collection was also launched in March, featuring a collaboration with Hackforge and a kick-off event at the LaSalle public library in partnership with the Essex County Library System. The Compute Ontario grant that supported these activities also funded two graduate students, Akram Vasighizaker and Sumaiya Deen Muhammad, to carry out original research, and the results are publicly available on the workshop GitHub site.

TDM is an exciting new direction for newspaper digitization. It represents a convergence of recent advances in artificial intelligence (AI) and machine learning with what is frequently the most extensive record of a community’s past, the local newspaper. Combining TDM with digitized newspapers makes it easier to gain unique insights into the past and to identify trends and patterns, and it is hoped that ODW can continue to help libraries contribute to this promising area of research.
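
For readers curious what identifying trends in newspaper text can look like in practice, here is a minimal Python sketch that counts how often a word appears in each publication year. It is only an illustration and not the code used in the workshop: the sample issues, the tokenize helper and the term_counts_by_year function are all hypothetical names and data invented for this example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample data: (publication year, OCR text) pairs standing in
# for real newspaper issues. None of this text comes from an actual collection.
SAMPLE_ISSUES = [
    (1892, "The harvest was good and the railway brought visitors."),
    (1892, "Railway news: a new station opens."),
    (1971, "The harvest festival returns to town."),
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts_by_year(issues):
    """Count how often each word appears in each publication year."""
    counts = defaultdict(Counter)
    for year, text in issues:
        counts[year].update(tokenize(text))
    return counts


counts = term_counts_by_year(SAMPLE_ISSUES)
print(counts[1892]["railway"])  # occurrences of "railway" in the 1892 issues
print(counts[1971]["railway"])  # a Counter reports 0 for words it never saw
```

Real OCR text is noisier than this sample, which is one reason the accuracy gains described above matter: a misread word simply goes uncounted.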

What’s new with VITA 6.4

VITA Digital Collections Toolkit was upgraded in September 2022, making it easier for users to provide better attribution and search results. With this version, users can automatically assign copyright labels, process text items with OCR and hit highlighting, benefit from an improved display for linked index records, and more.

Exciting new changes include:

  • digital files uploaded as category “page” can automatically generate OCR and apply hit highlighting to search results – great for newspaper issues, documents, even headstone photos!
  • copyright holder statements can be automatically applied to serial publications 95 years old or younger (here’s how)
  • index records with links to digital pages will now display the linked page image in the details panel instead of the sidebar
  • personal information and cookies policy statements are now available for both VITA users and the public
  • apply “section” fields for non-newspaper pages, e.g. chapter headings
  • updated “help” for on-screen support (and correlating MAP updates)
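
To make the 95-year rule for serial publications in the list above concrete, the check can be sketched in a few lines of Python. This is an assumption-laden illustration, not VITA's actual logic: the function name is hypothetical, and age is approximated as the difference between calendar years, which may not match how the toolkit measures it.

```python
from datetime import date

# Threshold taken from the release note: serial publications
# "95 years old or younger" can receive an automatic statement.
MAX_AGE_YEARS = 95


def qualifies_for_auto_statement(publication_year, today=None):
    """Return True if a serial issue is 95 years old or younger.

    Hypothetical helper: age is approximated as the difference between
    calendar years, which is an assumption of this sketch.
    """
    today = today or date.today()
    age = today.year - publication_year
    return age <= MAX_AGE_YEARS


# Example using the upgrade month (September 2022) as the reference date.
reference = date(2022, 9, 1)
print(qualifies_for_auto_statement(1971, reference))  # a 1971 issue
print(qualifies_for_auto_statement(1892, reference))  # an 1892 issue
```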

Want to stay up to date with VITA Toolkit news? Use the subscription form on the home page of the VITA Help site.

Pilot: Handwritten Character Recognition (HCR)

As part of our digitization post-production services, ODW has been achieving excellent results processing handwritten materials with Google’s Optical Character Recognition software. For a pilot project, we processed approximately 1120 duplex pages of pre-1910 handwritten parish registers (births, marriages and deaths, mainly baptisms) digitized from public-use microfilm. Despite the poor quality of the images (scratched film and high-contrast photography), the page images were split, deskewed, cropped and run through the OCR software, with some very rewarding results.

Applying this to our ongoing work with the Federated Women’s Institutes of Ontario (FWIO), we processed a recent batch of scrapbooks from the Grace Patterson Branch to provide full-text search of the entire contents, whether handwritten or typed. For all-in-one projects we will continue to apply the HCR software.

Moving forward, we intend to experiment with Microsoft’s Azure HCR support, which may be surpassing Google’s project. It is definitely worth comparing some pages! The development of HCR is burgeoning at companies like Google and Microsoft, so we can expect progressively better results over time.