Expanding opportunities: Text Data Mining with Newspapers

In late February 2023, the Leddy Library’s Academic Data Centre at the University of Windsor hosted a workshop series called RDM & TDM in JupyterHub with Newspapers. TDM stands for Text Data Mining, and one increasingly common approach highlighted in the workshop is to run text processing on the Optical Character Recognition (OCR) output from digitized newspapers. The imagery for several of the newspaper titles used in the workshop was enhanced to raise OCR accuracy and better serve TDM technologies.

First observed with the Feb. 4, 1892 edition of the Comber Herald, initial tests with Topaz suggested a 20% improvement in OCR accuracy. This past year has seen the entire collection reprocessed, which allowed the Herald to be included in the corpus for a TDM workshop held at the Leddy Library.
Another sample from 2021, this time the Sept. 22, 1971 edition of the Essex Times. Like the Herald, the Times was completely processed this past year with Topaz.

The series was funded with a grant from Compute Ontario and showcased OurDigitalWorld’s extensive history with newspaper digitization, as well as its long-standing partnership with the University of Windsor. The growing interest in TDM among Ontario libraries was further confirmed by a Colloquium on TDM in Libraries event held at the University of Toronto in early May 2023. The use of newspapers for TDM was a major theme for the colloquium and a common strategy was identified where newspaper collections become substantial data assets for text processing.

OurDigitalWorld (ODW) collections formed the basis of the Windsor workshop series and of a subsequent data challenge launched in March, which featured a collaboration with Hackforge and a kick-off event at the LaSalle public library in partnership with the Essex County Library System. The Compute Ontario grant that supported these activities also provided funding for two graduate students, Akram Vasighizaker and Sumaiya Deen Muhammad, to carry out original research; the results are publicly available on the workshop GitHub site.

TDM is an exciting new direction for newspaper digitization. It represents a convergence between recent advances in artificial intelligence (AI) and machine learning and what is frequently the most extensive record of a community’s past: the local newspaper. Combining TDM with digitized newspapers makes it easier to identify trends and patterns and to gain unique insights into the past, and it is hoped that ODW can continue to help libraries contribute to this promising area of research.
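For readers new to TDM, the idea of finding patterns in OCR text can be illustrated with a very small sketch. This is not the workshop's code: the sample sentences, the `SAMPLE_PAGES` dictionary, and both function names are invented for illustration, and a real corpus would be loaded from OCR output files rather than typed in.

```python
import re
from collections import Counter

# Hypothetical OCR snippets keyed by publication year (invented for illustration).
SAMPLE_PAGES = {
    1892: "The harvest was good and the railway brought new settlers to town.",
    1971: "Council debated the new highway; the harvest fair drew record crowds.",
}

def tokenize(text):
    """Lowercase the text and keep runs of 3+ letters, a crude filter for short OCR fragments."""
    return re.findall(r"[a-z]{3,}", text.lower())

def term_counts_by_year(pages):
    """Return a Counter of token frequencies for each year."""
    return {year: Counter(tokenize(text)) for year, text in pages.items()}

counts = term_counts_by_year(SAMPLE_PAGES)
for year in sorted(counts):
    # Compare how often two terms appear in each year's text.
    print(year, counts[year]["harvest"], counts[year]["railway"])
```

Counting terms per year is only a starting point, but it shows why OCR accuracy matters: a misread word is simply missing from the counts.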

What’s new with VITA 6.4

The VITA Digital Collections Toolkit was upgraded in September 2022, making it easier for users to provide better attribution and search results. With this version, users can automatically assign copyright labels, process text items with OCR and hit highlighting, take advantage of an improved display for linked index records, and more.

Exciting new changes include:

  • digital files uploaded as category “page” can automatically generate OCR and apply hit highlighting to search results – great for newspaper issues, documents, even headstone photos!
  • copyright holder statements can be automatically applied to serial publications 95 years old or younger (here’s how)
  • index records with links to digital pages will now display the linked page image in the details panel instead of the sidebar
  • personal information and cookies policy statements are now available for both VITA users and the public
  • “section” fields can now be applied to non-newspaper pages, e.g. chapter headings
  • updated “help” for on-screen support (and correlating MAP updates)

Want to stay up to date with VITA Toolkit news? Use the subscription form on the home page of the VITA Help site.

Digitizing the Angelo Principe Italian-Canadian Newspaper Collection

Adapted from The ‘Angelo Principe’ Italian Canadian Newspaper Collection by Dr. Matteo Brera

Masthead of La Vittoria (The Victory) Italian-Canadian newspaper

In 2014, researcher and scholar Dr. Angelo Principe donated his extensive newspaper and book collection to the Clara Thomas Archives and Special Collections of York University Libraries. The ‘Angelo Principe Collection’ includes materials entrusted to him for preservation by Italian Canadian activists from the first half of the twentieth century like Attilio Bortolotti and Benny Bottos, as well as the surviving documents belonging to Augusto Bersani, transnational political activist, facilitator and secret agent for the Royal Canadian Mounted Police (RCMP).

Six years later, a key part of the collection was digitized in a collaboration between Michael Moir, Head of the Clara Thomas Archives and Special Collections, and OurDigitalWorld, resulting in a unique online collection of rare interwar Italian-language newspapers published in North America. These include Il Bollettino Italo-Canadese, Il Cittadino Canadese, Il Giornale Italo-Canadese, Il Lavoratore, L’Araldo del Canada, L’Italia, L’Italia Nuova, L’Italo Canadese, L’Operaio Italo-Canadese, La Vittoria, La Voce degli Italo-Canadesi, and La Voce Operaia. The newspapers were processed using OurDigitalWorld’s multilingual Optical Character Recognition (OCR) software and are full-text searchable in both English and Italian.

The significance of this donation cannot be overstressed. Thanks to Michael Moir’s vision in working with OurDigitalWorld, and to Dr. Matteo Brera for his work adding rich contextual and descriptive metadata to the collection items, Dr. Principe’s legacy for the study of the construction of the Italian Canadian identity and transcultural exchanges between the Old and the New World is manifest in this online collection, providing an invaluable research tool to be used and enjoyed by scholars and the community.

Explore the collection at https://vitacollections.ca/yul-italiancanadiannewspapers/search

This research and digitization project was conceptualized and directed by Dr. Matteo Brera (mbrera@yorku.ca) and was made possible by generous funding from the Zorzi Family Italian-Canadian Archival Fund, established in 2017 and dedicated to encouraging the study of Italian-Canadian archival materials. The project was also sponsored by York University’s Faculty of Liberal Arts and Professional Studies.  

Daily British Whig 1902-1926 now online

OurDigitalWorld is excited to announce that the Daily British Whig from 1902-1926 is online. The Frontenac Heritage Foundation undertook the project to digitize this significant set of community news, covering the first of the World Wars, and make the papers available as part of the larger Kingston newspaper collection hosted by the Kingston Frontenac Public Library.

With the addition of these almost 90,000 pages, the online Kingston newspaper collection has doubled in size and now spans more than 100 years, from 1810 to 1926. The Digital Kingston VITA Toolkit site at http://vitacollections.ca/digital-kingston/search allows users to search by keyword and to use facets to sort or narrow results by date, publication, and more.

Daily British Whig October 9, 1909

OurDigitalWorld worked with Library and Archives Canada via the Canadian Research Knowledge Network to access and digitize the microfilm copies, and with the University of Windsor to achieve high-quality positional OCR processing. The newspapers are uploaded into the VITA Digital Toolkit for search and display, with full-text search and hit-highlighted results. Frontenac Heritage Foundation member John Grenville used the new primary materials to research local architect Ernest Beckwith, designer of the Orpheum Theatre in Kingston, and his searches returned very specific results.

ODW, the Kingston Frontenac Public Library, and the Frontenac Heritage Foundation encourage genealogists, students, and other researchers to use and explore this important set of newspapers. To read the full press release and for contact information regarding the project, click here.

Featured Image courtesy of Maritime History of the Great Lakes Digital collection