Expanding opportunities: Text Data Mining with Newspapers

In late February 2023, the Leddy Library’s Academic Data Centre at the University of Windsor hosted a workshop series called RDM & TDM in JupyterHub with Newspapers. TDM stands for Text Data Mining, and one increasingly common approach highlighted in the workshop is to use text produced by Optical Character Recognition (OCR) from newspapers as the input for text processing. The imagery for several of the newspaper titles used in the workshop was improved to raise OCR accuracy and better serve TDM technologies.

The 20% improvement in OCR accuracy suggested by initial tests with Topaz was first observed with the Feb. 4, 1892 edition of the Comber Herald. This past year has seen the entire collection reprocessed, which allowed the Herald to be included in the corpus for a TDM workshop held at the Leddy Library.
Another sample from 2021, this time the Sept. 22, 1971 edition of the Essex Times. Like the Herald, the Times was completely processed this past year with Topaz.
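The post does not say how the OCR accuracy improvement was measured. One common way to compare OCR output before and after image enhancement is character accuracy against a hand-checked transcription, computed from edit distance. The sketch below assumes that approach; the function names and the sample strings are invented for illustration and are not taken from the Herald or from any ODW tooling.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, reference):
    """Return 1 minus the character error rate, floored at 0."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    error_rate = edit_distance(ocr_text, reference) / len(reference)
    return max(0.0, 1.0 - error_rate)


# Invented strings showing typical OCR confusions ("m" read as "rn", "l" as "1").
reference = "Comber Herald"
ocr_before = "Cornber Hera1d"
ocr_after = "Comber Herald"

print(f"before: {character_accuracy(ocr_before, reference):.2f}")
print(f"after:  {character_accuracy(ocr_after, reference):.2f}")
```

In practice the reference would be a manually corrected sample of pages, and the same pages would be scored before and after reprocessing.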

The series was funded with a grant from Compute Ontario and showcased OurDigitalWorld’s extensive history with newspaper digitization, as well as its long-standing partnership with the University of Windsor. The growing interest in TDM among Ontario libraries was further confirmed by a Colloquium on TDM in Libraries event held at the University of Toronto in early May 2023. The use of newspapers for TDM was a major theme for the colloquium and a common strategy was identified where newspaper collections become substantial data assets for text processing.

ODW collections not only formed the basis of the Windsor workshop series but also supplied the newspaper collection for a subsequent data challenge launched in March, which featured a collaboration with Hackforge and a kick-off event at the LaSalle public library in partnership with the Essex County Library System. The Compute Ontario grant that supported these activities also provided funding for two graduate students, Akram Vasighizaker and Sumaiya Deen Muhammad, to carry out original research, and the results are publicly available on the workshop GitHub site.

TDM is an exciting new direction for newspaper digitization and represents a convergence of recent advances in artificial intelligence (AI) and machine learning with what is frequently the most extensive record of a community’s past, the local newspaper. Applying TDM to digitized newspapers makes it easier to gain unique insights into the past and to identify trends and patterns, and it is hoped that ODW can continue to help libraries contribute to this promising area of research.
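As a rough illustration of what identifying trends in a newspaper corpus can look like, the sketch below counts how often a term appears in OCR text grouped by year. It is a minimal example, not the method used in the workshop; the function names and the sample snippets are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return each run of letters as a token."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts_by_year(issues, term):
    """Count occurrences of `term` per year.

    `issues` is an iterable of (year, ocr_text) pairs.
    """
    counts = defaultdict(int)
    for year, text in issues:
        counts[year] += Counter(tokenize(text))[term.lower()]
    return dict(sorted(counts.items()))


# Hypothetical OCR snippets, invented for illustration only.
sample_issues = [
    (1892, "The railway meeting was held. Railway fares discussed."),
    (1892, "Council met on Monday."),
    (1971, "The old railway station is to close."),
]

print(term_counts_by_year(sample_issues, "railway"))
```

Real OCR text is noisier than this, so counts like these are usually normalized by the number of words per year and read alongside OCR accuracy estimates.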

OurDigitalWorld Newsletter March 2023

Challenge Accepted: Data Mining with Digitized Newspapers

Text Data Mining is an exciting new direction for newspaper digitization projects, leveraging the most recent advances in artificial intelligence (AI) and machine learning with what is frequently the most extensive record of a community’s past, the local newspaper.

Exploring Multicultural History

Recent work by Guanqiao (Tony) Fu, a fourth-year student in the History program at University of Toronto Mississauga, has resulted in three exhibit projects focusing on the Portuguese Canadian Diaspora, Chinese Migrants in Canada, and Ontario’s Experiences of Wars and Conflicts over the past 100 years.

Ontario Legislature Scrapbook Hansard Indexing Project

The earliest Hansard records were made by journalists and published in Canada’s early national newspapers. These were clipped into a scrapbook and stored at the Ontario Legislative Library. The microfilmed books were scanned in 2019 and uploaded using the VITA Toolkit for public access. Almost 4,000 index records have been uploaded so far, with particular attention to capturing personal names and subject headings that detail the Parliamentary session discussions and Acts.

Holodomor Survivor Videos

Working together, the Holodomor Research Education Centre (HREC) and Ukrainian Canadian Research & Documentation Centre (UCRDC) have shared the first batch of a large collection of Holodomor Survivor interviews.

The OurOntario.ca search site has a fresh new look!

Our favourite discovery site is now even easier to browse, with updated media type and contributor links, a cleaner look and feel, on/off facets for result sets, and more.

Community newsletter highlights pioneer diaries

The South Marysburgh Mirror is a community newsletter from Prince Edward County, Ontario. Highlights include the transcription of Nelson Hicks’ Diaries from the turn of the 20th century.

Uncovering the Collections

Discover the vast number of resources in the VITA Toolkit collections including: Soldiers & Veterans, Built Heritage, Church Records & Vital Statistics, City & Telephone Directories.

Read the full newsletter and explore the linked collections

Celebrating Black History Month – every month

This month we’re highlighting collections that take a deep dive into the Black experience. For Black History Month on social media, we’ve showcased some significant resources in VITA Collections for understanding and exploring Black history in Ontario and beyond.

As always, we recommend reviewing and searching the Abolitionist Newspaper collections. Past blog posts have featured these papers in articles like Fugitive Voices: Black-run periodicals in Abolition-era Canada and news about the re-scanning of the Voice of the Fugitive newspaper in Announcing the Abolitionists Collection.

Southwestern Ontario was a major crossing point for fugitive slaves and freemen coming from the United States. To learn more about this aspect of Chatham-Kent and area, check out the wonderful exhibit “Let us march on until Victory is won,” from Chatham-Kent Museum https://vitacollections.ca/ckmuseums/620/exhibit.

Family history collections often end up at local archives and public libraries. One is the Richard Bell Family Fonds at Brock University, which includes 85 photographs, tintypes and documents spanning c.1850 to 1950. The extended family lived in London and St. Catharines, and the collection includes birth, death and marriage certificates, images from family Bibles, snapshots from family day trips, and more. https://images.ourontario.ca/Brock/2817492/gallery

Community collections often highlight significant citizens. We want to broadcast some of the stories we’ve found from our VITA client collections. For example, Bob Turner, previously a catcher for the Chicago White Sox, became Colborne’s first Recreational Director https://vitacollections.ca/cramahelibrary/355/exhibit.

Halton Region boasts the remarkable veteran Henry Thomas Shepherd, who fought in both World Wars and whose story is shared in a virtual exhibit created by Halton Hills Public Library https://vitacollections.ca/HaltonHillsImages/558/exhibit.

And read on about Dr. Saint-Firmin Monestime. Born in Port-au-Prince, Dr. Monestime moved from Haiti to open a practice in Timmins, but a chance encounter in a restaurant convinced him to put down roots instead in Mattawa, a Northern Ontario town where he would later become the first Black mayor in Canada: https://vitacollections.ca/multiculturalontario/476/exhibit/17.

Black History in Canada includes advocacy and civil action for human rights here and around the world. The Centre for Research on Latin America and the Caribbean (CERLAC) houses resources for researchers and scholars and features a fascinating Black Voices collection on their site https://vitacollections.ca/cerlacresourcecentre/search.

Researchers of all kinds use the collections to find historical background and illustrations for their work. Recently, a researcher contacted us with thanks for the transcripts of Schooner Days, which provided background material about Caymanian Captain Culrose McLaughlin (1896-1992) and two-time Canada’s Cup winner Commodore Aemilius Jarvis (1860-1940) for her article Black Yachting History.

We’re privileged to be able to promote and share these collections online, resources that can help us celebrate Black History every month. If you have a story or collection you want highlighted, contact us at info@ourdigitalworld.org.

Banner image “Henry and Susanna Maude Shepherd Family” courtesy of Halton Hills Public Library.

Using historic timelines to describe multicultural experiences

This is a guest post by Guanqiao (Tony) Fu, a student in the History program at University of Toronto Mississauga. 

As an intern, I am working with OurDigitalWorld on creating digitized historical timeline exhibits. This is an opportunity for me to learn new skills and knowledge while putting my historical research skills to use.

Town Celebrates Portuguese Month, Oakville Beaver, May 7, 2003

Working with OurDigitalWorld to create timelines that tell the stories of others through creative exhibits is exactly the experience that will refine my research skills while reminding me that studying history puts the story of humanity in our hands. I am working on three exhibit projects, focusing on the Portuguese Canadian Diaspora, Chinese Migrants in Canada, and Ontario’s Experiences of Wars and Conflicts over the past 100 years. Finding and selecting primary sources and interpreting them without bias in my storytelling is perhaps the greatest challenge I have encountered so far, and it is also where I have learned the most. Over the past three months of researching, organizing and training, I have become more familiar with the ways of understanding primary sources and how to use my understanding to produce works, a necessary skill for historians to succeed.

Photo of Portuguese protesters at Toronto City Hall 1971
Anti-fascist and anti-colonial protest by PCDA members on Nathan Phillips Square, 1971

What is most interesting about my experience creating timelines with OurDigitalWorld is perhaps my positionality as a Chinese international student studying in Canada. I have been staying in Canada since the COVID-19 pandemic hit, and my own cultural background and social bonds have enabled me to interpret the historic development here from a different perspective. I often have the feeling that even though I have no difficulties interacting and connecting with people here as an international student, I am still an “outsider” who cannot fully interpret or comprehend other people’s experiences. For example, I discovered material about Portuguese settlers in the 1970s actively participating in the pacifist protest against Portugal’s military operations in Africa despite their Canadian nationality. At first I found it hard to interpret living as both Portuguese and Canadian at the same time because I grew up in a monocultural society where this diversity is unimaginable. I could not help but try to interpret Portuguese settlers’ identities separately, and it was not until I read about the communal networks Portuguese settlers established to honour their traditions that I realized it is possible for a community to be proud members of several societies without breaking off from their own culture. Different communities have their own culture, and these cultural traditions continuously develop regardless of how each community chooses to express them. There is no absolute contradiction between cultures, only differences in how people interpret and treat the ways each culture expresses itself.

And this is what historical study is all about! No matter how we express our perception towards a topic, a community, or perhaps an event, we inevitably use our own perspective, which is shaped by our own culture and experiences. Working with OurDigitalWorld allows me to further reflect on my own positionality, to expand my own knowledge of history while expressing my findings using timelines that are both creative and informative. To me, history is about the greater cause of telling humanity’s story by collecting representative instances and putting them together as a portrait of our civilization, and this practical work is both an exciting and enjoyable experience that extends my career as an undergraduate student far beyond classrooms and libraries.

See Guanqiao’s exhibit and timeline about the Portuguese Diaspora in Ontario here.

OurDigitalWorld Newsletter December 2022

As 2022 draws to a close, our latest Quarterly newsletter offers some wonderful news and updates on important work:

  • Celebrating 10 years of OurDigitalWorld
  • 90 Years after the Holodomor
  • Projects
    • Enacting Reconciliation
    • Ensuring Accessibility
    • OurOntario.ca upcoming upgrades
  • Digital Collection Highlights
    • Making news in the Durham Region
    • Explore the Greater Chicago Area
  • Register now for OLA Super Conference 2023

Read more here

Ensuring Accessible Digital Collection Sites

This is a guest post by Olivia Najdovski, student at University of Toronto iSchool.

Desktop with green arrow on screen, surrounded by ear icon, eye icon, brain icon, and hand icon

One in five Canadians has a disability. As such, it is critical to design websites so that they are accessible to all. From October to December, I worked with OurDigitalWorld to conduct an accessibility audit of the VITA Digital Collections Toolkit base site code. The goal of this project was to bring the sites into line with the Web Content Accessibility Guidelines (WCAG) 2.0. This process involved using WebAIM’s WAVE Browser Extension in addition to manual reviews to flag accessibility issues on the Toolkit sites, using both Safari and Chrome browsers.

More specifically, the manual review involved combing through each individual webpage to pinpoint issues relating to keyboard accessibility and screen-reader compatibility. Some accessibility functionality is built into the Toolkit, like creating alt text for images from their titles, but this review process revealed some key issues that, with the web development talents of the ODW team, we were able to resolve.

One issue we resolved across the Toolkit sites was a lack of labelling on buttons. When buttons or links are not accurately labelled, screen readers cannot pick up on what the purpose of that button or link is. Therefore, screen reader users cannot make use of the button, because the screen reader cannot relay what the button does. To remedy this, we ensured that buttons and links were accurately labelled across the toolkit sites, significantly improving the accessibility of the sites for screen reader users.
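To make the labelling problem concrete, here is a small sketch of an automated check for buttons and links that have no obvious accessible name. It is not the WAVE extension and not the audit procedure described above; the class name and sample markup are hypothetical, and the rules are a deliberate simplification of how screen readers work out a control's name (visible text, `aria-label`, `aria-labelledby`, `title`, or the `alt` text of an image inside the control).

```python
from html.parser import HTMLParser


class UnlabelledControlFinder(HTMLParser):
    """Collect <button> and <a> elements with no obvious accessible name."""

    def __init__(self):
        super().__init__()
        self.unlabelled = []  # (tag, line number) for each flagged control
        self._open = None     # the control currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("button", "a") and self._open is None:
            labelled = bool(
                (attrs.get("aria-label") or "").strip()
                or (attrs.get("title") or "").strip()
                or attrs.get("aria-labelledby")
            )
            self._open = {"tag": tag, "line": self.getpos()[0], "labelled": labelled}
        elif tag == "img" and self._open is not None:
            # An image with non-empty alt text can name the enclosing control.
            if (attrs.get("alt") or "").strip():
                self._open["labelled"] = True

    def handle_data(self, data):
        # Any visible text inside the control counts as a label.
        if self._open is not None and data.strip():
            self._open["labelled"] = True

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            if not self._open["labelled"]:
                self.unlabelled.append((tag, self._open["line"]))
            self._open = None


# Hypothetical markup: the first button and the first link have no label.
sample_html = """<button><i class="icon-search"></i></button>
<button aria-label="Search"><i class="icon-search"></i></button>
<a href="/next"><img src="arrow.png" alt=""></a>
<a href="/help">Help</a>"""

finder = UnlabelledControlFinder()
finder.feed(sample_html)
finder.close()
print(finder.unlabelled)
```

A check like this only catches missing names; whether a label is accurate still needs the kind of manual review described in this post.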

Sample application of the WAVE add-on during testing of a VITA Toolkit site

The great news is that incorporating small changes like labelling buttons and including additional informative alt text for images improved the accessibility and inclusivity of the Toolkit websites. Accessibility is an ongoing process, however, and can be compromised with any client-based content or site changes over time. This is a good step forward in keeping with ODW’s mandate of providing full and inclusive public access to community digital collections.

Some resources for WCAG review:

What’s new with VITA 6.4

VITA Digital Collections Toolkit was upgraded in September 2022, making it easier for users to provide better attribution and search results. This version upgrade means users can automatically assign copyright labels, process text items with OCR and hit highlighting, see an improved display for linked index records, and more…

Exciting new changes include:

  • digital files uploaded as category “page” can automatically generate OCR and apply hit highlighting to search results – great for newspaper issues, documents, even headstone photos!
  • copyright holder statements can be automatically applied to serial publications 95 years old or younger (here’s how)
  • index records with links to digital pages will now display the linked page image in the details panel instead of the sidebar
  • personal information and cookies policy statements are now available for both VITA users and the public
  • apply “section” fields for non-newspaper pages e.g. Chapter headings
  • updated “help” for on-screen support (and correlating MAP updates)
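The copyright item in the list above amounts to a simple age test. The sketch below shows one way such a rule could be expressed; it is not VITA's implementation. The function name is hypothetical, and measuring age by calendar year rather than exact publication date is an assumption.

```python
from datetime import date

# Threshold taken from the release note: 95 years old or younger.
COPYRIGHT_WINDOW_YEARS = 95


def needs_copyright_statement(publication_year, today=None):
    """Return True if an issue published in `publication_year` falls in the window.

    Assumption: age is the difference between calendar years.
    """
    today = today or date.today()
    age_in_years = today.year - publication_year
    return age_in_years <= COPYRIGHT_WINDOW_YEARS


# Evaluated as of the September 2022 upgrade date.
as_of = date(2022, 9, 1)
for year in (1926, 1927, 1971):
    print(year, needs_copyright_statement(year, today=as_of))
```

Passing `today` explicitly keeps the rule testable and makes it clear which reference date a batch of serial issues was evaluated against.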

Want to stay up to date with VITA Toolkit news? Use the subscription form on the home page of the VITA Help site.

ODW Quarterly Newsletter: September 2022

Farewell to Summer!

We hope you’ve had a wonderful few months to relax and take a break from the hustle and bustle. We’re gearing up for a busy season and looking forward to everything this autumn will bring!

This Quarterly shares some exciting news: upcoming collaborations, digital collection highlights, VITA Toolkit upgrades, conferences and save-the-dates.

Read the full newsletter here.

Painting by George Wolfe courtesy of Thames Art Gallery Permanent Collection

Inconvenient Exposure: Managing controversial content in digital collections

From family history to wrongful arrests to genocide denial, our community collections are reaching more people in more places, and not everyone is happy about it. So, how do you handle online pushback about your digital collections? Is it censorship or good policy to remove a newspaper article from the collection because someone’s checkered past is affecting their present? What happens when a collection sheds new light on a controversy?

This session discusses a wide array of examples of individual and community response to controversial content online. ODW Projects Coordinator Jess Posgate talks about how organizations are managing everything from personal information removal requests to hacked servers as new or buried narratives emerge through digitization. The session hopes to instigate conversation around planning digitization of controversial – or potentially controversial – material with respect and honesty, audience experience with in-house policies around personal information, and idea sharing for sustainable and comprehensive community representation online.

Presenting in 2022 to audiences at the Ontario Library Association Super Conference and the Atlantic Provinces Library Association conference, Jess Posgate walks through scenarios that might be familiar to some and provides tips on creating organizational policy to safeguard our community members when local history goes global.

Pilot: Handwritten Character Recognition (HCR)

As part of our digitization post-production services, ODW has been achieving excellent results processing handwritten materials with Google’s Optical Character Recognition software. For a pilot project, we processed approximately 1120 duplex pages of pre-1910 handwritten parish registers (births, marriages, deaths, mainly baptisms) digitized from public-use microfilm. Despite the poor quality of the images (scratched film and high-contrast photography), the page images were split, deskewed, cropped and run through the OCR software, with some very rewarding results.

Applying this to our ongoing work with the Federated Women’s Institutes of Ontario (FWIO), we processed a recent batch of scrapbooks from the Grace Patterson Branch to provide full text search of the entire contents, whether handwritten or typed. For all-in-one projects we will continue to apply the HCR software.

Moving forward, we intend to experiment with Microsoft’s Azure HCR support, which may be surpassing Google’s project, so comparing some pages is definitely worth trying! The development of HCR is burgeoning at companies like Google and Microsoft, so we can expect progressively better results over time.