Improving access to heritage newspaper content: Replacing microfilm with original paper scans

From guest blogger Walter Lewis, Great Lakes historian and software developer for OurDigitalWorld.

Some years ago the Center for Archival Collections at Bowling Green State University organized the microfilming of many of the early issues of both the Marine Record and the Marine Review up to the end of 1902. In 2010, we added just short of 17,000 pages from that microfilm to the Maritime History of the Great Lakes website, covering the years 1883-1902. Thanks to issues shared by the Dossin Museum in Detroit, along with Ron Beaupre and Greg Rudnick, I have been able not only to extend the coverage of the Marine Review to its end in 1935, but also to replace all but 2,500 pages of the microfilm with images from the originals. The result is just over 55,000 pages of marine journalism published in Cleveland, Ohio. The journals had deep roots in Great Lakes shipping, although from World War I onward there was an increasing emphasis on global developments.

One question I have been asked is “why go to the time and effort to re-shoot the issues from the originals?” A couple of examples may explain why.

Almost all microfilm is photographed in black and white, with an emphasis on high-contrast exposures that improve the ability to read the text on standard microfilm readers. The company that digitized the BGSU microfilm emphasized this contrast in the files they produced for us. For pages from the era of woodcut engravings this is less of a concern, although the additional generation of negative/positive print before digitization can still introduce focus issues. The challenge in many films comes from shadows in the gutters when the paper wasn’t disbound before filming (true here). Content in those columns may come up very dark, and after digitization, black on black. In part this is because many digitization projects, especially ones done ten or more years ago, were struggling to reduce file size and assumed that bitonal images (i.e., each pixel is either black or white) would be acceptable. In some instances they are. But with the increasing use of photographs in the 1890s, the threshold of greyness at which a given point on the page is converted to either black or white makes for some very unhappy images. The Marine Review prided itself on its illustrations. Reshooting these, not just in greyscale but in colour, restored a significant amount of detail. This was especially true when some earlier owner of the issue had marked it up with a blue or other coloured pencil.
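To make the bitonal problem concrete, here is a minimal sketch (not the digitization vendor's actual process): a fixed threshold maps every grey level to pure black or pure white, which suits line art and crisp text but flattens the midtones that carry all the detail in a photograph.

```python
def binarize(pixels, threshold=128):
    """Map 0-255 greyscale values to bitonal 0 (black) or 255 (white)."""
    return [[0 if p < threshold else 255 for p in row] for row in pixels]

# A smooth photographic gradient of grey values...
gradient = [[16, 64, 112, 144, 192, 240]]

# ...collapses into two flat regions; all intermediate shading is lost.
print(binarize(gradient))  # [[0, 0, 0, 255, 255, 255]]
```

A darker threshold rescues some shadow detail at the cost of blowing out highlights; no single cut-off preserves a halftone photograph the way a greyscale or colour scan does.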

Image from a microfilm scan

The conversion to bitonal files also has a significant impact on the quality of the Optical Character Recognition (OCR) of the files. This is a computer process that converts the images of the text to text that can be searched in our indexes. When, for example, letters have parts that print more faintly, or where there is bleed-through from the ink on the other side of the page, the results are far from satisfactory.

From the Microfilm

Iuststpruentthereisalittle flurryin Wuhingtoubetween the navy department and the Marine Hospital service. navy ‘departlnent has recently yent 050.000 establishing a coding nation at Dry Tor- tuga: an In
equt wha considers. u the island. the most im- rortan ‘ 1ss::erntheChesa eand Central America. A ew bp g was en rised to receive a notifiatiqn from the ta-usury department to stop war at Dr! TOTIIIEII 5! t\P”‘ 1. 55 Surgeon General W needed the place to are for yellow lever and
bubonic plague patients. The ma thinks that the-e are sevwal other adjacent s a avail: . ‘lfllfvgaa and will de- elinetosurrenderDry ortugasnnlesslpecl yofildtdmdoiohr lhepréktthirnlell  .

From a reshoot of the original

A FLURRY OVER DRY TORTUGAS. Just at present there is a little flurry in Washington between the navy
department and the Marine Hospital service. The navy department has
recently spent about $50,000 in establishing a coaling station at Dry Tor-
tugas and in equipping what it considers, upon the island, the most im-
portant strategic base between the Chesapeake and Central America. A
few days ago Secretary Long was surprised to receive a notification from
the treasury department to stop work at Dry Tortugas by April 1, as
Surgeon General Wyman needed the place to care for yellow fever and
bubonic plague patients. The navy department thinks that there are
several other adjacent spots available for hospital purposes and will de-
cline to surrender Dry Tortugas unless specifically ordered to do so by
the president himself.

There are still minor gaps in the files where pages are missing from issues, and a significant number of early issues are still missing, but the results are worth the effort. Now if we could only locate some issues of the Record from before 1883.

To read the full article, see Walter Lewis’ Maritime History of the Great Lakes website.

Expanding opportunities: Text Data Mining with Newspapers

In late February 2023, the Leddy Library’s Academic Data Centre at the University of Windsor hosted a workshop series called RDM & TDM in JupyterHub with Newspapers. TDM stands for Text Data Mining, and one increasingly common approach to TDM highlighted in the workshop is the use of Optical Character Recognition (OCR) output from newspapers for text processing. The imagery for several of the newspaper titles used for the workshop was improved to raise OCR accuracy levels and better serve TDM technologies.
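As a hypothetical illustration of what "text processing" means here (not the workshop's actual notebooks, which are available on its GitHub site), the simplest TDM task is tokenizing OCR output and counting term frequencies:

```python
from collections import Counter
import re

# Hypothetical OCR output from a digitized newspaper page.
ocr_text = """The schooner arrived at Windsor on Tuesday.
The cargo of the schooner was grain bound for Montreal."""

# Lowercase, tokenize, and count word frequencies.
tokens = re.findall(r"[a-z]+", ocr_text.lower())
freq = Counter(tokens)
print(freq.most_common(2))  # [('the', 3), ('schooner', 2)]
```

Real TDM pipelines layer stemming, named-entity recognition, and topic modelling on top of counts like these, which is why OCR accuracy matters so much upstream.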

The improvement was first observed with the Feb. 4, 1892 edition of the Comber Herald, where initial tests with Topaz suggested a 20% improvement in OCR accuracy. This past year has seen the entire collection reprocessed, which allowed the Herald to be included in the corpus for a TDM workshop held at the Leddy Library.
Another sample, from 2021, is the Sept. 22, 1971 edition of the Essex Times. Like the Herald, the Times was completely reprocessed this past year with Topaz.

The series was funded with a grant from Compute Ontario and showcased OurDigitalWorld’s extensive history with newspaper digitization, as well as its long-standing partnership with the University of Windsor. The growing interest in TDM among Ontario libraries was further confirmed by a Colloquium on TDM in Libraries event held at the University of Toronto in early May 2023. The use of newspapers for TDM was a major theme for the colloquium and a common strategy was identified where newspaper collections become substantial data assets for text processing.

ODW collections formed the basis not only of the Windsor workshop series but also of a subsequent data challenge using the newspaper collection, launched in March, which featured a collaboration with Hackforge and a kick-off event at the LaSalle public library in partnership with the Essex County Library System. The Compute Ontario grant that supported these activities also provided funding for two graduate students, Akram Vasighizaker and Sumaiya Deen Muhammad, to carry out original research, and the results are publicly available on the workshop GitHub site.

TDM is an exciting new direction for newspaper digitization and represents a convergence of recent advances in artificial intelligence (AI) and machine learning with what is frequently the most extensive record of a community’s past: the local newspaper. The power of TDM applied to digitized newspapers enhances unique insights into the past and the identification of trends and patterns, and it is hoped that ODW can continue to help libraries contribute to this promising area of research.

What’s new with VITA 6.4

The VITA Digital Collections Toolkit was upgraded in September 2022, making it easier for users to provide better attribution and search results. This version upgrade means users can automatically assign copyright labels, process text items with OCR and hit highlighting, share improved displays for linked index records, and more…

Exciting new changes include:

  • digital files uploaded as category “page” can automatically generate OCR and apply hit highlighting to search results – great for newspaper issues, documents, even headstone photos!
  • copyright holder statements can be automatically applied to serial publications 95 years old or younger (here’s how)
  • index records with links to digital pages will now display the linked page image in the details panel instead of the sidebar
  • personal information and cookies policy statements are now available for both VITA users and the public
  • apply “section” fields for non-newspaper pages, e.g., chapter headings
  • updated “help” for on-screen support (and correlating MAP updates)

Want to stay up to date with VITA Toolkit news? Use the subscription form on the home page of the VITA Help site.

Pilot: Handwritten Character Recognition (HCR)

As part of our digitization post-production services, ODW has been achieving excellent results processing handwritten materials with Google’s Optical Character Recognition software. For a pilot project, we processed approximately 1,120 duplex pages of pre-1910 handwritten parish registers (births, marriages, deaths, mainly baptisms) digitized from public-use microfilm. Despite the poor quality of the images (scratched film and high-contrast photography), the page images were split, deskewed, cropped and run through the OCR software, with some very rewarding results.
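The pre-processing steps above can be sketched roughly as follows. This is a simplified illustration, assuming each microfilm frame holds two register pages side by side; real deskewing and cropping would use an imaging library such as Pillow or OpenCV rather than plain pixel arrays.

```python
def split_duplex(frame):
    """Split a two-page-wide scan (rows of pixel values) into left and right pages."""
    mid = len(frame[0]) // 2
    left = [row[:mid] for row in frame]
    right = [row[mid:] for row in frame]
    return left, right

# A tiny stand-in for a scanned frame: 2 rows x 4 columns of pixel values.
frame = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
left, right = split_duplex(frame)
print(left)   # [[1, 2], [5, 6]]
print(right)  # [[3, 4], [7, 8]]
```

Each half would then be deskewed, cropped to the page edges, and submitted to the OCR/HCR engine as a single page image.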

Applying this to our ongoing work with the Federated Women’s Institutes of Ontario (FWIO), we processed a recent batch of scrapbooks from the Grace Patterson Branch to provide full-text search of the entire contents, whether handwritten or typed. For all-in-one projects we will continue to apply the HCR software.

Moving forward, we intend to experiment with Microsoft’s Azure HCR support, which may be surpassing Google’s and is definitely worth comparing on some sample pages. The development of HCR is burgeoning at companies like Google and Microsoft, so we can expect progressively better results over time.