Generative AI Use Cases for Archives

Workflows powered by generative AI and machine learning products have great potential to improve archival processes and facilitate access to invaluable cultural heritage material.

As the first independent Jewish school of higher education in North America, Gratz College stewards unique collections related to the American Jewish experience. As a data librarianship consultant for the Grayzel Digital Collections, I worked with Dr. Alison Joseph, the Director of Digital Scholarship at Gratz, to develop and evaluate AI-powered workflows for several archival use cases. This post reflects on the successes and challenges faced in the process.

Use Case 1: Oral History Audio Transcripts

The Grayzel Digital Collections platform will feature recordings and transcriptions from the Holocaust Oral History Archive, among the earliest collections of oral history interviews with Holocaust survivors in the United States. Primarily recorded at the American Gathering of Holocaust Survivors in 1983, the collection comprises more than 900 interviews, most of which have accompanying transcripts. However, approximately 150 interviews remained un-transcribed as of early 2024.

Evaluating Audio Transcription Tools

Using a sample of audio clips derived from the interview recordings, I tested three speech-to-text products:

Amazon Transcribe
Anthropic's Claude AI
OpenAI's Whisper

Whisper demonstrated a high level of accuracy with accented English speech and seemed to predict proper nouns well.

One challenge I encountered early on was enabling _speaker diarization_ alongside the speech-to-text functionality. Amazon Transcribe had a diarization feature built in, but I found it did not fit the project requirements well, as it is geared toward customer service call transcription rather than interviews.

Fortunately, I found a Python library named WhisperX that combines the Whisper model with the audio diarization library pyannote-audio. Running WhisperX locally, I was able to generate relatively high-accuracy transcription data for the un-transcribed interview recordings. I then converted the JSON data returned from WhisperX into Microsoft Word documents to meet the Grayzel Collection's transcription format guidelines. These preliminary transcripts were then handed off to human reviewers for refinement and quality control.

Use Case 2: Historical Handwritten Text Recognition (HTR)

Gratz has a remarkable collection of correspondence belonging to the College's founder Rebecca Gratz. Many of these letters remain un-transcribed, making digital access difficult.

Building on my experiences with the audio transcription project, I tested out Claude AI and OpenAI's ChatGPT to see how well the models could transcribe historical English handwriting. Both models did well, particularly with shorter texts that fit within a single context window for the AI model. Because I had more familiarity with the OpenAI API, I chose to use the OpenAI products for this project, though Claude AI performed very well.

As the handwritten letters were delivered to me in PDF form, I first needed to convert them to image files. For this, I used ImageMagick. Once I had suitable images of the letters, I prepared them for submission to the OpenAI Batch API, with the following included as a system prompt:

You are a historical handwriting recognition expert skilled at transcribing handwriting from the 1800s.

    1. When provided images of handwritten letters, please transcribe the letters carefully.

    2. Maintain line breaks where you see them.

    3. Don't interpolate or make up information.

    4. Return only the text of the letter, without any comments or introduction.

    5. Maintain punctuation exactly as it appears.

    6. Maintain spelling exactly as it appears, as the letters may use antiquated spelling.

    7. Maintain capitalization exactly as it appears.

    8. If you are uncertain about a word or phrase, put it in brackets [like this].

    9. Indicate page breaks using [end of page].

In several cases, the resulting transcriptions were surprisingly good, preserving historical spelling and maintaining lines even with skewed handwriting. *_In the end however, the model stumbled over proper nouns and created enough errors that it took more time for expert human reviewers to correct the problems than to simply transcribe the original texts themselves._*

In this case, the AI tools did not fit the existing workflow in place for the Grayzel Collections. Nonetheless, the AI models demonstrated high levels of verbal reasoning, and they could likely be useful in similar workflows for historical HTR. In at least one instance, the model predicted a word that had stumped the archivists. For these situations, AI-assisted HTR may be worth trying.

Use Case 3: Optical Character Recognition (OCR) for Archival Inventories and Finding Aids

In contrast to handwritten text, the AI models I tested performed very well at optical character recognition of printed documents. This enabled an extract-transform-load (ETL) workflow, wherein PDFs of archival inventories and finding aids goes in one end, and structured JSON data comes out the other end.

By providing OpenAI's Chat Completions API with a JSON Schema object in the response_format property, extracted information can be reliably transformed into JSON. I was able to use this feature to extract box, folder, and item information from PDF inventories of items in Gratz's holdings. This data can then be used to facilitate digital discovery and access.

All in all, I have learned a lot about new generative AI products by working on these projects for Gratz College. The forthcoming Grayzel Digital Collections platform will feature the collections described in this post and others. It is sure to be an invaluable resource for those interested in American Jewish history and archives.