Since August 2022, the AHRC-funded LUSTRE project has explored how Artificial Intelligence (AI) can unlock the potential of born-digital and digitised government archives. The project has engaged in numerous activities, including a journal special issue on “When data turns into archives: making digital records more accessible with AI” (forthcoming in AI & Society). Alongside these outputs, the project has hosted four online lunchtime talks and four workshops.
The culmination of these efforts was LUSTRE Workshop 4, “The Future of AI to Unlock Digital Records,” which explored the emerging trends, challenges, and transformative innovations shaping the future of AI in the GLAM (Galleries, Libraries, Archives, and Museums) sector. Set against the backdrop of unprecedented technological advancements in AI, this workshop aimed to spark dialogue, foster collaboration, and inspire action among government professionals, GLAM sector professionals, and academics.
This fourth workshop, organised with the support of the Science Museum in London, took place on 27 and 28 June 2024. The presentation slides and abstracts from this workshop can be accessed HERE.
The first day of the workshop focused on institutional challenges in AI and archives, as well as the ethics of AI.
The first session, “AI and Archives: Institutional Challenges,” started with a talk by Dr James Lappin from the Central Digital and Data Office. In his presentation on “Approaches for Using AI to Manage Records at Scale,” Lappin discussed pathways for applying AI to the management of digital records. He emphasised the use of AI to enhance records management by clustering data and separating valuable business content from trivial information. This approach could significantly streamline the process of sifting through vast amounts of data, making it easier to identify and preserve important records.
Lappin pointed out that data clustering algorithms could be used to create new clusters within any aggregation (for example an email account) that contains a wide variety of content. Related correspondence within an email account could thus be clustered together under a common tag or classifier. This would support the filtration of trivial and social content from within individual accounts. It would also offer a potential way for individuals to grant their successor-in-post access to business content within their individual account.
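A minimal sketch of the kind of clustering Lappin described, assuming plain-text exports of messages from a single account and using scikit-learn; the folder name, cluster count, and the idea of reviewers tagging clusters as business or trivial content are illustrative assumptions, not his actual implementation.

```python
# Sketch: cluster email bodies with TF-IDF and k-means so related correspondence
# can be grouped under a common tag. Paths and parameters are assumptions.
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Load plain-text exports of messages from one email account (hypothetical layout).
messages = [p.read_text(encoding="utf-8", errors="ignore")
            for p in sorted(Path("mailbox_export").glob("*.txt"))]

# Represent each message by its TF-IDF weighted vocabulary.
vectors = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(messages)

# Group messages into a small number of clusters; each cluster can then be
# reviewed and tagged (for example as "business" or "trivial/social") by a records manager.
kmeans = KMeans(n_clusters=8, random_state=0, n_init=10).fit(vectors)

for cluster_id, text in zip(kmeans.labels_, messages):
    print(cluster_id, text[:60].replace("\n", " "))
```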
Overall, Lappin explored ways of managing the tension between using AI to improve the way records are organised, and preserving the context of records, including the way they were originally aggregated.
Next, Callum McKean from the British Library (BL) presented a talk on AI applications for personal digital archives. McKean highlighted the challenges and successes in developing AI-based approaches for managing large, varied collections, often stored on legacy media, at the BL. He shared specific examples of how AI is being used to process and preserve personal digital archives with significant data protection issues.
For example, the Hybrid Correspondence Collections project (2023) used Python and Gephi to visualise the paper and digital correspondence of Harold Pinter as networks. This involved manually creating metadata sheets for subsections of the paper correspondence, and writing a Python script to extract equivalent data from the headers and bodies of emails. The metadata was then enriched, for example with geo-coded IP addresses. The project resulted in a Python script for creating GDPR-compliant email metadata, as well as network visualisations in Gephi.
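A minimal sketch of this kind of pipeline, not the British Library's actual script: it reads header metadata from .eml files with Python's standard email library, pseudonymises addresses by hashing (one possible GDPR-minded step), and writes an edge list that Gephi can import. The folder name, hashing choice, and CSV layout are assumptions.

```python
# Sketch: build a pseudonymised correspondence edge list from .eml files for Gephi.
import csv
import hashlib
from email import policy
from email.parser import BytesParser
from pathlib import Path

def pseudonymise(address: str) -> str:
    """Replace an email address with a short, stable hash."""
    return hashlib.sha256(address.strip().lower().encode()).hexdigest()[:10]

rows = []
for eml in Path("emails").glob("*.eml"):
    with eml.open("rb") as fh:
        msg = BytesParser(policy=policy.default).parse(fh)
    sender = msg.get("From", "")
    date = msg.get("Date", "")
    for recipient in (msg.get("To", "") or "").split(","):
        if recipient.strip():
            rows.append((pseudonymise(sender), pseudonymise(recipient), date))

# Gephi accepts a CSV edge list with Source and Target columns.
with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Date"])
    writer.writerows(rows)
```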
McKean pointed out that it is not inevitable that hybrid and born-digital collections will be made widely available in the future. There is a possibility that archival institutions will serve as storage for largely inaccessible records. For McKean, “our ability to leverage in any future ‘AI revolution’ relies upon our ability to build capacity now.” Experimentation, along with a willingness to risk failure, is essential to building this capacity.
The last talk of the session was given by Dr Lise Jaillant, who provided a user’s perspective on “The Future of Access to Digital Records,” discussing the complexities of access and the potential for AI to improve user experience and metadata creation. She emphasised the importance of making digital records accessible to a broader audience and how AI can add new layers to metadata, thus enhancing discoverability and usability of archival materials.
In the panel discussion following this session, participants discussed the ongoing challenges in developing robust AI solutions. Another question addressed the future of large language models; Jaillant suggested using small samples to experiment and play with these models, highlighting the need for practical engagement to understand their capabilities and limitations.
The second session delved into “AI Ethics and Archives.” Professor Claire Warwick from Durham University discussed “1990s Cyberspace and the Future of AI Policy,” drawing parallels between early internet hazards and current AI challenges. She highlighted the importance of ethical considerations in the development and deployment of AI technologies, warning against potential pitfalls that could arise from neglecting these aspects.
Following this, Dr David Brown from Trinity College Dublin explained how integrating tools like Transkribus and ChatGPT can streamline transcription and generate valuable insights from archival documents, making them more accessible and easier to analyse.
Indeed, the integration of Transkribus, a state-of-the-art AI tool for transcribing historical documents, and ChatGPT-4o, an advanced language model, can enhance research into the history of Early Modern Ireland. The project presented by Brown involved converting over 10,000 images of seventeenth-century state papers and other archival documents into machine-readable text using Transkribus. This vast corpus was then utilised to train a custom model within ChatGPT-4, tailored specifically for historical research.
Researchers were able to pose sophisticated research questions and receive coherent, contextually relevant responses. Additionally, the talk showcased various outputs generated by ChatGPT-4, including summaries of complex documents, extraction of metadata, and the generation of commentaries and illustrations. These outputs exemplify the model’s ability to assist in both the interpretation and presentation of historical data, providing a powerful tool for historians and researchers.
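As a rough illustration of how a transcribed document can be put to a GPT-4-class model, the sketch below sends one Transkribus transcription to the OpenAI chat completions API and asks for a summary and basic metadata. This is not Brown's custom model; the model name, file path, and prompt wording are illustrative assumptions, and the openai Python package plus an OPENAI_API_KEY environment variable are required.

```python
# Sketch: ask a GPT-4-class model to summarise a transcribed state paper
# and propose metadata. Model name, path and prompt are assumptions.
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()

transcription = Path("state_papers/page_001.txt").read_text(encoding="utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are assisting a historian of seventeenth-century Ireland."},
        {"role": "user",
         "content": "Summarise this document and list its date, author, "
                    "addressee and places mentioned:\n\n" + transcription},
    ],
)
print(response.choices[0].message.content)
```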
Concluding the session, Rebekah Taylor and Angharad Turner from the Independent Office for Police Conduct addressed the critical question, “Getting Ready for AI at IOPC: First, What Is the Problem and Is AI the Answer?” They discussed the organisational context and the challenges of integrating AI into records management. Their talk emphasised the need to clearly define the problems AI is intended to solve and to assess the level of risk associated with the application of AI. Taylor and Turner also shared resources within the government sector that have been useful for their work.
In the subsequent panel discussion, participants debated the ethical implications of using AI in student assignments, the readiness of institutions like the Science Museum Group to adopt AI, and strategies to involve more people from the GLAM sectors in AI initiatives. These discussions highlighted the necessity of involving a broader range of stakeholders in these conversations.
Day 2 shifted focus to future opportunities and practical implementations of AI in the GLAM sector. The morning session, “AI and the GLAM Sector: Opportunities and Challenges,” began with a presentation by Nicole Coleman from Stanford University discussing “Legal Design for an AI Future: A Case Study in Law Enforcement Policy Manuals.”
Coleman presented a project to discover and make publicly accessible the data within California law enforcement agency policy manuals. These policy manuals were made public in 2020 by an act of California legislation. The goals of the legislation are to educate the public by making such information easily accessible, to increase communication and community trust, to enhance transparency, and to save costs and labour by reducing the number of individual information requests.
However, compliance with the law as written does not necessarily mean the materials are easily accessible online. The team applied large language models to pull semi-structured data out of the documents into a more easily analysable and accessible form. Coleman’s conclusion is that applying AI to information retrieval of this kind is not easy and requires costly verification work to ensure accuracy.
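One way to see where that verification cost comes from is a simple check like the sketch below: after an LLM (or any extractor) proposes field values for a policy manual, confirm whether each value literally appears in the source text before trusting it. The field names, file name, and example values are hypothetical, not from Coleman's project.

```python
# Sketch: flag extracted fields whose values cannot be found verbatim in the source,
# so they can be routed to human review.
def verify_extraction(source_text: str, extracted: dict) -> dict:
    """Return, for each extracted field, whether its value is literally present in the source."""
    lowered = source_text.lower()
    return {field: str(value).lower() in lowered for field, value in extracted.items()}

with open("policy_manual.txt", encoding="utf-8") as f:
    manual_text = f.read()

proposed = {"agency": "Example Police Department", "effective_date": "2020-01-01"}
print(verify_extraction(manual_text, proposed))
# Fields that fail this literal-match check still need a person to confirm them,
# which is where much of the cost Coleman describes arises.
```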
Following this, Dr Javier de la Rosa from the National Library of Norway shared insights on their extensive digital collection and the AI lab’s work in digitisation and access improvement. He highlighted the innovative uses of AI in the library’s operations, including automating the digitisation process and improving the accessibility of their collections. In particular, de la Rosa gave an overview of the National Library of Norway’s pioneering work developing AI models for Norwegian languages, including Sámi, ensuring accuracy and inclusivity.
The final talk of the session was given by Professor Paul Gooding from the University of Glasgow, who concluded the session with “iREAL: Indigenising Requirements Elicitation for Artificial Intelligence in Libraries.”
The talk provided an overview of the iREAL project, funded as an AHRC Bridging Responsible AI Divides scoping project, which aims to develop a model for responsible AI systems development in libraries seeking to include knowledge from Indigenous communities, specifically Aboriginal and Torres Strait Islander peoples in Australia. Co-designed with Indigenous communities, the project will lead to a model that can be applied when considering whether, and how, to use Indigenous knowledges in library-developed or adopted systems.
A breakout group session then provided participants with the opportunity to discuss specific questions in smaller groups. Key questions and discussion points included:
1. How can AI technologies be used to enhance access to digital records for various user groups, including researchers and the general public?
Discussions highlighted that AI could significantly improve access to digital records by summarising information, creating more accurate and comprehensive metadata, improving search functionalities, and developing user-friendly interfaces. Challenges such as dealing with sensitive information and ensuring ethical use of AI were also discussed.
2. What are the main challenges institutions face when implementing AI solutions for digital archives, and how can these challenges be overcome?
Participants noted that institutions often face technical, financial, and organisational challenges. Technical challenges include the need for robust infrastructure and skilled personnel. Financial constraints can limit the ability to invest in necessary technology and training. Organisationally, there may be resistance to change and a lack of understanding of AI’s potential. Collaborative approaches and clear communication of AI benefits were suggested as ways to overcome these challenges. Strategies to engage with senior colleagues regarding AI and its application to records were also discussed.
3. What ethical considerations should be taken into account when applying AI to digital archives, and how can institutions ensure that their use of AI is responsible and fair?
The discussions emphasised the need for transparency in AI processes and the importance of addressing biases in AI training data. Ensuring that AI decisions are understandable and explainable to users was highlighted as crucial for maintaining trust. Participants also noted the risk of biased data leading to biased outcomes, stressing the importance of ethical AI training and decision-making.
4. In what ways can AI assist in identifying and managing sensitive information within digital archives, and what measures should be taken to ensure accuracy and privacy?
AI’s potential to identify sensitive information was acknowledged, but concerns about accuracy and privacy were raised. Developing clear guidelines and protocols for AI use in this context was suggested to mitigate risks. Additionally, ensuring regular review and updates to AI models was discussed.
5. How might the integration of AI change the workflows and daily tasks of archivists, and what new skills or training might be required?
The integration of AI is expected to transform archival workflows by automating routine tasks and allowing archivists to focus on more complex issues. Training in AI technologies and data management was identified as crucial for adapting to these changes. Participants also noted that while AI can be disruptive, it is essential to plan for its integration and to focus on its role as a tool that enhances, rather than replaces, human expertise.
Participants also discussed the usefulness of various outputs from the workshop, including case study reports specific to the GLAM sectors, recommendations on applying AI to archives, and ways to support ongoing conversations about AI. Sharing experiences with peers, understanding the practical applications of AI, and strategies to get senior colleagues on board were also highlighted as important outcomes.
After the breakout group session, the final session addressed “The Future of AI and Archives.” Professor Richard Marciano from the University of Maryland discussed “Harnessing Generative AI to Support Exploration and Discovery in Archival Collections,” emphasising the role of generative AI in improving archival exploration and discovery.
This talk explored the use of Large Language Models to facilitate the analysis and visualisation of newspaper advertisements from the Maryland State Archives related to the trading of enslaved people. The study focused on the Domestic Traffic Ads collection of the State of Maryland between 1824 and 1864, which exposes chattel slavery practices in which buyers and sellers interacted to exchange and share human beings, often for social and domestic benefit. This case study is part of a larger project exploring computational treatments to remember the Legacy of Slavery (CT-LoS), towards reasserting erased memory.
Following this, David Canning and Kelcey Swain from the Cabinet Office presented “Using AI to Review Records in Cabinet Office,” detailing their methodology for handling vast amounts of digital records and the benefits of AI in achieving efficiency and accuracy. Their talk provided practical examples of how AI is being used to streamline records review processes at the Cabinet Office (CO).
With a “digital heap” consisting of millions of digital records, the CO team experimented with a digital methodology to aggressively reduce the number of records without historical relevance. Advantages of this review process include speed, consistency and accuracy.
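A minimal sketch of two common first steps in reducing a digital heap, offered purely as illustration and not as the Cabinet Office's actual methodology: removing exact duplicates by content hash, and flagging obviously ephemeral material with simple keyword rules for human review. The folder name and keyword list are assumptions.

```python
# Sketch: deduplicate files by hash and flag likely-ephemeral content for review.
import hashlib
from pathlib import Path

EPHEMERAL_HINTS = ("calendar invite", "out of office", "newsletter")

seen_hashes = set()
keep, flagged = [], []

for path in Path("digital_heap").rglob("*"):
    if not path.is_file():
        continue
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in seen_hashes:
        flagged.append(path)           # exact duplicate of something already kept
        continue
    seen_hashes.add(digest)
    text = path.read_text(encoding="utf-8", errors="ignore").lower()
    if any(hint in text for hint in EPHEMERAL_HINTS):
        flagged.append(path)           # likely ephemeral; queue for human check
    else:
        keep.append(path)

print(f"kept {len(keep)} files, flagged {len(flagged)} for review")
```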
The final talk of the session and the workshop was given by Dr Tim Boon from the Science Museum, who concluded with a presentation of the AHRC-funded project “Congruence Engine.” Boon discussed innovative digital tools for exploring industrial histories. He illustrated how the Congruence Engine links disparate data sources, creating a richer and more connected understanding of historical collections.
The workshop concluded with a final panel discussion, summarising the key insights from the session. Topics discussed included the use of AI in oral history projects through speech-to-text and topic modelling, the continuous updating of language models, and the fragility of cultural memory and ways to protect it.
Overall, the fourth LUSTRE workshop offered valuable insights into the intersection of AI and born-digital archives, emphasising the importance of continued discussions and collaborations to address emerging challenges and harness AI’s potential.