Machine interpretation of handwritten source material
Reference number | |
Coordinator | Riksarkivet |
Funding from Vinnova | SEK 400 000 |
Project duration | April 2020 - May 2021 |
Status | Completed |
Venture | AI - Competence, ability and application |
Call | Start your AI-journey! For public organizations |
Important results from the project
This project examined how Handwritten Text Recognition (HTR) techniques can be applied to handwritten archive material held by the Swedish National Archives (Riksarkivet). An HTR model has been created that automatically transcribes 22,500 pages of text from the second half of the 19th century. The model was trained on 940 manually transcribed pages ("ground truth") created by volunteers, and achieves a character error rate of 2.7%. The HTR model is available via the Transkribus platform, and the texts are searchable on the National Archives' website.
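For readers unfamiliar with the metric, character error rate (CER) is conventionally defined as the edit distance between the model output and the ground-truth transcription, divided by the length of the ground truth. The sketch below assumes that standard definition; the project's exact evaluation procedure is not described here, and the function names are illustrative only.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions and substitutions needed to turn hypothesis into reference."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented example strings, not taken from the project's material.
    ground_truth = "Riksarkivet"
    model_output = "Riksarkivot"
    print(f"CER: {character_error_rate(ground_truth, model_output):.3f}")
```

Under this definition, a CER of 2.7% corresponds to roughly 97% of characters being transcribed correctly.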
Expected long term effects
The HTR model transcribes a historical archive with about 97% character accuracy, better than most people can manage and considerably faster. About 6 months of manual work, mainly performed by volunteers, went into creating the training data for this project. Transcribing the entire material manually would probably have taken at least 6 years. The potential of HTR is thus great. An HTR model can also form the basis for new models adapted to other materials. HTR will be a powerful tool for genealogists, local history researchers, and data-driven research.
Approach and implementation
The work of creating HTR models and manual transcriptions took place on the Transkribus platform, combining AI tools with citizen science. Before images can be transcribed, the text lines need to be identified. This is done automatically but requires manual correction, which took more time than expected. The HTR texts were then exported to standard XML formats (ALTO, PAGE, and TEI). Close collaboration with external actors, researchers, and volunteers is an important part of the continued work with HTR at the National Archives.
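As an illustration of how exported transcriptions might be consumed downstream, the sketch below reads the text of each line from a PAGE XML document using Python's standard library. The sample document is invented for the example and is much smaller than a real export; the sketch assumes the usual PAGE layout in which each TextLine element carries its text in a direct TextEquiv/Unicode child, and it matches elements by local name so that it does not depend on a particular PAGE namespace version.

```python
import xml.etree.ElementTree as ET

# Invented, minimal PAGE-style sample; real exports contain much more metadata.
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Coords points="100,100 900,100 900,160 100,160"/>
        <TextEquiv><Unicode>Första raden</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,180 900,180 900,240 100,240"/>
        <TextEquiv><Unicode>Andra raden</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(page_xml: str) -> list[tuple[str, str]]:
    """Return (line id, transcribed text) pairs in document order."""
    root = ET.fromstring(page_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        text = ""
        # Only look at direct children, so that word-level TextEquiv
        # elements nested deeper in the line are not picked up instead.
        for child in element:
            if local_name(child.tag) == "TextEquiv":
                for grandchild in child:
                    if local_name(grandchild.tag) == "Unicode":
                        text = grandchild.text or ""
        lines.append((element.get("id"), text))
    return lines


if __name__ == "__main__":
    for line_id, text in extract_lines(SAMPLE_PAGE_XML):
        print(line_id, text)
```

The Coords element on each line holds the polygon produced by the line-detection step described above, so the same traversal could be extended to link each transcribed line back to its position in the image.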