Beyond 2022 presents at Transkribus/READ Consortium conference in Vienna

Dr. Dave Brown of the Beyond 2022 project presented at the READ Consortium conference in Vienna on 9 November 2018, explaining how the exceptionally rich replacement materials identified by Beyond 2022 are suitable for HTR (Handwritten Text Recognition) processing, which could make them fully searchable. To date, Beyond 2022 has identified over 300 volumes from the Record Commission of Ireland (1810-30) suitable for such HTR treatment, collectively representing as much as 20 million words of text recoverable from the losses of the 1922 fire.

Some individuals transcribed up to 25,000 pages over a period of many years. With so many examples of very large quantities of text produced by a single hand, the Irish Record Office transcriptions might as well have been prepared with Transkribus in mind.

The collections reflect the cataloguing arrangements in the original record office and the largest sets of copies deal with topics central to the study of Irish history: the establishment of English governing structures in the Middle Ages, the Elizabethan conquest, the Plantation of Ulster, the Cromwellian occupation of Ireland, the Williamite wars and the breaking up of the great landed estates in the nineteenth century. All areas of history are covered in these transcripts, however, and the material includes early census-type records, trade, legal judgements and a wide range of smaller thematic collections related to specific towns and cities. The digitisation is most advanced for the Cromwellian period, 1650-1659, and the scale of documents recovered surpasses that which has survived for most parts of England.

Transkribus works very well on large, relatively uniform collections such as these. Several HTR models have been prepared, each trained on 15,000 words. As the number of trained models increased, a separate project emerged to investigate whether the existing models could be used to produce a partial HTR transcription of a sample from the next set of documents, and so speed up the creation of each subsequent ground truth. It was decided to create a single-page ground truth for each new example and to compare it with the text automatically generated by each model in the project, in order to find the best model to work with.
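The comparison described above can be illustrated with a minimal sketch. It assumes that each model's output for the sample page has been exported as plain text, and that "best" means the lowest character error rate (CER) against the single-page ground truth. CER is a common measure for HTR output, but the project does not state which measure it used, and the function names and model names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, hypothesis) / len(ground_truth)


def best_model(ground_truth: str, model_outputs: dict[str, str]):
    """Score every model's text against the ground truth page and
    return the name of the lowest-CER model plus all the scores."""
    scores = {
        name: character_error_rate(ground_truth, text)
        for name, text in model_outputs.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Invented sample text and hypothetical model names, for illustration only.
    page = "granted to the said John the lands of Ballymore"
    outputs = {
        "hand_A": "granted to the said John the lands of Ballymore",
        "hand_B": "granted to the sayd Iohn the landes of Balymore",
    }
    name, scores = best_model(page, outputs)
    print(name, scores)
```

Under these assumptions, the model with the lowest score would be the one used to generate a first-pass transcription of the new hand, which is then corrected by hand to build its ground truth.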