Daniela Mârza
Transcribing Historical Population Sources Written in Cyrillic: Methodological Challenges in Training HTR Models for Romanian Parish Registers in Transylvania
Article Information
Pages: 127-152
DOI: https://doi.org/10.24193/RJPS.2025.2.05
Daniela Mârza*
*Babeş-Bolyai University, Centre for Population Studies, Cluj-Napoca, Romania, elena.marza@ubbcluj.ro
Abstract. The large-scale digitization of archival holdings has created new opportunities for historical population research, but effective access to handwritten sources remains limited due to the absence of reliable automatic transcription tools. This paper presents the training of a Handwritten Text Recognition (HTR) model for the automatic transcription and transliteration of Romanian parish registers written in Cyrillic characters, a category of sources that constitutes a substantial yet difficult-to-access component of modern Romanian documentation. Focusing on Orthodox parish registers from Transylvania dating from the early nineteenth century, the paper combines a historical overview of Romanian Cyrillic writing with a methodological discussion of transliteration and transcription practices, followed by an empirical assessment of trials conducted using the Transkribus platform. The results obtained so far reveal high Character Error Rates and demonstrate that the standard HTR training workflow is insufficient for producing a functional automatic solution for this type of material. The main obstacles arise from the lack of orthographic standardization, the structural mismatch between the Cyrillic alphabet and the Romanian language, graphic polysemy, abbreviations, superscriptions, irregular spacing, and significant variation in handwriting. These difficulties show that transcription in this context cannot be fully automated and must instead be approached as a hybrid, semi-automatic process that integrates HTR, rule-based transliteration, lexical validation, and sustained human intervention. By documenting both the progress achieved and the limitations encountered, this article contributes to ongoing debates in digital humanities and historical demography regarding the applicability of artificial intelligence to complex historical sources.
It argues that, despite current constraints, even imperfect automatic transcriptions can significantly enhance accessibility and research efficiency, provided their use is methodologically transparent and critically informed.
Keywords: automatic transcription, handwritten text recognition (HTR), Romanian Cyrillic, parish registers, historical demography, digital humanities, Transkribus
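The Character Error Rate mentioned in the abstract is conventionally computed as the Levenshtein edit distance between an automatic transcription and a ground-truth transcription, divided by the length of the ground truth. The Python sketch below illustrates that conventional definition only; it is an assumption that the article's figures follow it, and the function names and sample strings are hypothetical, taken neither from the article nor from Transkribus.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical ground truth and model output, for illustration only.
    truth = "botezat"
    output = "botesat"
    print(f"CER: {character_error_rate(truth, output):.3f}")
```

Because the denominator is the reference length, a CER above 1.0 is possible when the model output is much longer than the ground truth.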