Science, Technologies, Innovations №1(21) 2022, 29-37


Anda Baklāne — Master of Philosophy, Researcher and Head of Digital Research Services at the Department of the Development of Digital Services, National Library of Latvia, 3, Mūkusalas Str., Riga, Latvia, LV-1423; +(371)67806100; ORCID: 0000-0002-0301-2504

Valdis Saulespurēns — Master of Computer Science, Researcher and Data Analyst at the Technology Department of the National Library of Latvia, 3, Mūkusalas Str., Riga, Latvia, LV-1423; +(371)67806100; ORCID: 0000-0002-9665-0125


Abstract. Over the last 20 years, topic modeling, and the latent Dirichlet allocation (LDA) model in particular, has become one of the most commonly used techniques for exploratory analysis and information retrieval from textual sources. Although topic modeling has been used in a large number of research projects, the technology has not yet become a standard feature of the digital historical collections curated by libraries, archives and other memory institutions. Moreover, many common and well-researched natural language processing techniques, including topic modeling, have not been sufficiently applied to sources in small or low-resource languages, including Latvian. The paper reports the results of the first case study in which the LDA methodology has been used to analyze a data set of historical newspapers in Latvian. The analysis is conducted on the corpus of the newspaper Latvian Soldier, focusing, as an example, on the performance of the topics related to the first commander of the Latvian army, Oskars Kalpaks. In digital humanities research, the results of topic modeling have been used and interpreted in several distinct ways depending on the type and genre of the text, e.g., to acquire semantically coherent, trustworthy lists of keywords, or to extract lexical features that do not aid thematic analysis but provide other insights into language use. The authors propose applications that could be most suitable for the analysis of historical newspapers in large digital collections of memory institutions, and recount the challenges of working with textual sources that contain optical character recognition errors, problematic article segmentation and other issues pertaining to digitized non-contemporary data.

Keywords: topic modeling, latent Dirichlet allocation, topic coherence, historical newspapers, natural language processing of Latvian, digital humanities, Oskars Kalpaks.


