Science, Technologies, Innovations №1(21) 2022, 29-37


http://doi.org/10.35668/2520-6524-2022-1-05

Anda Baklāne — Master of Philosophy, Researcher and the Head of Digital Research Services at the Department of the Development of Digital Services, at the National Library of Latvia, 3, Mūkusalas Str., Riga, Latvia, LV-1423; +(371)67806100; anda.baklane@lnb.lv; ORCID: 0000-0002-0301-2504

Valdis Saulespurēns — Master of Computer Science, Researcher and data analyst at the Technology Department of the National Library of Latvia, 3, Mūkusalas Str., Riga, Latvia, LV-1423; +(371)67806100; valdis.saulespurens@lnb.lv; ORCID: 0000-0002-9665-0125

THE APPLICATION OF LATENT DIRICHLET ALLOCATION FOR THE ANALYSIS OF LATVIAN HISTORICAL NEWSPAPERS: OSKARS KALPAKS’ CASE STUDY

Abstract. In the last 20 years, topic modeling, and the latent Dirichlet allocation (LDA) model in particular, has become one of the most commonly used techniques for exploratory analysis of textual sources and for information retrieval from them. Although topic modeling has been used in a large number of research projects, the technology has not yet become a standard functionality of the digital historical collections curated by libraries, archives and other memory institutions. Moreover, many common and well-researched natural language processing techniques, including topic modeling, have not been sufficiently applied to sources in small or low-resource languages, including Latvian. The paper reports the results of the first case study in which the LDA methodology has been used to analyze a data set of historical newspapers in Latvian. The analysis is conducted on the corpus of the newspaper Latvian Soldier and focuses, as an example, on the performance of the topics related to Oskars Kalpaks, the first commander of the Latvian army. In digital humanities research, the results of topic modeling have been used and interpreted in several distinct ways depending on the type and genre of the text, e.g., to acquire semantically coherent, trustworthy lists of keywords, or to extract lexical features that do not aid thematic analysis but instead provide other insights into language usage. The authors propose applications that could be most suitable for the analysis of historical newspapers in the large digital collections of memory institutions, and recount the challenges of working with textual sources that contain optical character recognition (OCR) errors, problematic article segmentation and other issues pertaining to digitized non-contemporary data.

Keywords: topic modeling, latent Dirichlet allocation, topic coherence, historical newspapers, natural language processing of Latvian, digital humanities, Oskars Kalpaks.
