Digital Humanities in the Nordic Countries 3rd Conference

11:30am - 11:45am
Short Paper (10+5min)

A newspaper atlas: Named entity recognition and geographic horizons of 19th century Swedish newspapers

Erik Edoff

Umeå University

What was the outside world for 19th century newspaper readers? That is the overarching problem investigated in this paper. One way of approaching this issue is to investigate which geographical places were mentioned in the newspapers, and how frequently. Certainly, newspapers were not the only medium that contributed to 19th century readers’ notion of the outside world. Public meetings, novels, sermons, edicts, travelers, photography, and chapbooks are other forms of media that people encountered with growing regularity during the century; however, newspapers often covered the sermons, printed lists of travelers and attracted readers with serial novels. This means that, at least to some extent, these media are covered in the newspapers’ columns. And after all, newspapers were easier to collect and archive than a public meeting, which makes them an accessible source for the historian.

Two newspapers, digitized by the National Library of Sweden, are analyzed: Tidning för Wenersborgs stad och län (TW) and Aftonbladet (AB). They were chosen based on the different geographical and demographic conditions of their places of publication as well as the papers’ size and circulation. TW was founded in 1848 in the town of Vänersborg, located on the western shore of Lake Vänern, which was connected with the west coast port of Göteborg by the Trollhätte canal, established in 1800. The newspaper was published in about 500 copies once a week (twice a week from 1858) and addressed a local and regional readership. AB was a daily paper founded in Stockholm in 1830 and soon became the leading liberal paper of the Swedish capital, with a great impact on national political discourse. For its time, it was widely circulated (between 5,000 and 10,000 copies) in both Stockholm and the country as a whole. Stockholm was an important seaport on the eastern coast. These geographic distinctions probably mean interesting differences in the papers’ respective outlooks. Steamboats revolutionized travel during the first half of the century, but their glory days had passed by around 1870, when railways replaced them as the most prominent means of transporting people.

This paper compares the geographies of the two newspapers by analyzing the places mentioned in the periods 1848–1859 and 1890–1898. The main railroads of Sweden were constructed during the 1860s, and the selected years therefore cover newspaper geographies before and after the railroads.

The main questions the paper addresses relate to media history and the history of media infrastructure. During the second half of the 19th century several infrastructure technologies were introduced and developed (electric telegraph, postal system, newsletter corporations, railways, telephony, among others). The hypothesis is that these technologies had an impact on the newspapers’ geographies. The media technologies enabled information to travel great distances in short timespans, which could have had homogenizing effects on newspaper content, as much traditional research suggests (Terdiman 1999). On the other hand, digital historical research has shown that the development of railroads changed the geography of Houston newspapers, increasing the importance of the near region rather than concentrating geographic information on national centers (Blevins 2014).

The goal of the study is, in other words, to investigate what the infrastructural novelties introduced during the course of the 19th century, as well as the different geographic and demographic conditions, meant for the view of the outside world, or the imagined geographies, provided by newspapers. The aim of the paper is therefore twofold: (1) to investigate a historical-geographical problem relating to newspaper coverage and infrastructural change and (2) to try out the use of Named Entity Recognition on Swedish historical newspaper data.

Named Entity Recognition (NER) software is designed to locate and tag entities, such as persons, locations, and organizations. This paper uses SweNER to mine the data for locations mentioned in the text (Kokkinakis et al. 2014). Earlier research has emphasized the problems caused by poor OCR scanning of historical newspapers. An image of a newspaper page is read by OCR software and converted into a text file. The result contains many misreadings and therefore a considerable amount of noise (Jarlbrink & Snickars 2017). This is a major obstacle when working with digital tools on historical newspapers. Some earlier research has used and evaluated the performance of different NER tools on digitized historical newspapers, likewise underlining OCR errors as the main problem with using NER on such data (Kettunen et al. 2017). SweNER has also been evaluated in tagging named entities in historical Swedish novels, where the OCR problems are negligible (Borin et al. 2007). This paper, however, does not evaluate the software’s results in a systematic way, even though some important biases have been identified by manually going through the tagging of some newspaper issues. Some important geographic entities are not tagged by SweNER at all (e.g. Paris, Wien [Vienna], Borås and Norge [Norway]). SweNER is able to pick up some OCR misreadings, although many recurring ones (e.g. Lübeck read as Liibeck, Liibcck, Ltjbeck, Ltlbeck) are not tagged. These problems can be handled, at least to some degree, by using “leftovers” from the data: wrongly spelled words that were not matched in a comparison corpus. I have manually scanned the 50,000 most frequent words that were not matched in the comparison corpus, looking for wrongly spelled place names. I ended up with a list of around 1,000 places and some 2,000 spelling variations (e.g. over 100 ways of spelling Stockholm). This manually constructed list could be used as a gazetteer, complementing the NER result and giving a more accurate picture of the 19th century newspaper geographies.
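The gazetteer idea can be illustrated with a minimal sketch. It is not the paper's actual pipeline: the dictionary layout, the function names `build_lookup` and `count_places`, and the sample sentence are assumptions made for illustration. Only the Lübeck misreadings are taken from the text above.

```python
import re
from collections import Counter

# Hypothetical gazetteer: canonical place name -> known spelling variants.
# The Lübeck variants are the OCR misreadings quoted in the abstract;
# the data structure itself is an assumption.
GAZETTEER = {
    "Lübeck": ["Lübeck", "Liibeck", "Liibcck", "Ltjbeck", "Ltlbeck"],
    "Paris": ["Paris"],
    "Borås": ["Borås"],
}


def build_lookup(gazetteer):
    """Invert the gazetteer into a lowercase variant -> canonical name map."""
    return {
        variant.lower(): canonical
        for canonical, variants in gazetteer.items()
        for variant in variants
    }


def count_places(text, lookup):
    """Count canonical place names for every token found in the lookup."""
    tokens = re.findall(r"\w+", text)
    return Counter(lookup[t.lower()] for t in tokens if t.lower() in lookup)


# Invented sample line, not taken from the newspapers.
sample = "Ångfartyget från Liibeck anlände; bref från Paris och Ltjbeck."
print(count_places(sample, build_lookup(GAZETTEER)))
```

Matching on single tokens keeps the sketch short; multi-word place names would need a phrase matcher, and counts from such a list would have to be reconciled with the NER output to avoid counting the same mention twice.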

REFERENCES

Blevins, C. (2014), “Space, nation, and the triumph of region: A view of the world from Houston”, Journal of American History, Vol. 101, no 1, pp. 122–147.

Borin, L., Kokkinakis, D., and Olsson, L-G. (2007), “Naming the past: Named entity and animacy recognition in 19th century Swedish literature”, Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pp. 1–8, available at: http://spraakdata.gu.se/svelb/pblctns/W07-0901.pdf (accessed October 31 2017).

Jarlbrink, J. and Snickars, P. (2017), “Cultural heritage as digital noise: Nineteenth century newspapers in the digital archive”, Journal of Documentation, Vol. 73, no 6, pp. 1228–1243.

Kettunen, K., Mäkelä, E., Ruokolainen, T., Kuokkala, J., and Löfberg, L. (2017), “Old content and modern tools: Searching named entities in a Finnish OCRed historical newspaper collection 1771–1910”, Digital Humanities Quarterly, (preview) Vol. 11, no 3.

Kokkinakis, D., Niemi, J., Hardwick, S., Lindén, K., and Borin, L. (2014), “HFST-SweNER – A new NER resource for Swedish”, Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC), Reykjavik 26–31 May 2014, pp. 2537–2543.

Terdiman, R. (1999), “Afterword: Reading the news”, Making the news: Modernity & the mass press in nineteenth-century France, Dean de la Motte & Jeannene M. Przyblyski (eds.), Amherst: University of Massachusetts Press.

12:15pm - 12:30pm
Short Paper (10+5min)

Two cases of meaning change in Finnish newspapers, 1820-1910

Antti Kanner

University of Helsinki

In Finland, the 19th century saw the formation of a number of state institutions that came to define the political life of the Grand Duchy and of the subsequent independent republic. Alongside legal, political, economic and social institutions and organisations, Modern Finnish, as an institutionally standardised language, can be seen in this context as one of these institutions. As the majority of the residents of Finland were native speakers of Finnish dialects, adopting Finnish was necessary for the state’s purposes in extending its influence within the borders of the autonomous Grand Duchy. Obviously, the widening domains of use of Finnish also played an important role in the development of Finnish national identity. In the last quarter of the 19th century, Finnish started to gain ground as the language of administrative, legal and political discourses alongside Swedish. It is in this period that we find the crucial conceptual processes that shaped Finnish political history well into the 20th century.

In this paper I will present two related case studies from my doctoral research, where I seek to understand the semantic similarity scores of so-called Semantic Vector Spaces obtained from large historical corpora in terms of linguistic semantics. As historical corpora are collections of past speech acts, the view they provide of the changing meanings of words is influenced as much by pragmatic factors and writers’ intentions as by synchronic semantics. Understanding and explicating the historical context of observed processes is essential when studying temporal dynamics in semantic changes. To this end, I will try to reflect on the theoretical side of my work in the light of cases of historical meaning change. My research falls under the heading of Finnish Language, but is closely related to history and computational linguistics.

The main data for my research comes from the National Library of Finland’s Newspaper Collection, which I use via the KORP service API provided by the Language Bank of Finland. The collection accessible via the API contains nearly all newspapers and periodicals published in Finland from 1771 to 1910. The collection is, however, very heterogeneous, as the press and other forms of printed public discourse in Finnish only developed in Finland during the 19th century. Historical variation in conventions of typesetting, editing and orthography, as well as in the quality of the paper used for printing, makes it very difficult for OCR systems to recognize characters with 100 percent accuracy. Kettunen et al. (2014) estimated that OCR accuracy is actually somewhere between 60 and 80 percent. However, not all problems in the automatic recognition of the data come from OCR problems or even historical spelling variation. Much is also due to linguistic factors: the 19th century saw large scale dialectal, orthographical and lexical variation in written Finnish. To exemplify the scale of variation, when a morphological analyser for Modern Finnish (OMORFI, Pirinen 2015) was used, it could only parse around 60 percent of the wordlist of the Corpus of Early Modern Finnish (CEMF).

Because the results of the automated parser are unreliable and the data is inherently temporally heterogeneous, conducting the study with a methodology robust to these kinds of problems poses a challenge. The approach chosen was to use a number of analyses and see whether their results could be combined to produce a coherent view of the historical change in word use. In addition, simpler and more robust analyses were chosen instead of more advanced and elaborate ones. For example, an analysis similar to topic modelling was conducted using second order collocations (Bertels & Speelman 2014 and Heylen, Wielfaerts, Speelman, Geeraerts 2014) instead of algorithms like LDA (Blei, Ng & Jordan 2003), which are widely used for this purpose. This was because the data contains a highly inflated count of individual types and lemmas resulting from the problems with OCR and morphological analysis. It seemed that, in this specific case at least, LDA was not able to produce sensible topics because the number of hapax legomena per text was so high. The analysis based on second order collocations aimed not at producing a model of a system of topics, as LDA does, but simply at clustering the studied word’s collocates based on their mutual similarities. Likewise, when tracking changes in words’ syntactic positioning tendencies, simple morphological case distribution was used instead of resource-intensive syntactic parsing, which is also sensitive to errors in the data. When the task is to track signals of change, morphological case distributions can be used as sufficient proxies for dependency distributions. This can be done on the grounds that case selection in Finnish is mostly governed by syntax, as case is used to express syntactic relations between, for example, the constituents of nominal phrases or a predicate verb and its arguments (Vilkuna 1989).
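A minimal sketch of clustering by second order collocations follows. It is an illustration under stated assumptions, not the workflow used in the study: the window size, the similarity threshold, the greedy grouping rule and the function names are all hypothetical, and the token lists are invented toy data rather than corpus sentences. Each collocate of the target word is represented by its own collocates (its second order profile), and collocates with similar profiles are grouped together.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Map each word to a Counter of words seen within `window` tokens of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_collocates(target, sentences, window=2, threshold=0.5):
    """Greedily group the target's collocates by profile similarity."""
    counts = cooccurrence_counts(sentences, window)
    collocates = sorted(counts[target])
    # Second order profile: the collocates of each collocate, with the
    # target itself removed so that it does not dominate the similarity.
    profiles = {
        c: Counter({w: n for w, n in counts[c].items() if w != target})
        for c in collocates
    }
    clusters = []
    for c in collocates:
        for cluster in clusters:
            if any(cosine(profiles[c], profiles[m]) >= threshold for m in cluster):
                cluster.append(c)
                break
        else:
            clusters.append([c])
    return clusters


# Invented toy token lists (not grammatical sentences, not corpus data).
toy = [
    ["vaivainen", "apu", "kunta"],
    ["vaivainen", "hoito", "kunta"],
    ["vaivainen", "sielu", "armo"],
    ["vaivainen", "henki", "armo"],
]
print(cluster_collocates("vaivainen", toy, window=1))
```

Because the grouping relies only on raw co-occurrence counts, it needs no morphological analysis or topic model, which is in the spirit of the robustness argument above. A real study would probably use an association measure and a proper clustering algorithm instead of the greedy rule.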

The first of my case studies focuses on the Finnish word maaseutu. Maaseutu denotes a rural area, but in Modern Finnish it is mostly used as a collective singular referring to the rural as a whole. It is most commonly used as an opposite to the urban, which is often lexicalised as kaupunki, the city, used in a similar collective meaning. However, after its introduction to Finnish in the 1830s, maaseutu was used in a variety of related meanings, mostly referring to specific rural areas or communities, until the turn of the century, by which time the collective singular sense had become dominant. Starting roughly from the 1870s, however, there seems to have been a period of competing uses. In that period we find a number of vague cases where the generic or collective and the specific meanings overlap.

Combining the results of my analyses with newspaper metadata yields a picture of a dynamic situation. The emergence of the collective singular stands out clearly and is connected to an accompanying discourse negotiating urban-rural relations on a national rather than regional level. This change can be pinpointed quite precisely to the 1870s and to the newspapers with geographically wider circulation and a more national identity.
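How a shift in morphological case distributions might be dated can be sketched as follows. The abstract does not state how the change was pinpointed, so this is only one possible approach: Jensen-Shannon divergence is a measure chosen here for illustration, and the function names, case tags and decade keys are assumptions.

```python
import math
from collections import Counter


def case_distribution(case_tags):
    """Turn a non-empty list of case tags into relative frequencies."""
    counts = Counter(case_tags)
    total = sum(counts.values())
    return {case: n / total for case, n in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    keys = p.keys() | q.keys()
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of `a` from the mixture `m`.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def largest_shift(tags_by_period):
    """Return the pair of consecutive periods whose distributions differ most.

    Assumes at least two periods, each with at least one tag.
    """
    periods = sorted(tags_by_period)
    dists = [case_distribution(tags_by_period[p]) for p in periods]
    shifts = [
        (jensen_shannon(a, b), (p1, p2))
        for a, b, p1, p2 in zip(dists, dists[1:], periods, periods[1:])
    ]
    return max(shifts)[1]


# Invented case tags per decade, for illustration only.
toy = {
    1860: ["INE", "INE", "GEN", "NOM"],
    1870: ["INE", "INE", "GEN", "NOM"],
    1880: ["NOM", "NOM", "NOM", "GEN"],
}
print(largest_shift(toy))
```

A divergence computed over case tags stays usable when many word forms are damaged by OCR, since it depends only on the tokens for which a case could be identified.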

The second word of interest is vaivainen, an adjective referring to a person or a thing either being of wretched or inadequate quality or suffering from a physical or mental ailment. When used as a noun, it refers to a person of very low and excluded social status and extreme poverty. In this meaning the word appears in Modern Finnish mostly in poetically archaic or historical contexts, and it disappeared from the vocabulary of social policy and social legislation as early as the early 20th century. The word has a biblical background, being used in older Finnish Bible translations, for example in the Sermon on the Mount (as the equivalent of poor in Matt. 5:3, “blessed are the poor in spirit”), and as such was a natural choice for naming the recipients of church charity. When the state poverty relief system started to take form in the mid 19th century, it was built on top of earlier church organizations (Von Aerschot 1996), and the church terminology was carried over to the state institutions.

When tracking the contexts of the word over the 19th century using context word clusters based on second order collocations, two clear discursive trends appear. First, the poverty relief discourse, which is pronounced in the data already in the 1860s, disperses into a complex network of different topics and discursive patterns. As the state-run poverty relief institutions become more complex and more efficiently administered, the moral foundations of the whole enterprise are discussed alongside reports of the everyday comings and goings of individual institutions or, indeed, tales of individual relief recipients’ fortunes. The other trend involves the presence of religious or spiritual discourse, which, against preliminary assumptions, does not fade into the background but experiences a strong surge in the 1870s and 1880s. This can be explained in part by the growth of revivalist Christian publications in the National Library Corpus, but also by the intrusion of Christian connotations into the political discussion on the poverty relief system. It is as if the word vaivainen functions as a kind of lightning rod conducting Christian morality into the public poverty relief discourse.

While the methodological contributions of this paper are not highly ambitious in terms of the language technology or computational algorithms used, the selection of analyses presents an innovative approach to Digital Humanities. The aim here has been to combine not just one, but an array of simple and robust methods from computational linguistics with a theoretical background and analytical concepts from lexical semantics. I argue that the robustness and simplicity of the methods make the overall workflow more transparent, and this transparency makes it easier to interpret the results in a wider historical context. This makes it possible to ask questions whose relevance is not confined to computational linguistics or lexical semantics, but extends to wider areas of Humanities scholarship. This shared relevance of questions and answers, to my understanding, lies at the core of Digital Humanities.

References

Bertels, A. & Speelman, D. (2014). “Clustering for semantic purposes. Exploration of semantic similarity in a technical corpus.” Terminology 20:2, pp. 279–303. John Benjamins Publishing Company.

Blei, D., Ng, A. Y. & Jordan, M. I. (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (4–5), pp. 993–1022.

CEMF, Corpus of Early Modern Finnish. Centre for Languages in Finland. http://kaino.kotus.fi

Heylen, C., Peirsman, Y., Geeraerts, D. & Speelman, D. (2008). “Modelling Word Similarity: An Evaluation of Automatic Synonymy Extraction Algorithms.” Proceedings of LREC 2008.

Huhtala, H. (1971). Suomen varhaispietistien ja rukoilevaisten sanankäytöstä: semanttis-aatehistoriallinen tutkimus. [On the vocabulary of the early Finnish pietist and revivalist movements]. Suomen Teologinen Kirjallisuusseura.

Kettunen, K., Honkela, T., Lindén, K., Kauppinen, P., Pääkkönen, T. & Kervinen, J. (2014). “Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods”. In IFLA World Library and Information Congress Proceedings: 80th IFLA General Conference and Assembly. Lyon, France.

Pirinen, T. (2015). “Omorfi – Free and open source morphological lexical database for Finnish”. In Proceedings of the 20th Nordic Conference of Computational Linguistics NODALIDA 2015.

Vilkuna, M. (1989). Free word order in Finnish: Its syntax and discourse functions. Suomalaisen Kirjallisuuden Seura.

Von Aerschot, P. (1996). Köyhät ja laki: toimeentulotukilainsäädännön kehitys oikeudellistumisprosessien valossa. [The poor and the law: development of Finnish welfare legislation in light of juridification processes.] Suomalainen Lakimiesyhdistys.

Digital Humanities in the Nordic Countries
3rd Conference

7–9 March 2018, Helsinki