Conference Agenda

Session
F-PII-2: Computational Linguistics 2
Time:
Friday, 09/Mar/2018:
4:00pm - 5:30pm

Session Chair: Risto Vilkko
Location: PII

Presentations
4:00pm - 4:30pm
Long Paper (20+10min) [publication ready]

Verifying the Consistency of the Digitized Indo-European Sound Law System Generating the Data of the 120 Most Archaic Languages from Proto-Indo-European

Jouna Pyysalo1, Mans Hulden2, Aleksi Sahala1

1University of Helsinki; 2University of Colorado Boulder

Using state-of-the-art finite-state technology (FST) we automatically generate the data of some 120 of the most archaic Indo-European (IE) languages from reconstructed Proto-Indo-European (PIE) by means of digitized sound laws. The accuracy rate of the automatic generation exceeds 99%, which also holds for the generation of new data that were not observed when the rules representing the sound laws were originally compiled. After testing and verifying the consistency of the sound law system with regard to the IE data and the PIE reconstruction, we report the following results:

a) The consistency of the digitized sound law system generating the data of the 120 most archaic Indo-European languages from Proto-Indo-European is verifiable.

b) The primary objective of Indo-European linguistics, a reconstruction theory of PIE in essence equivalent to the IE data (except for a limited set of open research problems), has been provably achieved.

The results are fully explicit, repeatable, and verifiable.
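The abstract does not spell out how a digitized sound law is encoded, but conceptually it is a context-dependent rewrite rule of the kind that finite-state toolkits (e.g. foma) compile into transducers. The following is only a schematic Python sketch of applying an ordered cascade of such rules to a reconstructed form; the rules and the input form are simplified illustrations, not the project's actual rule set.

# Schematic sketch: sound laws as ordered rewrite rules applied to a
# reconstructed form. Real systems compile such rules into finite-state
# transducers; rules and forms here are toy examples for illustration only.
import re

# Each rule is a (pattern, replacement) pair, applied in order.
# Toy Grimm's-law-style shifts of voiceless stops to fricatives.
SOUND_LAWS = [
    (r"p", "f"),
    (r"t", "þ"),
    (r"k", "h"),
]

def apply_sound_laws(proto_form: str) -> str:
    form = proto_form
    for pattern, replacement in SOUND_LAWS:
        form = re.sub(pattern, replacement, form)
    return form

print(apply_sound_laws("*treies"))   # -> *þreies (schematic only)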

Pyysalo-Verifying the Consistency of the Digitized Indo-European Sound Law System Generating the Data of the_a.pdf

4:30pm - 4:45pm
Short Paper (10+5min) [publication ready]

Towards Topic Modeling Swedish Housing Policies: Using Linguistically Informed Topic Modeling to Explore Public Discourse

Anna Lindahl1, Love Börjeson2

1University of Gothenburg; 2Graduate School of Education, Stanford University

This study examines how topic modeling can be applied to explore the public discourse on Swedish housing policies, as represented by documents from the Swedish parliament and Swedish news texts. This area is relevant to study because of the current housing crisis in Sweden.

Topic modeling is an unsupervised method for finding topics in large collections of data, which makes it suitable for examining public discourse. However, most studies that employ topic modeling make little use of linguistic information when preprocessing the data. Therefore, this work also investigates what effect linguistically informed preprocessing has on topic modeling. Through human evaluation, filtering the data based on part of speech is found to have the largest effect on topic quality. Non-lemmatized topics are rated higher than lemmatized topics, and topics from the filters based on dependency relations receive low ratings.
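As a rough illustration of the kind of pipeline described above (the abstract does not specify the authors' implementation), the following sketch trains an LDA topic model with gensim on documents that have already been POS-tagged; the tag set, the POS filter, and the tiny placeholder documents are assumptions for the example, not project data or settings.

# Sketch only: LDA topic modeling with a simple part-of-speech filter.
# Assumes each document is a list of (token, pos) pairs from some Swedish
# tagger; the documents below are placeholders.
from gensim import corpora, models

tagged_docs = [
    [("bostad", "NOUN"), ("byggas", "VERB"), ("snabbt", "ADV")],
    [("hyra", "NOUN"), ("öka", "VERB"), ("stad", "NOUN")],
]

KEEP_POS = {"NOUN", "VERB", "ADJ"}          # linguistically informed filter
docs = [[tok for tok, pos in doc if pos in KEEP_POS] for doc in tagged_docs]

dictionary = corpora.Dictionary(docs)        # map tokens to integer ids
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)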

Lindahl-Towards Topic Modeling Swedish Housing Policies-256_a.pdf

4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

Embedded words in the historiography of technology and industry, 1931–2016

Johan Jarlbrink, Roger Mähler

University of Umeå, Sweden

From 1931 to 2016 the Swedish National Museum of Science and Technology published a yearbook, Dædalus. The 86 volumes display a great diversity of industrial heritage and cultures of technology. The first volumes centered on heavy industry, such as mining and paper plants located in northern and central Sweden. The last volumes were dedicated to technologies and products in people’s everyday lives – lipsticks, microwave ovens, and skateboards. Over the years, Dædalus has covered topics ranging from individual inventors to world fairs, media technologies from print to computers, and agricultural developments from ancient farming tools to modern DNA analysis. The yearbook presents the history of industry, technology and science, but can also be read as a historiographical source reflecting shifting approaches to history over a period of more than 80 years. Dædalus was recently digitized and can now be analyzed with the help of digital methods.

The aim of this paper is twofold: to explore the possibilities of word embedding models within a humanities framework, and to examine the Dædalus yearbook as a historiographical source with such a model. What we will present is work in progress, with no definitive findings to show at the time of writing. Yet we have a general idea of what we would like to accomplish. Analyzing the yearbook as a historiographical source means that we are interested in what kinds of histories it represents, its focus and bias. We follow Ben Schmidt’s (admittedly simplified) suggestion that word embedding models for textual analysis can be viewed and used as supervised topic model tools (Schmidt, 2015). If words are defined by the distribution of the vocabulary of their contexts, we can calculate relations between words and explore fields of related words as well as binary relations in order to analyze their meaning. Simple – and yet fundamental – questions can be asked: What is “technology” in the context of the yearbook? What is “industry”? Of special interest in the case of industrial and technological history are binaries such as rural/urban, man/woman, industry/handicraft, production/consumption, and nature/culture. Which words are close to “man”, and which are close to “woman”? Which aspects of the history of technology and industry are related to “production” and which are related to “consumption”?

Word embedding refers to a comparatively new set of tools and techniques within data science and natural language processing (NLP) that have in common that the words in the vocabulary of a corpus (or several corpora) are assigned numerical representations through some computation, of which there is a wide variety. In most cases this comes down to not only mapping the words to numerical vectors, but doing so in such a way that the numerical values in the vectors reflect the contextual similarities between words. The computations are based on the distributional hypothesis stemming from Harris (1954), stating that “words which are similar in meaning occur in similar contexts” (Rubenstein & Goodenough, 1965). The words are embedded (positioned) in a high-dimensional space, each word represented by a vector in that space, i.e. a simple representational model based on linear algebra. The dimension of the space is defined by the size of the vectors, and the similarity between words then becomes a matter of computing the difference between vectors in this space, for instance the difference in (Euclidean) distance or the difference in direction between the vectors (cosine similarity). Within vector space models the latter is the most popular, under the assumption that related words tend to have similar directions. Arguably the most prominent and popular of these algorithms, and the one that we have used, is the skip-gram model of Word2Vec (Mikolov et al., 2013). In short, this model uses a neural network to compute the word vectors as a result of training the network to predict the probabilities of all the words in the vocabulary being nearby (as defined by a window size) a certain word in focus.
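The abstract includes no code, but as a rough sketch of the setup it describes, a skip-gram model can be trained and queried with the gensim library roughly as follows; the toy corpus, tokenization, and hyperparameters are assumptions for the example, not the project's actual settings.

# Sketch: training a skip-gram Word2Vec model with gensim (version >= 4.0)
# and measuring cosine similarity between word vectors. The corpus below is
# a placeholder; in practice it would be tokenized sentences from Dædalus.
from gensim.models import Word2Vec

sentences = [
    ["gruva", "järn", "industri"],
    ["teknik", "uppfinnare", "industri"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the embedding space
    window=5,          # context window size
    min_count=1,       # keep rare words in this tiny toy corpus
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

# Cosine similarity between two word vectors.
print(model.wv.similarity("industri", "teknik"))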

An early evaluation shows that the model performs well. Standard calculations often used to evaluate performance and accuracy indicate that we have implemented the model correctly – we can indeed get the correct answers to equations such as “Paris - France + Italy = Rome” (Mikolov et al., 2013). In our case we were looking for most_similar(positive=['sverige','oslo'], negative=['stockholm']), and the most similar word was indeed “norge”. We have also explored simple word similarity in order to evaluate the model and get a better understanding of our corpus. What remains to be done is to identify relevant words (or groups of words) that can be used when we examine “topics” and binary dimensions in the corpus. We are also experimenting with different ways to cluster and visualize the data. Although some work remains to be done, we will definitely have results to present at the time of the conference.
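The analogy query quoted above corresponds directly to gensim's most_similar interface; a minimal sketch, assuming a trained model named model as in the previous example:

# Sketch: the 'Stockholm is to Sweden as Oslo is to ?' analogy from the text.
# Requires a model trained on a corpus that actually contains these words.
result = model.wv.most_similar(positive=["sverige", "oslo"],
                               negative=["stockholm"], topn=1)
print(result)   # expected to rank "norge" highest on the authors' corpus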

Harris, Zellig (1954). Distributional structure. Word, 10(23):146–162.

Mikolov, Tomas, Chen, Kai, Corrado, Greg & Dean, Jeffrey (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Rubenstein, Herbert & Goodenough, John (1965). Contextual Correlates of Synonymy. Communications of the ACM, 8(10): 627-633.

Schmidt, Ben (2015). Word Embeddings for the digital humanities. Blog post at http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html.

Jarlbrink-Embedded words in the historiography of technology and industry, 1931–2016-223_a.pdf

5:00pm - 5:15pm
Short Paper (10+5min) [abstract]

Revisiting the authorship of Henry VIII’s Assertio septem sacramentorum through computational authorship attribution

Marjo Kaartinen, Aleksi Vesanto, Anni Hella

University of Turku

Undoubtedly, one of the great unsolved mysteries of Tudor history through the centuries has been the authorship of Henry VIII’s famous treatise Assertio septem sacramentorum adversus Martinum Lutherum (1521). The question of its authorship intrigued contemporaries as early as the 1520s. With the Assertio, Henry VIII gained from the Pope the title Defender of the Faith, which British monarchs still use. Because of the exceptional importance of the text, the question of its authorship is far from irrelevant to the study of history.

For various reasons and motivations of their own, many doubted the king’s authorship, and the discussion has continued to the present day. A number of possible authors have been named, Thomas More and John Fisher foremost among them. There is no clear consensus about the authorship in general, nor is there clear agreement on the extent of the King’s role in the writing process in the cases where joint authorship is suggested. The most widely shared conclusion is that the King was helped to a greater or lesser degree in the writing process and that the authorship of the work was thus shared at least to some extent: that is, even if Henry VIII was active in the writing of the Assertio, he was not the sole author but was assisted by someone or by a group of theological scholars.

In the case of the Assertio, the Academy of Finland-funded consortium Profiling Premodern Authors (PROPREAU) has tackled the difficult Latin source situation and put effort into developing more efficient machine learning methods for authorship attribution in cases where large training corpora are not available. This paper will present the latest discoveries in the development of such tools and will report on the results. These will give historians tools for opening up a myriad of questions we have hitherto been unable to answer. It is of great significance for the whole discipline of history to be able to attribute authors to texts that are anonymous or of disputed origin.
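The abstract does not describe PROPREAU's method, so the following is only a generic baseline sketch of closed-set computational authorship attribution (character n-gram features with a linear classifier, using scikit-learn); the candidate labels and texts are placeholders, and this is not the consortium's approach.

# Generic baseline sketch, not the PROPREAU method: character n-gram
# features with a linear classifier for closed-set authorship attribution.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: securely attributed texts by each candidate.
train_texts = [
    "placeholder text securely attributed to candidate A ...",
    "placeholder text securely attributed to candidate B ...",
    "placeholder text securely attributed to candidate C ...",
]
train_authors = ["candidate_A", "candidate_B", "candidate_C"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_authors)

# Predict the most likely candidate for a disputed passage.
print(clf.predict(["placeholder text of a disputed passage ..."]))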

Select Bibliography:

Betteridge, Thomas: Writing Faith and Telling Tales: Literature, Politics, and Religion in the Work of Thomas More. University of Notre Dame Press 2013.

Brown, J. Mainwaring: Henry VIII.’s Book, “Assertio Septem Sacramentorum,” and the Royal Title of “Defender of the Faith”. Transactions of the Royal Historical Society 1880, 243–261.

Nitti, Silvana: Auctoritas: l’Assertio di Enrico VIII contro Lutero. Studi e testi del Rinascimento europeo. Edizioni di storia e letteratura 2005.

Kaartinen-Revisiting the authorship of Henry VIII’s Assertio septem sacramentorum through computational a_a.docx