Conference Agenda

Session
T-PIII-3: Computational Literary Analysis
Time:
Thursday, 08/Mar/2018:
4:00pm - 5:30pm

Session Chair: Mads Rosendahl Thomsen
Location: PIII

Presentations
4:00pm - 4:30pm
Long Paper (20+10min)

A Computational Assessment of Norwegian Literary “National Romanticism”

Ellen Rees

University of Oslo

In this paper, I present findings derived from a computational analysis of texts designated as “National Romantic” in Norwegian literary historiography. The term “National Romantic,” which typically designates literary works from approximately 1840 to 1860 that are associated with national identity formation, first appeared decades later, in Henrik Jæger’s Illustreret norsk litteraturhistorie from 1896. Cultural historian Nina Witoszek has on a number of occasions written critically about the term, claiming that it is misleading because the works it denotes have little to do with larger international trends in Romanticism (see especially Witoszek 2011). Yet, with the exception of a 1985 study by Asbjørn Aarseth, it has never been interrogated systematically in the way that other period designations such as “Realism” or “Modernism” have. Nor does Aarseth’s investigation attempt to delimit a definitive National Romantic corpus or account for the remarkable disparity among the works that are typically associated with the term. “National Romanticism” is like pornography—we know it when we see it, but it is surprisingly difficult to delineate in a scientifically rigorous way.

Together with computational linguist Lars G. Johnsen and research assistants Hedvig Solbakken and Thomas Rasmussen, I have prepared a corpus of 217 texts that are mentioned in connection with “National Romanticism” in the major histories of Norwegian literature and textbooks for upper secondary instruction in Norwegian literature. I will discuss briefly some of the logistical challenges associated with preparing this corpus.

This corpus forms the point of departure for a computational analysis employing various text-mining methods in order to determine to what degree the texts most commonly associated with “National Romanticism” share significant characteristics. In the popular imagination, the period is associated with folkloristic elements such as supernatural creatures (trolls, hulders), rural farming practices (shielings, herding), and folklife (music, rituals) as well as nature motifs (birch trees, mountains). We therefore employ topic modeling in order to map the frequency and distribution of such motifs across time and genre within the corpus. We anticipate that topic modeling will also reveal unexpected results beyond the motifs most often associated with National Romanticism. This process should prepare us to take the next step and, inspired by Matthew Wilkens’ recent work generating “clusters” of varieties within twentieth-century U.S. fiction, create visualizations of similarities and differences among the texts in the National Romanticism corpus (Wilkens 2016).

Based on these initial computational methods, we hope to be able to answer some of the following literary historical questions:

• Are there identifiable textual elements shared by the texts in the National Romantic canon?

• What actually defines a National Romantic text as National Romantic?

• Do these texts cluster in a meaningful way chronologically?

• Is “National Romanticism” in fact meaningful as a period designation, or alternately as a stylistic designation?

• Are there other texts that share these textual elements that are not in the canon?

• If so, why? Do gender, class or ethnicity have anything to do with it?

To answer the last two questions, we need to use the “National Romanticism” corpus as a sub-corpus and “trawl-line” within the full corpus of nineteenth-century Norwegian textual culture, carrying out sub-corpus topic modeling (STM) in order to determine where similarities with texts from outside the period 1840–1860 arise (Tangherlini and Leonard 2013). For the sake of expediency, we use the National Library of Norway’s Digital Bookshelf as our full corpus, though we are aware that there are significant subsets of Norwegian textual culture that are not yet included in this corpus. Despite certain limitations, the Digital Bookshelf is one of the most complete digital collections of a national textual culture currently available.

For the purposes of DHN 2018, this project might best be categorized as an exploration of cultural heritage, understood in two ways. On the one hand, the project is entirely based on the National Library of Norway’s Digital Bookshelf platform, which, as an attempt to archive as much as possible of Norwegian textual culture in a digital and publicly accessible archive, is in itself a vehicle for preserving cultural heritage. On the other hand, the concept of “National Romanticism” is arguably the most widespread, but least critically examined means of linking cultural heritage in Norway to a specifically nationalist agenda.

References:

Jæger, Henrik. 1896. Illustreret norsk litteraturhistorie. Bind II. Kristiania: Hjalmar Biglers forlag.

Tangherlini, Timothy R. and Peter Leonard. 2013. “Trawling in the Sea of the Great Unread: Sub-Corpus Topic Modeling and Humanities Research.” Poetics 41.6: 725–749.

Wilkens, Matthew. 2016. “Genre, Computation, and the Varieties of Twentieth-Century U.S. Fiction.” CA: Journal of Cultural Analytics (online open-access)

Witoszek, Nina. 2011. The Origins of the “Regime of Goodness”: Remapping the Cultural History of Norway. Oslo: Universitetsforlaget.

Aarseth, Asbjørn. 1985. Romantikken som konstruksjon: tradisjonskritiske studier i nordisk litteraturhistorie. Bergen: Universitetsforlaget.


4:30pm - 4:45pm
Short Paper (10+5min)

Prose Rhythm in Narrative Fiction: the case of Karin Boye's Kallocain

Carin Östman, Sara Stymne, Johan Svedjedal

Uppsala University

Swedish author Karin Boye’s (1900-1941) last novel Kallocain (1940) is an icily dystopian depiction of a totalitarian future. The protagonist Leo Kall first embraces this system, but for various reasons rebels against it. The peripety comes when he gives a public speech, questioning the State. It has been suggested (by the linguist Olof Gjerdman) that the novel – which is narrated in the first-person mode – from exactly this point on is characterized by a much freer rhythm (Gjerdman 1942). This paper sets out to test this hypothesis, moving on from a discussion of the concept of rhythm in literary prose to an analysis of various indicators in different parts of Kallocain and Boye’s other novels.

Work on this project started just a few weeks ago. So far we have performed preliminary experiments with simple text quality indicators, such as word length, sentence length, and the proportion of punctuation marks. For all these indicators we have compared the first half of the novel, up until the speech, with the second half of the novel and, as a contrast, with the "censor's addendum", a short last chapter of the novel written by an imaginary censor. For most of these indicators we find no differences between the two major parts of the novel. The only result that points to a stricter rhythm in the first half is that the proportion of long words, counted both in characters and in syllables, is considerably higher there. For instance, the percentage of words with at least five syllables is 1.85% in the first half and 1.03% in the second half.
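The surface indicators named above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `text_indicators`, the regex-based sentence splitting, and the character threshold for "long" words are assumptions made for illustration, and syllable counting is left out because it would need a Swedish-specific syllabifier.

```python
import re

def text_indicators(text, long_word_chars=10):
    """Compute simple surface indicators for a stretch of prose.

    Hypothetical helper: tokenization is deliberately naive and the
    threshold for a 'long' word is an assumed parameter.
    """
    # Split after ., ! or ? followed by whitespace (ellipses and
    # abbreviations would need a proper sentence tokenizer).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # \w is Unicode-aware in Python 3, so å, ä and ö count as letters.
    words = re.findall(r"\w+", text)
    punctuation = re.findall(r"[^\w\s]", text)
    if not words:
        raise ValueError("text contains no words")
    n_words = len(words)
    return {
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "mean_sentence_length": n_words / len(sentences),
        "punctuation_per_word": len(punctuation) / n_words,
        "long_word_share": sum(len(w) >= long_word_chars for w in words) / n_words,
    }

# Usage: compute the same indicators for each part of the novel and compare.
first_half = "Jag talar. Du lyssnar inte!"
print(text_indicators(first_half, long_word_chars=7))
```

Running the function separately on the first half, the second half, and the addendum would give one dictionary per part, which can then be compared indicator by indicator.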

The other indicators that show a difference do not support the hypothesis, however. In the first half, the sections are shorter, there are proportionally more speech utterances, and there is a higher proportion of three consecutive dots (...), which are often used to mark hesitation. If we compare these two halves to the censor's addendum, however, we can clearly see that the addendum is written in a stricter way, with, for instance, a considerably higher proportion of long words (4.90% of the words have more than five syllables) and sentences more than twice as long.

In future analysis, we plan to use more fine-tuned indicators, based on a dependency parse of the text, from which we can explore issues like phrase length and the proportion of sub-clauses. Separating out speech from non-speech also seems important. We also plan to explore the variation in our indicators, rather than just looking at averages, since this has been suggested in literature on rhythm in Swedish prose (Holm 2015).

Through this initial analysis we have also learned about some of the challenges of analyzing literature. For instance, it is not straightforward to separate speech from non-speech, since the ends of utterances are often not clearly marked in Kallocain, and free indirect speech is sometimes used. We think this separation would be important for future analysis, as would attribution of speech (Elson & McKeown, 2010), since the speech of the different protagonists cannot be expected to vary in the two parts to the same degree.

References

Boye, Karin (1940) Kallocain: roman från 2000-talet. Stockholm: Bonniers.

Elson, David K. and McKeown, Kathleen R. (2010) Automatic Attribution of Quoted Speech in Literary Narrative. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. The AAAI Press, Menlo Park, pp 1013–1019.

Gjerdman, Olof (1942) Rytm och röst. In Karin Boye. Minnen och studier. Ed. by M. Abenius and O. Lagercrantz. Stockholm: Bonniers, pp 143–160.

Holm, Lisa (2015) Rytm i romanprosa. In Det skönlitterära språket. Ed. by C. Östman. Stockholm: Morfem, pp 215–235.

Authors: Sara Stymne, Johan Svedjedal, Carin Östman (Uppsala University)


4:45pm - 5:00pm
Short Paper (10+5min)

The Dostoyevskian Trope: State Incongruence in Danish Textual Cultural Heritage

Kristoffer Laigaard Nielbo, Katrine Frøkjær Baunvig

University of Southern Denmark

In the history of prolific writers, we are often confronted with the figure of the suffering or tortured writer. Setting aside metaphysical theories, the central claim seems to be that a state-incongruent dynamic is an intricate part of the creativity process. Two propositions can be derived from this claim: (1) the creative state is inversely proportional to the emotional state, and (2) the creative state is causally predicted by the emotional state. We call this creative-emotional dynamic ‘The Dostoyevskian Trope’. In this paper we present a method for studying the Dostoyevskian trope in prolific writers. The method combines Shannon entropy as an indicator of lexical density and readability with fractal analysis in order to measure creative dynamics over multiple documents. We generate a sentiment time series from the same documents and test for causal dependencies between the creative and sentiment time series. We illustrate the method by searching for the Dostoyevskian trope in Danish textual cultural heritage, specifically in three highly prolific writers from the 19th century, namely N.F.S. Grundtvig, H.C. Andersen, and S.A. Kierkegaard.
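As an illustration of the entropy component only, the following Python sketch computes the Shannon entropy of the word distribution of each document in a chronologically ordered list, yielding one value per document. The whitespace tokenization, the function names, and the use of whole documents as the unit are assumptions; the abstract does not specify these choices, and the fractal and causal analyses are not shown.

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_series(documents):
    """One entropy value per document, in the order given.

    Assumes `documents` is already sorted chronologically and that
    lowercased whitespace tokens are an acceptable unit.
    """
    return [shannon_entropy(doc.lower().split()) for doc in documents]

# Toy usage with placeholder strings standing in for dated documents.
print(entropy_series(["a b a b", "a b c d"]))
```

A document that repeats few word types scores lower than one that spreads its tokens evenly over many types, which is the sense in which entropy can serve as a rough lexical indicator.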


5:00pm - 5:30pm
Long Paper (20+10min)

Interdisciplinary advancement through the unexpected: Mapping gender discourses in Norway (1840-1913) with Bokhylla

Heidi Karlsen

University of Oslo

Heidi Karlsen, University of Oslo

Ph.D. Candidate in literature, Cand.philol. in philosophy

This presentation discusses challenges related to sub-corpus topic modeling in the study of gender discourses in Norway from 1840 till 1913 and the role of interdisciplinary collaboration in this process. Through collaboration with the Norwegian National Library, data-mining techniques are used in order to retrieve data from the digital source, Bokhylla [«the Digital Bookshelf»], for the analysis of women’s «place» in society and the impact of women writers on this discourse. My project is part of the research project «Data-mining the Digital Bookshelf», based at the University of Oslo.

1913, the closing year of the period I study, is the year of women’s suffrage in Norway. I study the impact women writers had on the debate in Norway regarding women’s «place» in society during the approximately 60 years before women were granted the right to vote. A central hypothesis for my research is that women writers in the period had an underestimated impact on gender discourses, especially in defining and loading key words with meaning (drawing mainly on Norman Fairclough’s theoretical framework for discourse analysis). In this presentation, I examine a selection of Swedish writer Fredrika Bremer’s texts, and their impact on gender discourses in Norway.

The Norwegian National Library’s Digital Bookshelf is the main source for the historical documents I use in this project. The Digital Bookshelf includes a vast amount of text published in Norway over several centuries, in a great variety of genres, and thus offers unique access to our cultural heritage. Sub-corpus topic modeling (STM) is the main tool that has been used to process the Digital Bookshelf texts for this analysis. A selection of Bremer’s work has been assembled into a sub-corpus. Topics have been generated from this sub-corpus and then applied to the full Digital Bookshelf corpus. During the process, the collaboration with the National Library has been essential in order to overcome technical challenges. I will reflect upon this collaboration in my presentation. As the data are retrieved and then analyzed by me as a humanities scholar, and as weaknesses in the data are detected, the programmer at the National Library who assists us on the project presents, modifies and develops tools in order to meet our challenges. These tools might in turn offer additional possibilities beyond what they were proposed for. New ideas in my research design may emerge as a result. At the same time, the algorithms created at such a stage in the process might subsequently be useful for scholars in completely different research projects. I will mention a few examples of such mutually productive collaborations, and briefly reflect upon how these issues are related to questions regarding open science.

In this STM process, several challenges have emerged along the way, mostly related to OCR errors. Some illustrative examples of passages with such errors will be presented for the purpose of discussing the measures undertaken to address the problems they give rise to, but also to demonstrate the unexpected progress stemming from these «defective» data. The topics used as a «trawl line»(1) in the initial phase of this study produced few results. Our first attempt to get more results was to lower the required Jaccard similarity(2). This means that the share of a topic that has to be identified in a passage in order for it to qualify as a hit is lowered. Once this required share was lowered, a great number of results were obtained. The obvious weakness of these results, however, is that the rather low required topic match, that is, the relatively low value of the required Jaccard similarity, does not allow us to affirm a connection between these passages and Bremer’s text. Nevertheless, the results have still been useful, for two reasons. Some of the data have proven to be valuable sources for the mapping of gender discourses, although they indicate nothing regarding women writers’ impact on them. Moreover, these passages have served to illustrate many of the varieties of OCR errors that my topic words give rise to in text from the period I study (frequently in Gothic typeface). This discovery has then been used to improve the topics, which takes us to the next step in the process.
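The effect of lowering the required similarity can be illustrated with a small Python sketch. It uses the textbook Jaccard index (size of the intersection divided by size of the union) between the set of topic words and the set of words in a passage. The exact scoring used in the project is not specified here, so the function names, the whitespace tokenization, and the toy passages are assumptions.

```python
def jaccard(a, b):
    """Jaccard index of two collections of words: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def trawl(topic, passages, threshold):
    """Return (score, passage) pairs scoring at or above the threshold,
    best hit first. Hypothetical stand-in for the «trawl line»."""
    hits = []
    for passage in passages:
        score = jaccard(topic, passage.lower().split())
        if score >= threshold:
            hits.append((score, passage))
    return sorted(hits, reverse=True)

topic = {"kvinde", "kjærlighed", "hjem"}
passages = ["kvinde og hjem", "en kvinde gik"]

print(len(trawl(topic, passages, threshold=0.4)))  # strict: 1 hit
print(len(trawl(topic, passages, threshold=0.1)))  # relaxed: 2 hits
```

With the stricter threshold only the passage sharing two topic words is captured; relaxing the threshold also lets in the passage that shares a single word, which is the trade-off between recall and confidence described above.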

In certain documents one and the same word in the original text has, in the scanning of the document, given rise to up to three different OCR errors(3). This discovery indicates the risk of missing potentially relevant documents in the «great unread»(4). If only the correct spelling of the words is included in the topics, potentially valuable documents containing our topic words, bizarrely spelled because of errors in the scanning, might go unnoticed. In an attempt to meet this challenge I have manually added to the topic the different versions of the words that the OCR errors have given rise to (for instance, for the word «kjærlighed» [love]: «kjaerlighed», «kjcerlighed», «kjcrrlighed»). In that case we cannot, when we run the topic model, require a one hundred percent topic match, perhaps not even 2/3, as all these OCR errors of the same word are highly unlikely to occur in every potential match(5). In other words, such extensions of the topics condition our parameterization of the algorithm: the required value of Jaccard similarity for a passage to be captured has to be lowered considerably. The drawback of this approach, however, is the possibly high number of captured passages that are excessively (for our purpose) saturated with the semantic unit in question. Furthermore, if we add to this the different versions of a lexeme and its semantic relatives that in some cases are included in the topic, such as «kvinde», «kvinder», «kvindelig», «kvindelighed» [woman, women, feminine, femininity], the topic in question might catch an even larger number of passages with a high density of this specific semantic unit and its variations, an amount that is not proportional to the overall variety of the topic in question.

This takes us back to the question of what we program the “trawl line” to “require” in order for a passage in the target corpus to qualify as a hit, and also to how the scores are ranked. How many of the words in the topic must occur, and to what extent do several occurrences of one of the topic’s words, e.g., five occurrences of “woman” in one paragraph, interest us? The parameter can be set to rank scores as a function of the occurrences of the different words forming the topic, meaning that the score for a topic in a captured passage is proportional to the heterogeneity of the occurrences of the topic’s words, not only to their quantity. However, in some cases we might, as mentioned, have a topic comprising several forms of the same lexeme and its semantic relatives and, as described, several versions of the same word due to OCR errors. How can the topic model be programmed to take such occurrences into account in the search for matching passages? In order to meet this challenge, a «hyperlexeme sensitive» algorithm has been created(6). This means that the topic model is parameterized to count the lexeme frequency in a passage. It will also rank the scores as a function of the occurrence of the hyperlexeme, and will not treat occurrences of different forms of one lexeme as equal to occurrences of more semantically heterogeneous word-units in the topic. Furthermore, and this is the point to be stressed, this algorithm is programmed to treat misspellings of words due to OCR errors as if they were different versions of the same hyperlexeme.
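One possible reading of hyperlexeme-sensitive scoring is sketched below in Python. It is not the algorithm developed at the National Library, whose details are not given here. The lookup table, the labels `KVINDE` and `KJÆRLIGHED`, and the scoring rule (share of distinct hyperlexemes present in a passage) are all assumptions for illustration; the word forms are taken from the examples in the text.

```python
from collections import Counter

# Hypothetical table: every surface form, including OCR misspellings,
# maps to the hyperlexeme it is taken to represent.
HYPERLEXEMES = {
    "kvinde": "KVINDE", "kvinder": "KVINDE",
    "kvindelig": "KVINDE", "kvindelighed": "KVINDE", "kuinde": "KVINDE",
    "kjærlighed": "KJÆRLIGHED", "kjaerlighed": "KJÆRLIGHED",
    "kjcerlighed": "KJÆRLIGHED", "kjcrrlighed": "KJÆRLIGHED",
}

def hyperlexeme_counts(passage, table=HYPERLEXEMES):
    """Count occurrences per hyperlexeme rather than per surface form."""
    tokens = passage.lower().split()
    return Counter(table[t] for t in tokens if t in table)

def heterogeneity_score(passage, table=HYPERLEXEMES):
    """Share of the topic's distinct hyperlexemes found in the passage.

    Repeating forms of one lexeme does not raise the score; only the
    presence of a different hyperlexeme does.
    """
    found = hyperlexeme_counts(passage, table)
    return len(found) / len(set(table.values()))

print(heterogeneity_score("kvinde kvinder kvindelig"))  # one hyperlexeme of two
print(heterogeneity_score("kvinde kjcerlighed"))        # both hyperlexemes
```

Under this rule a passage saturated with variants of a single word scores lower than a shorter passage touching several distinct hyperlexemes, and an OCR variant such as «kjcerlighed» counts toward the same hyperlexeme as the correct spelling.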

The adjustments of the value of the Jaccard similarity and the hyperlexeme parameterization are thus measures taken in order to compensate for the drawbacks mentioned, and to improve and refine the topic model. I will show examples that compare results before and after these parameters were used, in order to discuss how much closer we have come to being able to establish actual links between the sub-corpus and the passages the topics have captured in the target corpus. All the technical concepts will be defined and briefly explained as I get to them in the presentation. The genesis of these measures, tools and ideas at crucial moments in the process, taking place as a result of unexpected findings and interdisciplinary collaboration, will be elaborated on in my presentation, as well as the potential this might offer for new research.

Notes:

(1) My description of the STM process, with the use of tropes such as «trawl line» is inspired by Peter Leonard and Timothy R. Tangherlini (2013): “Trawling in the Sea of the Great Unread: Sub-corpus topic modeling and Humanities research” in Poetics. 41, 725-749

(2) The Jaccard index is taken into account in the ranking of the scores. The best hit passage for a topic, the one with the highest score, will be the one with the highest relative similarity to the other captured passages, in terms of concentration of topic words in the passage. The parameterized value of the required Jaccard similarity defines the score a passage must receive in order to be included in the list of captured passages from the «great unread».

(3) Some related challenges were described by Kimmo Kettunen and Teemu Ruokolainen in their presentation, «Tagging Named Entities in 19th century Finnish Newspaper Material with a Variety of Tools» at DHN2017.

(4) Franco Moretti (2000), drawing on Margaret Cohen, calls the enormous number of works that exist in the world «the great unread» (limited to Bokhylla’s content in the context of my project) in «Conjectures on World Literature», New Left Review. 1, 54-68.

(5) As an alternative to including in the topic all detected spelling variations of the topic words due to OCR errors, we will experiment with taking the Levenshtein distance into account when programming the «trawl line». In that case, what matters is not identity between a topic word and a word in a passage in the great unread, but the distance between the two words, that is, the minimum number of single-character edits required to change one word into the other, for instance «kuinde» -> «kvinde».

(6) By the term «hyperlexeme» we understand a collection of graphemic occurrences of a lexeme, including spelling errors and semantically related forms.
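The Levenshtein distance mentioned in note (5) can be computed with the standard dynamic-programming recurrence. The Python sketch below is a generic textbook implementation, not the project's code; the helper `fuzzy_match` and its `max_distance` parameter are assumptions showing how a distance threshold could replace exact matching.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def fuzzy_match(word, topic_word, max_distance=1):
    """Hypothetical matching rule: accept words within max_distance edits."""
    return levenshtein(word, topic_word) <= max_distance

print(levenshtein("kuinde", "kvinde"))           # 1: one substitution
print(levenshtein("kjcrrlighed", "kjærlighed"))  # 2: one substitution, one deletion
```

With `max_distance=1` the variant «kuinde» would match «kvinde», whereas the OCR variants of «kjærlighed» quoted above would need a threshold of 2, which also raises the risk of matching unrelated short words.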