Conference Agenda

Digital Humanities in the Nordic Countries 3rd Conference

Location: PIII
 
Date: Wednesday, 07/Mar/2018
4:00pm - 5:30pm  W-PIII-1: Computational Linguistics 1
Session Chair: Lars Borin
PIII 
 
4:00pm - 4:30pm
Long Paper (20+10min) [abstract]

Dialects of Discord. Using word embeddings to analyze preferred vocabularies in a political debate: nuclear weapons in the Netherlands 1970-1990

Ralf Futselaar, Milan van Lange

NIOD, Institute for War-, Holocaust-, and Genocide Studies

We analyze the debate about the placement of nuclear-enabled cruise missiles in the Netherlands during the 1970s and 1980s. The NATO “double-track decision” of 1979 envisioned the placement of these weapons in the Netherlands, to which the Dutch government eventually agreed in 1985. In the early 1980s, the controversy regarding placement or non-placement of these missiles led to the greatest popular protests in Dutch history and to a long and often bitter political controversy. Due to declining tensions between the Soviet Bloc and NATO after 1985, the cruise missiles were never stationed in the Netherlands. Much older nuclear warheads, in the country since the early 1960s, remain there to this day.

We use word embeddings to analyze this sharply polarized debate in the proceedings of the lower and upper houses of the Dutch Parliament. The official political positions, as expressed in party manifestos and voting behavior inside parliament, were stable throughout this period. We demonstrate that, in spite of this apparent stability, the vocabularies used by representatives of different political parties changed significantly over time.

Using the word2vec algorithm, we have created a combined vector including all synonyms and near-synonyms of “nuclear weapon” used in the proceedings of both houses of parliament during the period under scrutiny. Based on this combined vector, and again using word2vec, we have identified the nearest neighbors of words used to describe nuclear weapons. These terms have been manually classified, insofar as relevant, into terms associated with a pro-proliferation or an anti-proliferation viewpoint, for example “defense” and “disarmament” respectively.
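A minimal sketch of the kind of query described above, assuming a gensim Word2Vec model trained on the parliamentary proceedings; the model file name and the Dutch seed terms are illustrative placeholders, not the authors' actual data:

```python
# Sketch: build a combined "nuclear weapon" vector and inspect its nearest
# neighbours, roughly as described in the abstract. Assumes a gensim
# Word2Vec model trained on the proceedings; seed terms are examples only.
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("handelingen_1970_1990.w2v")  # hypothetical model file

seed_terms = ["kernwapen", "kruisraket", "atoomwapen"]  # near-synonyms of "nuclear weapon"
vectors = [model.wv[t] for t in seed_terms if t in model.wv]
combined = np.mean(vectors, axis=0)  # combined vector for the concept

# Nearest neighbours of the combined vector; these would then be classified
# manually as pro- or anti-proliferation vocabulary.
for word, score in model.wv.similar_by_vector(combined, topn=25):
    print(f"{word}\t{score:.3f}")
```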

Obviously, representatives of all Dutch political parties used words from both categories in parliamentary debates. We demonstrate, however, that at any given time different political parties had clear preferences in terms of vocabulary. In the “discursive space” created by the binary opposition between pro- and contra-proliferation words, political parties can be shown to have specific and distinct ways of discussing nuclear weapons.

Using this framework, we have analyzed the changing vocabularies of different political parties. This allows us to show that, while stated policy positions and voting behavior remained unchanged, the language used to discuss nuclear weapons shifted strongly towards anti-proliferation terminology. We have also been able to show that this change happened at different times for different political parties. We speculate that these changes resulted from perceived changes of opinion among the target electorates of different parties, as well as the changing geopolitical climate of the mid-to-late 1980s, where nuclear non-proliferation became a more widely shared policy objective.

In the conclusion of this paper, we show that word embedding models offer a methodology for investigating shifting political attitudes outside of, and in addition to, stated opinions and voting patterns.


4:30pm - 4:45pm
Distinguished Short Paper (10+5min) [publication ready]

Emerging Language Spaces Learned From Massively Multilingual Corpora

Jörg Tiedemann

University of Helsinki

Translations capture important information about languages that can be used as implicit supervision in learning linguistic properties and semantic representations. Translated texts are semantic mirrors of the original text, and the significant variation that we can observe across languages can be used to disambiguate the meaning of a given expression using the linguistic signal that is grounded in translation. Parallel corpora consisting of massive amounts of human translations with large linguistic variation can be used to increase abstraction, and we propose the use of highly multilingual machine translation models to find language-independent meaning representations. Our initial experiments show that neural machine translation models can indeed learn in such a setup, and that the learning algorithm picks up information about the relations between languages in order to optimize transfer learning with shared parameters. The model creates a continuous language space that represents relationships in terms of geometric distances, which we can visualize to illustrate how languages cluster according to language families and groups. With this, we can see a development in the direction of data-driven typology -- a promising approach to empirical cross-linguistic research in the future.
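A rough sketch of how such a language space could be visualised, assuming per-language embedding vectors have already been extracted from the multilingual NMT model (the extraction itself depends on the toolkit and is not shown); the file name and clustering choices are assumptions:

```python
# Sketch: visualising a "language space", assuming one embedding vector per
# target-language token has been saved to an .npz file keyed by language code.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

lang_vectors = np.load("language_embeddings.npz")  # hypothetical file
langs = sorted(lang_vectors.files)
matrix = np.stack([lang_vectors[l] for l in langs])

# Hierarchical clustering of the language vectors; related languages are
# expected to merge early if the model has picked up genealogical structure.
dendrogram(linkage(matrix, method="ward"), labels=langs)
plt.tight_layout()
plt.savefig("language_clusters.png")
```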


4:45pm - 5:15pm
Long Paper (20+10min) [publication ready]

Digital cultural heritage and revitalization of endangered Finno-Ugric languages

Anisia Katinskaia, Roman Yangarber

University of Helsinki, Department of Computer Science

Preservation of linguistic diversity has long been recognized as a crucial, integral part of supporting our cultural heritage. Yet many ”minority” languages—lacking state official status—are in decline, many severely endangered. We present a prototype system aimed at ”heritage” speakers of endangered Finno-Ugric languages. Heritage speakers are people who have heard the language used by the older generations while they were growing up, and possess a considerable passive competency (well beyond the ”beginner” level), but are lacking in active fluency. Our system is based on natural language processing and artificial intelligence. It assists the learners by allowing them to use arbitrary texts of their choice, and by creating exercises that require them to engage in active production of language—rather than in passive memorization of material. Continuous automatic assessment helps guide the learner toward improved fluency. We believe that providing such AI-based tools will help bring these languages to the forefront of the modern digital age, raise their prestige, and encourage the younger generations to become involved in reversing the decline.


5:15pm - 5:30pm
Short Paper (10+5min) [publication ready]

The Fractal Structure of Language: Digital Automatic Phonetic Analysis

William A Kretzschmar Jr

University of Georgia

In previous studies of the Linguistic Atlas data from the Middle and South Atlantic States (e.g. Kretzschmar 2009, 2015), it has been shown that the frequency profiles of variant lexical responses to the same cue are all patterned in nonlinear A-curves. Moreover, these frequency profiles are scale-free, in that the same A-curve patterns occur at every level of scale. In this paper, I will present results from a new study of Southern American English that, when completed, will include over one million vowel measurements from interviews with a sample of sixty-four speakers across the South. Our digital methods, an adaptation of the DARLA and FAVE tools for forced alignment and automatic formant extraction, prove that speech outside of the laboratory or controlled settings can be processed by automatic means on a large scale. Measurements in F1/F2 space are analyzed using point-pattern analysis, a technique for spatial data, which allows for the creation and comparison of results without assumptions of central tendency. This Big Data resource allows us to see the fractal structure of language more completely. Not only do A-curve patterns describe the frequency profiles of lexical and IPA tokens, but they also describe the distribution of measurements of vowels in F1/F2 space, for groups of speakers, for individual speakers, and even for separate environments in which vowels occur. These findings are highly significant for how linguists make generalizations about phonetic data. They challenge the boundaries that linguists have traditionally drawn, whether geographic, social, or phonological, and demand that we use a new model for understanding language variation.
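A toy sketch of the kind of frequency profile behind an A-curve claim; the variant responses below are invented, not Atlas data:

```python
# Sketch: rank the variant responses to a single cue by frequency. An A-curve
# means a few very frequent variants and a long tail of rare ones, at every
# level of scale. The response list here is a toy example.
from collections import Counter
import matplotlib.pyplot as plt

responses = ["pail", "bucket", "bucket", "pail", "bucket", "piggin", "bucket"]
profile = Counter(responses).most_common()

ranks = list(range(1, len(profile) + 1))
freqs = [count for _, count in profile]
plt.plot(ranks, freqs, marker="o")
plt.xlabel("rank")
plt.ylabel("frequency")
plt.savefig("a_curve.png")
```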

 

 
Date: Thursday, 08/Mar/2018
11:00am - 12:30pm  T-PIII-1: Open and Closed
Session Chair: Olga Holownia
PIII 
 
11:00am - 11:30am
Long Paper (20+10min) [abstract]

When Open becomes Closed: Findings of the Knowledge Complexity (KPLEX) Project.

Jennifer Edmond, Georgina Nugent Folan, Vicky Garnett

Trinity College Dublin, Ireland

The future of cultural heritage seems to be all about “data.” A Google search on the term ‘data’ returns over 5.5 billion hits, but the fact that the term is so well embedded in modern discourse does not necessarily mean that there is a consensus as to what it is or should be. The lack of consensus regarding what data are on a small scale acquires greater significance and gravity when we consider that one of the major terminological forces driving ICT development today is that of "big data." While the phrase may sound inclusive and integrative, "big data" approaches are highly selective, excluding any input that cannot be effectively structured, represented, or, indeed, digitised. The future of DH, of any approach to understanding complex phenomena or sources such as are held in cultural heritage institutions, indeed the future of our increasingly datafied society, depends on how we address the significant epistemological fissures in our data discourse. For example, how can researchers claim that "when we speak about data, we make no assumptions about veracity" while one of the requisites of "big data" is "veracity"? On the other hand, how can we expect humanities researchers to share their data on open platforms such as the European Open Science Cloud (EOSC) when we, as a community, resist the homogenisation implied and required by the very term “data”, and share our ownership of it with both the institutions that preserve it and the individuals that created it? How can we strengthen European identities and transnational understanding through the use of ICT systems when these very systems incorporate and obscure historical biases between languages, regions and power elites? In short, are we facing a future when the mirage of technical “openness” actually closes off our access to the perspectives, insight and information we need as scholars and as citizens? Furthermore, how might this dystopic vision be avoided?

These are the kinds of questions and issues under investigation by the European Horizon 2020-funded Knowledge Complexity (KPLEX) project, which applies strategies developed by humanities researchers to deal with complex, messy cultural data: the very kind of data that resists datafication and poses the biggest challenges to knowledge creation in large data corpora environments. Arising out of the findings of the KPLEX project, this paper will present the synthesised findings of an integrated set of research questions and challenges addressed by a diverse team led by Trinity College Dublin (Ireland) and encompassing researchers in Freie Universität Berlin (Germany), DANS-KNAW (The Hague) and TILDE (Latvia). We have adopted a comparative, multidisciplinary, and multi-sectoral approach to addressing the issue of bias in big data, focussing on the following four key challenges to the knowledge creation capacity of big data approaches:

1. Redefining what data is and the terms we use to speak of it (TCD);

2. The manner in which data that are not digitised or shared become "hidden" from aggregation systems (DANS-KNAW);

3. The fact that data is human created, and lacks the objectivity often ascribed to the term (FUB);

4. The subtle ways in which data that are complex almost always become simplified before they can be aggregated (TILDE).

The paper will present a synthesised version of these integrated research questions, and discuss the overall findings and recommendations of the project, which completes its work at the end of March 2018. What follows gives a flavour of the work ongoing at the time of writing this abstract, and the issues that will be raised in the DHN paper.

1. Redefining what data is and the terms we use to speak of it. Many definitions of data, even thoughtful scholarly ones, associate the term with a factual or objective stance, as if data were a naturally occurring phenomenon. But data is not fact, nor is it objective, nor can it be honestly aligned with terms such as ‘signal’ or ‘stimulus,’ or the quite visceral (but misleading) ‘raw data.’ To become data, phenomena must be captured in some form, by some agent; signal must be separated from noise, like must be organised against like, transformations occur. These organisational processes are human determined or human led, and therefore cannot be seen as wholly objective, irrespective of how effective a (human built) algorithm may be. The core concerns of this facet of the project were to expand the understanding of the heterogeneity of definitions of data, and the implications of this state of understanding. Our primary ambition under this theme was to establish a clear taxonomy of existing theories of data, to underpin a more applied comparison of humanistic versus technical applications of the term. We did this by identifying the key terms (and how they are used differently), key points of bifurcation, and key priorities under each conceptualisation of data. As such, this facet of the project supported the integrated advancement of the three other project themes, as well as itself developing new perspectives on the rhetorical stakes and action implications of differing concepts of the term ‘data’ and how these will impact on the future not only of DH but of society at large.

2. Dealing with ‘hidden’ data. According to the 2013 ENUMERATE Core 2 survey, only 17% of the analogue collections of European heritage institutions had at that time been digitised. This number actually represents a decrease over the findings of their 2012 survey (almost 20%). The survey also reached only a limited number of respondents: 1400 institutions over 29 countries, which surely captures the major national institutions but not local or specialised ones. Although the ENUMERATE Core 2 report does not break down these results by country, one has to imagine that there would be large gaps in the availability of data from some countries over others. Because so much of this data has not been digitised, it remains ‘hidden’ from potential users. This may have always been the case, as there have always been inaccessible collections, but in a digital world, the stakes and the perceptions are changing. The fact that so much other material is available on-line, and that an increasing proportion of the most well-used and well-financed cultural collections are as well, means that the reasonable assumption of the non-expert user of these collections is that what cannot be found does not exist (whereas in the analogue age, collections would be physically contextualised with their complements, leaving the more likely assumption to be that more information existed, but could not be accessed). The threat that our narratives of histories and national identities might thin out to become based on only the most visible sources, places and narratives is high. This facet of the project explored the manner in which data that are not digitised or shared become "hidden" from aggregation systems.

3. Knowledge organisation and epistemics of data. The nature of humanities data is such that even within the digital humanities, where research processes are better optimised toward the sharing of digital data, sharing of "raw data" remains the exception rather than the norm. The ‘instrumentation’ of the humanities researcher consists of a dense web of primary, secondary and methodological or theoretical inputs, which the researcher traverses and recombines to create knowledge. This synthetic approach makes the nature of the data, even at its ‘raw’ stage, quite hybrid, and already marked by the curatorial impulse that is preparing it to contribute to insight. This aspect may be more pronounced in the humanities than in other fields, but the subjective element is present in any human triggered process leading to the production or gathering of data. Another element of this is the emotional. Emotions are motivators for action and interaction that relate to social, cultural, economic and physiological needs and wants. Emotions are crucial factors in relating or disconnecting people from each other. They help researchers to experientially assess their environments, but this aspect of the research process is considered taboo, as noise that obscures the true ‘factual signal’, and as less ‘scientific’ (seen in terms of strictly Western colonialist paradigms of knowledge creation) than other possible contributors to scientific observation and analysis. Our primary ambition here was to explore the data creation processes of the humanities and related research fields to understand how they combine pools of information and other forms of intellectual processing to create data that resists datafication and ‘like-with-like’ federation with similar results. The insights gained will make visible many of the barriers to the inclusion of all aspects of science under current Open Science trajectories, and reveal further central elements of social and cultural knowledge that are unable to be accommodated under current conceptualisations of ‘data’ and the systems designed to use them.

4. Cultural data and representations of system limitations. Cultural signals are ambiguous, polysemic, often conflicting and contradictory. In order to transform culture into data, its elements – as all phenomena that are being reduced to data – have to be classified, divided, and filed into taxonomies and ontologies. This process of 'data-fication' robs them of their polysemy, or at least reduces it. One of the greatest challenges for so-called Big Data is the analysis and processing of multilingual content. This challenge is particularly acute for unstructured texts, which make up a large portion of the Big Data landscape. How do we deal with multilingualism in Big Data analysis? What are the techniques by which we can analyze unstructured texts in multiple languages, extracting knowledge from multilingual Big Data? Will new computational techniques such as AI deep learning improve or merely alter the challenges? The current method for analyzing multilingual Big Data is to leverage language technologies such as machine translation, terminology services, automated speech recognition, and content analytics tools. In recent years, the quality and accuracy of these key enabling technologies for Big Data has improved substantially, making them indispensable tools for high-demand applications with a global reach. However, just as not all languages are alike, the development of these technologies differs for each language. Larger languages with high populations have robust digital resources for their languages, the result of large-scale digitization projects in a variety of domains, including cultural heritage information. Smaller languages have resources that are much more scant. Those resources that do exist may be underpinned by far less robust algorithms and far smaller bases for the statistical modelling, leading to less reliable results, a fact that in large scale, multilingual environments (like Google translate) is often not made transparent to the user. The KPLEX project is exploring and describing the nature and potential for ‘data’ within these clearly defined sets of actors and practices at the margins of what is currently able to be approached holistically using computational methods. It is also envisioning approaches to the integration of hybrid data forms within and around digital platforms, leading not so much to the virtualisation of information generation as approaches to its augmentation.


11:30am - 11:45am
Short Paper (10+5min) [publication ready]

Open, Extended, Closed or Hidden Data of Cultural Heritage

Tuula Pääkkönen1, Juha Rautiainen1, Toni Ryynänen2, Eeva Uusitalo2

1National Library of Finland, Finland; 2Ruralia Institute, University of Helsinki, Finland

The National Library of Finland (NLF) agreed on an “Open National Library” policy in 2016 [1]. The policy contains eight principles, divided into the themes of accessibility, openness in actions, and collaboration. Accessibility in the NLF means that access to the material needs to exist both for the metadata and the content, while respecting the rights of the rights holders. Openness in operations means that our actions and decision models are transparent and clear, and that the materials are accessible to researchers and other users. This is one way in which the NLF can implement the findable, accessible, interoperable, re-usable (FAIR) data principles [2] in practice.

The purpose of this paper is to view the way in which the policy has impacted our work and how findability and accessibility have been implemented, in particular from the aspects of open, extended, closed and hidden data. In addition, our aim is to specify the characteristics of existing and potential forms of data produced by the NLF from the research and development perspectives. A continuous challenge is the availability of the digital resources – gaining access to the digitised material for both researchers and the general public, since there are also constant requests for access to newer materials outside the legal deposit libraries’ workstations.


11:45am - 12:00pm
Distinguished Short Paper (10+5min) [publication ready]

Aalto Observatory for Digital Valuation Systems

Jenni Huttunen1, Maria Joutsenvirta2, Pekka Nikander1

1Aalto University, Department of Communications and Networking; 2Aalto University, Department of Management Studies

Money is a recognised factor in creating sustainable, affluent societies. Yet the neoclassical orthodoxy that prevails in our economic thinking remains a contested area, its supporters claiming their results to be objectively true while many heterodox economists claim the whole system stands on feet of clay. Of late, the increased activity around complementary currencies suggests that the fiat money zeitgeist might be giving way to more variety in our monetary system. Rather than emphasizing what money does, as mainstream economists do, other fields of science allow us to approach money as an integral part of the hierarchies and networks of exchange through which it circulates. This paper suggests that a broad understanding of money and more variety in the monetary system have great potential to further a more egalitarian and sustainable economy. They can drive the extension of society to more inclusive levels and transform people’s economic roles and identities in the process. New technologies, including blockchain and smart ledger technology, are able to support decentralized money creation through the use of shared and “open” peer-to-peer rewarding and IOU systems. Alongside specialists’ and decision makers’ capabilities, our project most pressingly calls for engaging citizens in the process early on. Multidisciplinary competencies are needed to take relevant action to investigate, envision and foster novel ways of value creation. For this, we are forming the Aalto Observatory on Digital Valuation Systems to gain deeper understanding of sustainable value creation structures enabled by new technology.


12:00pm - 12:15pm
Short Paper (10+5min) [publication ready]

Challenges and perspectives on the use of open cultural heritage data across four different user types: Researchers, students, app developers and hackers

Ditte Laursen1, Henriette Roued-Cunliffe2, Stig Svennigsen1

1Royal Danish Library; 2University of Copenhagen

In this paper, we analyse and discuss, from a user perspective and from an organisational perspective, the challenges and perspectives of the use of open cultural heritage data. We base our study on empirical evidence gathered through four cases where we have interacted with four different user groups: 1) researchers, 2) students, 3) app developers and 4) hackers. Our own role in these cases was to engage with these users as teachers, organizers and/or data providers. The cultural heritage data we provided were accessible as curated data sets or through APIs. Our findings show that successful use of open heritage data is highly dependent on organisations' ability to calibrate and curate the data differently according to contexts and settings. More specifically, we show what needs and motivations different user types have for using open cultural heritage data, and we discuss how these can be met by teachers, organizers and data providers.


12:15pm - 12:30pm
Short Paper (10+5min) [abstract]

Synergy of contexts in the light of digital humanities: a pilot study

Monika Porwoł

State University of Applied Sciences in Racibórz

The present paper describes a pilot study pertaining to the linguistic analysis of meaning with regard to the word ladder[EN]/drabina[PL], viewed from the perspective of digital humanities. WordnetLoom mapping is introduced as one of the existing research tools proposed by the CLARIN ERIC research and technology infrastructure. The material comprises retrospective remarks and interpretations provided by 74 respondents who took part in a survey. A detailed classification of the word's multiple meanings is presented in tabular form (showing the number of contexts in which participants accentuate the word ladder/drabina), along with some comments and opinions. The results suggest that, apart from the general domain of the word offered for consideration, most of its senses can usually be attributed to linguistic recognitions. Moreover, some perspectives on the continuation of future research and critical afterthoughts are made prominent in the last part of this paper.

 
2:00pm - 3:30pm  T-PIII-2: Language Resources
Session Chair: Kaius Sinnemäki
PIII 
 
2:00pm - 2:30pm
Long Paper (20+10min) [publication ready]

Sentimentator: Gamifying Fine-grained Sentiment Annotation

Emily Sofi Öhman, Kaisla Kajava

University of Helsinki

We introduce Sentimentator, a publicly available gamified web-based annotation platform for fine-grained sentiment annotation at the sentence level. Sentimentator is unique in that it moves beyond binary classification: we use a ten-dimensional model which allows for the annotation of over 50 unique sentiments and emotions. The platform is heavily gamified, with a scoring system designed to reward players for high-quality annotations. Sentimentator introduces several features that have previously not been available, or at best very limited, for sentiment annotation. In particular, it provides streamlined multi-dimensional annotation optimized for sentence-level annotation of movie subtitles. The resulting dataset will allow new avenues to be explored, particularly in the field of digital humanities, but also in knowledge-based sentiment analysis in general. Because both the dataset and the platform will be made publicly available, they will benefit anyone interested in fine-grained sentiment analysis and emotion detection, as well as in the annotation of other datasets.


2:30pm - 2:45pm
Distinguished Short Paper (10+5min) [publication ready]

Defining a Gold Standard for a Swedish Sentiment Lexicon: Towards Higher-Yield Text Mining in the Digital Humanities

Jacobo Rouces, Lars Borin, Nina Tahmasebi, Stian Rødven Eide

University of Gothenburg

There is an increasing demand for multilingual sentiment analysis, and most work on sentiment lexicons is still carried out based on English lexicons like WordNet. In addition, many of the non-English sentiment lexicons that do exist have been compiled by (machine) translation from English resources, thereby arguably obscuring possible language-specific characteristics of sentiment-loaded vocabulary.

In this paper we describe the creation of a gold standard for the sentiment annotation of Swedish terms as a first step towards the creation of a full-fledged sentiment lexicon for Swedish -- i.e., a lexicon containing information about prior sentiment (also called polarity) values of lexical items (words or disambiguated word senses), along a negative--positive scale. We create a gold standard for sentiment annotation of Swedish terms, using the freely available SALDO lexicon and the Gigaword corpus. For this purpose, we employ a multi-stage approach combining corpus-based frequency sampling with two stages of human annotation: direct score annotation followed by Best-Worst Scaling. In addition to obtaining a gold standard, we analyze the data from our process and draw conclusions about the optimal sentiment model.
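As an illustration of the second annotation stage, here is a minimal sketch of the standard counting procedure for Best-Worst Scaling, assuming annotation tuples of (best term, worst term, terms shown); the Swedish items are invented, not taken from the gold standard:

```python
# Sketch: count-based Best-Worst Scaling scores. Each annotation records the
# item picked as most positive, the item picked as most negative, and the
# full tuple of items shown. The data below are invented toy examples.
from collections import Counter

annotations = [
    ("glädje", "hat", ["glädje", "hus", "hat", "regn"]),
    ("hus", "hat", ["glädje", "hus", "hat", "regn"]),
]

best, worst, shown = Counter(), Counter(), Counter()
for b, w, items in annotations:
    best[b] += 1
    worst[w] += 1
    for item in items:
        shown[item] += 1

# Score in [-1, 1]: proportion chosen best minus proportion chosen worst.
scores = {t: (best[t] - worst[t]) / shown[t] for t in shown}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```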


2:45pm - 3:00pm
Short Paper (10+5min) [publication ready]

The Nordic Tweet Stream: A dynamic real-time monitor corpus of big and rich language data

Mikko Laitinen1, Jonas Lundberg2, Magnus Levin3, Rafael Martins4

1University of Eastern Finland; 2Linnaeus University; 3Linnaeus University; 4Linnaeus University

This article presents the Nordic Tweet Stream (NTS), a cross-disciplinary corpus project of computer scientists and a group of sociolinguists interested in language variability and in the global spread of English. Our research integrates two types of empirical data: we not only rely on traditional structured corpus data but also use unstructured data sources that are often big and rich in metadata, such as Twitter streams. The NTS downloads tweets and associated metadata from Denmark, Finland, Iceland, Norway and Sweden. We first introduce some technical aspects of creating a dynamic real-time monitor corpus, and the following case study illustrates how the corpus could be used as empirical evidence in sociolinguistic studies focusing on the global spread of English to multilingual settings. The results show that English is the most frequently used language, accounting for almost a third of the tweets. These results can be used to assess how widespread English use is in the Nordic region and offer a big data perspective that complements previous small-scale studies. The future objectives include annotating the material, making it available to the scholarly community, and expanding the geographic scope of the data stream outside the Nordic region.
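A small sketch of how the share of English might be estimated from a downloaded stream, assuming the tweets are stored one JSON object per line and carry Twitter's machine-detected "lang" field; the file name is a placeholder:

```python
# Sketch: language distribution in a tweet stream stored as JSON lines.
import json
from collections import Counter

langs = Counter()
with open("nts_2017.jsonl", encoding="utf-8") as stream:
    for line in stream:
        tweet = json.loads(line)
        langs[tweet.get("lang", "und")] += 1

total = sum(langs.values())
for lang, count in langs.most_common(10):
    print(f"{lang}: {count / total:.1%}")
```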


3:00pm - 3:15pm
Short Paper (10+5min) [publication ready]

Best practice for digitising small-scale Digital Humanities projects

Peggy Bockwinkel, Dîlan Cakir

University of Stuttgart, Germany

Digital Humanities (DH) are growing rapidly; the necessary infrastructure is being built up gradually and slowly. For smaller DH projects, e.g. for testing methods, as preliminary work for submitting applications, or for use in teaching, a corpus often has to be digitised. These small-scale projects make an important contribution to safeguarding and making available cultural heritage, as they make it possible to machine-read those resources that are of little or no interest to large projects because they are too special or too limited in scope. They close the gap between large scanning projects of archives, libraries or research projects on the one hand, and projects that move beyond the canonised paths on the other.

Yet these small projects can fail at this first step of digitisation, because it is often a hurdle for (Digital) Humanists at universities to get the desired texts digitised: either because the digitisation infrastructure in libraries/archives is not available (yet) or is a paid service. Also, researchers are often not digitising experts, and a suitable infrastructure at the university is missing.

In order to promote small DH projects for teaching purposes, a digitising infrastructure was set up at the University of Stuttgart as part of a teaching project. It should enable teachers to digitise smaller corpora autonomously.

This article presents a study that was carried out as part of this teaching project. It suggests how to implement best practices and which aspects of the digitisation workflow need special attention.

The target group of this article are (Digital) Humanists who want to digitise a smaller corpus. Even with no expertise in scanning and OCR and no possibility to outsource the digitisation of the project, they would still like to obtain the best possible machine-readable files.


3:15pm - 3:30pm
Distinguished Short Paper (10+5min) [publication ready]

Creating and using ground truth OCR sample data for Finnish historical newspapers and journals

Kimmo Kettunen, Jukka Kervinen, Mika Koistinen

University of Helsinki, Finland

The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12 million pages mainly in Finnish and Swedish. Out of these about 5.1 million pages are freely available on the web site digi.kansalliskirjasto.fi. The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1920. The last ten years, 1911–1920, were opened in February 2017.

This paper presents the ground truth Optical Character Recognition data of about 500 000 Finnish words that has been compiled at NLF for development of a new OCR process for the collection. We discuss compilation of the data and show basic results of the new OCR process in comparison to current OCR using the ground truth data.
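A minimal sketch of how OCR output can be scored against such ground truth data using character error rate (edit distance divided by ground-truth length); the file names are placeholders and this is not the NLF's actual evaluation pipeline:

```python
# Sketch: character error rate of OCR output against a ground truth text.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

truth = open("ground_truth.txt", encoding="utf-8").read()
ocr = open("ocr_output.txt", encoding="utf-8").read()
print(f"CER: {levenshtein(ocr, truth) / len(truth):.2%}")
```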

 
4:00pm - 5:30pm  T-PIII-3: Computational Literary Analysis
Session Chair: Mads Rosendahl Thomsen
PIII 
 
4:00pm - 4:30pm
Long Paper (20+10min) [abstract]

A Computational Assessment of Norwegian Literary “National Romanticism”

Ellen Rees

University of Oslo

In this paper, I present findings derived from a computational analysis of texts designated as “National Romantic” in Norwegian literary historiography. The term “National Romantic,” which typically designates literary works from approximately 1840 to 1860 that are associated with national identity formation, first appeared decades later, in Henrik Jæger’s Illustreret norsk litteraturhistorie from 1896. Cultural historian Nina Witoszek has on a number of occasions written critically about the term, claiming that it is misleading because the works it denotes have little to do with larger international trends in Romanticism (see especially Witoszek 2011). Yet, with the exception of a 1985 study by Asbjørn Aarseth, it has never been interrogated systematically in the way that other period designations such as “Realism” or “Modernism” have. Nor does Aarseth’s investigation attempt to delimit a definitive National Romantic corpus or account for the remarkable disparity among the works that are typically associated with the term. “National Romanticism” is like pornography—we know it when we see it, but it is surprisingly difficult to delineate in a scientifically rigorous way.

Together with computational linguist Lars G. Johnsen and research assistants Hedvig Solbakken and Thomas Rasmussen, I have prepared a corpus of 217 texts that are mentioned in connection with “National Romanticism” in the major histories of Norwegian literature and in textbooks for upper secondary instruction in Norwegian literature. I will briefly discuss some of the logistical challenges associated with preparing this corpus.

This corpus forms the point of departure for a computational analysis employing various text-mining methods in order to determine to what degree the texts most commonly associated with “National Romanticism” share significant characteristics. In the popular imagination, the period is associated with folkloristic elements such as supernatural creatures (trolls, hulders), rural farming practices (shielings, herding), and folklife (music, rituals) as well as nature motifs (birch trees, mountains). We therefore employ topic modeling in order to map the frequency and distribution of such motifs across time and genre within the corpus. We anticipate that topic modeling will also reveal unexpected results beyond the motifs most often associated with National Romanticism. This process should prepare us to take the next step and, inspired by Matthew Wilkens’ recent work generating “clusters” of varieties within twentieth-century U.S. fiction, create visualizations of similarities and differences among the texts in the National Romanticism corpus (Wilkens 2016).
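A minimal sketch of the topic-modeling step, assuming the corpus texts are available as plain-text files and using gensim's LDA implementation; the file names and the number of topics are illustrative assumptions, not the project's configuration:

```python
# Sketch: LDA topic modeling over the "National Romanticism" corpus to
# surface motif-like topics such as the folkloristic ones mentioned above.
from gensim import corpora, models

texts = [open(p, encoding="utf-8").read().lower().split()
         for p in ["asbjornsen_1845.txt", "welhaven_1844.txt"]]  # placeholder files

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(bow_corpus, num_topics=20, id2word=dictionary, passes=10)

for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)
```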

Based on these initial computational methods, we hope to be able to answer some of the following literary historical questions:

• Are there identifiable textual elements shared by the texts in the National Romantic canon?

• What actually defines a National Romantic text as National Romantic?

• Do these texts cluster in a meaningful way chronologically?

• Is “National Romanticism” in fact meaningful as a period designation, or alternately as a stylistic designation?

• Are there other texts that share these textual elements that are not in the canon?

• If so, why? Do gender, class or ethnicity have anything to do with it?

To answer the last two questions, we need to use the “National Romanticism” corpus as a sub-corpus and “trawl-line” within the full corpus of nineteenth-century Norwegian textual culture, carrying out sub-corpus topic modeling (STM) in order to determine where similarities with texts from outside the period 1840–1860 arise (Tangherlini and Leonard 2013). For the sake of expediency, we use the National Library of Norway’s Digital Bookshelf as our full corpus, though we are aware that there are significant subsets of Norwegian textual culture that are not yet included in this corpus. Despite certain limitations, the Digital Bookshelf is one of the most complete digital collections of a national textual culture currently available.

For the purposes of DHN 2018, this project might best be categorized as an exploration of cultural heritage, understood in two ways. On the one hand, the project is entirely based on the National Library of Norway’s Digital Bookshelf platform, which, as an attempt to archive as much as possible of Norwegian textual culture in a digital and publicly accessible archive, is in itself a vehicle for preserving cultural heritage. On the other hand, the concept of “National Romanticism” is arguably the most widespread, but least critically examined means of linking cultural heritage in Norway to a specifically nationalist agenda.

References:

Jæger, Henrik. 1896. Illustreret norsk litteraturhistorie. Bind II. Kristiania: Hjalmar Biglers forlag.

Tangherlini, Timothy R. and Peter Leonard. 2013. “Trawling in the Sea of the Great Unread: Sub-Corpus Topic Modeling and Humanities Research.” Poetics 41.6: 725–749.

Wilkens, Matthew. 2016. “Genre, Computation, and the Varieties of Twentieth-Century U.S. Fiction.” CA: Journal of Cultural Analytics (online open-access)

Witoszek, Nina. 2011. The Origins of the “Regime of Goodness”: Remapping the Cultural History of Norway. Oslo: Universitetsforlaget.

Aarseth, Asbjørn. 1985. Romantikken som konstruksjon: tradisjonskritiske studier i nordisk litteraturhistorie. Bergen: Universitetsforlaget.


4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

Prose Rhythm in Narrative Fiction: the case of Karin Boye's Kallocain

Carin Östman, Sara Stymne, Johan Svedjedal

Uppsala University

Swedish author Karin Boye’s (1900-1941) last novel Kallocain (1940) is an icily dystopian depiction of a totalitarian future. The protagonist Leo Kall first embraces this system, but for various reasons rebels against it. The peripety comes when he gives a public speech, questioning the State. It has been suggested (by the linguist Olof Gjerdman) that the novel – which is narrated in the first-person mode – from exactly this point on is characterized by a much freer rhythm (Gjerdman 1942). This paper sets out to test this hypothesis, moving on from a discussion of the concept of rhythm in literary prose to an analysis of various indicators in different parts of Kallocain and Boye’s other novels.

Work on this project started just a few weeks ago. So far we have performed preliminary experiments with simple text quality indicators, such as word length, sentence length, and the proportion of punctuation marks. For all these indicators we have compared the first half of the novel, up until the speech, with the second half of the novel, and as a contrast also with the "censor's addendum", a short final chapter of the novel written by an imaginary censor. For most of these indicators we find no differences between the two major parts of the novel. The only result that points to a stricter rhythm in the first half is that the proportion of long words, whether counted in characters or syllables, is considerably higher there. For instance, the percentage of words with at least five syllables is 1.85% in the first half and 1.03% in the second half.
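A small sketch of how such indicators can be computed, assuming the two halves are available as plain-text files and approximating Swedish syllables by vowel groups; the file names and the syllable heuristic are assumptions, not the authors' setup:

```python
# Sketch: simple text indicators (word length, sentence length, long words)
# for two parts of a novel. Syllables are approximated by vowel groups.
import re

VOWELS = "aeiouyåäö"

def indicators(text):
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    long_words = [w for w in words if len(re.findall(f"[{VOWELS}]+", w)) >= 5]
    return {
        "mean_word_len": sum(map(len, words)) / len(words),
        "mean_sent_len": len(words) / len(sentences),
        "pct_5plus_syllables": 100 * len(long_words) / len(words),
    }

for part in ["kallocain_first_half.txt", "kallocain_second_half.txt"]:
    text = open(part, encoding="utf-8").read()
    print(part, indicators(text))
```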

The other indicators that show a difference do not, however, support the hypothesis. In the first half, the sections are shorter, there are proportionally more speech utterances, and there is a higher proportion of three consecutive dots (...), which are often used to mark hesitation. If we compare these two halves to the censor's addendum, however, we can clearly see that the addendum is written in a stricter way, with, for instance, a considerably higher proportion of long words (4.90% of the words have more than five syllables) and sentences that are more than twice as long.

In future analysis, we plan to use more fine-tuned indicators, based on a dependency parse of the text, from which we can explore issues like phrase length and the proportion of sub-clauses. Separating out speech from non-speech also seems important. We also plan to explore the variation in our indicators, rather than just looking at averages, since this has been suggested in literature on rhythm in Swedish prose (Holm 2015).

Through this initial analysis we have also learned about some of the challenges of analyzing literature. For instance, it is not straightforward to separate speech from non-speech, since the ends of utterances are often not clearly marked in Kallocain, and free indirect speech is sometimes used. We think this would be important for future analysis, as would attribution of speech (Elson & McKeown, 2010), since the speech of the different protagonists cannot be expected to vary between the two parts to the same degree.

References

Boye, Karin (1940) Kallocain: roman från 2000-talet. Stockholm: Bonniers.

Elson, David K. and McKeown, Kathleen R. (2010) Automatic Attribution of Quoted Speech in Literary Narrative. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. The AAAI Press, Menlo Park, pp 1013–1019.

Gjerdman, Olof (1942) Rytm och röst. In Karin Boye. Minnen och studier. Ed. by M. Abenius and O. Lagercrantz. Stockholm: Bonniers, pp 143–160.

Holm, Lisa (2015) Rytm i romanprosa. In Det skönlitterära språket. Ed. by C. Östman. Stockholm: Morfem, pp 215–235.

Authors: Sara Stymne, Johan Svedjedal, Carin Östman (Uppsala University)


4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

The Dostoyevskian Trope: State Incongruence in Danish Textual Cultural Heritage

Kristoffer Laigaard Nielbo, Katrine Frøkjær Baunvig

University of Southern Denmark

In the history of prolific writers, we are often confronted with the figure of the suffering or tortured writer. Setting aside metaphysical theories, the central claim seems to be that a state-incongruent dynamic is an intricate part of the creative process. Two propositions can be derived from this claim: 1) the creative state is inversely proportional to the emotional state, and 2) the creative state is causally predicted by the emotional state. We call this creative-emotional dynamic ‘the Dostoyevskian Trope’. In this paper we present a method for studying the Dostoyevskian trope in prolific writers. The method combines Shannon entropy, as an indicator of lexical density and readability, with fractal analysis in order to measure creative dynamics over multiple documents. We generate a sentiment time series from the same documents and test for causal dependencies between the creative and sentiment time series. We illustrate the method by searching for the Dostoyevskian trope in Danish textual cultural heritage, specifically in three highly prolific writers from the 19th century: N.F.S. Grundtvig, H.C. Andersen, and S.A. Kierkegaard.
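A minimal sketch of the entropy side of such an analysis: one Shannon-entropy value per document as a rough lexical-density indicator, which would then be paired with a sentiment series and tested for causal dependence; the file names and tokenisation are illustrative assumptions:

```python
# Sketch: a Shannon-entropy series over a writer's works, one value per
# document. The file list is hypothetical; tokenisation is whitespace-based.
import math
from collections import Counter

def shannon_entropy(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

series = []
for path in ["grundtvig_1810.txt", "grundtvig_1811.txt"]:  # placeholder files
    tokens = open(path, encoding="utf-8").read().lower().split()
    series.append(shannon_entropy(tokens))

print(series)  # this series would then be tested against a sentiment series
```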


5:00pm - 5:30pm
Long Paper (20+10min) [abstract]

Interdisciplinary advancement through the unexpected: Mapping gender discourses in Norway (1840-1913) with Bokhylla

Heidi Karlsen

University of Oslo

This presentation discusses challenges related to sub-corpus topic modeling in the study of gender discourses in Norway from 1840 till 1913 and the role of interdisciplinary collaboration in this process. Through collaboration with the Norwegian National Library, data-mining techniques are used in order to retrieve data from the digital source, Bokhylla [«the Digital Bookshelf»], for the analysis of women’s «place» in society and the impact of women writers on this discourse. My project is part of the research project «Data-mining the Digital Bookshelf», based at the University of Oslo.

1913, the closing year of the period I study, is the year of women’s suffrage in Norway. I study the impact women writers had on the debate in Norway regarding women’s «place» in society during the approximately 60 years before women were granted the right to vote. A central hypothesis for my research is that women writers in the period had an underestimated impact on gender discourses, especially in defining and loading key words with meaning (drawing mainly on Norman Fairclough’s theoretical framework for discourse analysis). In this presentation, I examine a selection of Swedish writer Fredrika Bremer’s texts and their impact on gender discourses in Norway.

The Norwegian National Library’s Digital Bookshelf is the main source for the historical documents I use in this project. The Digital Bookshelf includes a vast amount of text published in Norway over several centuries, texts of a great variety of genres, and thus offers unique access to our cultural heritage. Sub-corpus topic modeling (STM) is the main tool that has been used to process the Digital Bookshelf texts for this analysis. A selection of Bremer’s work has been assembled into a sub-corpus. Topics have then been generated from this corpus and applied to the full Digital Bookshelf corpus. Throughout the process, the collaboration with the National Library has been essential in overcoming technical challenges. I will reflect upon this collaboration in my presentation. As the data are retrieved and then analyzed by me as a humanities scholar, and weaknesses in the data are detected, the programmer at the National Library who assists us on the project presents, modifies and develops tools to meet our challenges. These tools might in turn offer additional possibilities beyond what they were proposed for. New ideas in my research design may emerge as a result. Concurrently, the algorithms created at such a stage in the process might successively be useful for scholars in completely different research projects. I will mention a few examples of such mutually productive collaboration, and briefly reflect upon how these issues relate to questions regarding open science.

In this STM process, several challenges have emerged along the way, mostly related to OCR errors. Some illustrative examples of passages with such errors will be presented for the purpose of discussing the measures undertaken to face the problems they give rise to, but also to demonstrate the unexpected progress stemming from these «defective» data. The topics used as a «trawl line»(1) in the initial phase of this study produced few results. Our first attempt to get more results was to revise down the required Jaccard similarity(2). This entails that the quantity of a topic that has to be identified in a passage in order for it to qualify as a hit is lowered. As this required topic quantity was lowered, a great number of results were obtained. The obvious weakness of these results, however, is that the rather low required topic match, or relatively low value of the required Jaccard similarity, does not allow us to affirm a connection between these passages and Bremer’s text. Nevertheless, the results have still been useful, for two reasons. Some of the data have proven to be valuable sources for the mapping of gender discourses, although not indicating anything regarding women writers’ impact on them. Moreover, these passages have served to illustrate many of the varieties of OCR errors that my topic words give rise to in texts from the period I study (frequently in Gothic typeface). This discovery has then been used to improve the topics, which takes us to the next step in the process.
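One simple reading of the required topic match is a Jaccard similarity threshold between the topic's word set and a passage's word set; the sketch below uses invented topic words, a toy passage, and an arbitrary threshold, not the project's data or parameters:

```python
# Sketch: score a passage against a topic with Jaccard similarity and keep
# it only above a tunable threshold, in the spirit of the "trawl line".
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

topic = {"kvinde", "kjærlighed", "hjem", "moder", "pligt"}
passage = "den unge kvinde følte kjærlighed til sit hjem og sin moder".split()

score = jaccard(topic, set(passage))
if score >= 0.3:  # the "required Jaccard similarity" is a tunable parameter
    print(f"hit with score {score:.2f}")
```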

In certain documents one and the same word in the original text has, in the scanning of the document, given rise to up to three different OCR errors(3). This discovery indicates the risk of missing out on potentially relevant documents in the «great unread»(4). If only the correct spelling of the words is included in the topics, potentially valuable documents containing our topic words, bizarrely spelled because of errors in the scanning, might go unnoticed. In an attempt to meet this challenge I have manually added to the topic the different versions of the words that the OCR errors have given rise to (for instance, for the word «kjærlighed» [love]: «kjaerlighed», «kjcerlighed», «kjcrrlighed»). In that case, when we run the topic model, we cannot require a one hundred percent topic match, perhaps not even 2/3, as all these OCR errors of the same word are highly unlikely to occur in all potential matches(5). Such extensions of the topics thus condition our parameterization of the algorithm: the required value of the Jaccard similarity for a passage to be captured has to be revised fairly far down. The inconvenience of this approach, however, is the possibly high number of captured passages that are exaggeratedly (for our purpose) saturated with the semantic unit in question. Furthermore, if we add to this the different versions of a lexeme and its semantic relatives that in some cases are included in the topic, such as «kvinde», «kvinder», «kvindelig», «kvindelighed» [woman, women, feminine, femininity], the topic in question might catch an even larger number of passages with a density of this specific semantic unit and its variations that is not proportional to the overall variety of the topic in question.

This takes us back to the question of what we program the “trawl line” to “require” in order for a passage in the target corpus to qualify as a hit, and also of how the scores are ranked. How many of the words in the topic do we require, and to what extent do several occurrences of one of the topic’s words (for instance, five occurrences of “woman” in one paragraph) interest us? The parameter can be set to rank scores as a function of the occurrences of the different words forming the topic, meaning that the score for a topic in a captured passage is proportional to the heterogeneity of the occurrences of the topic’s words, not only their quantity. However, in some cases we might, as mentioned, have a topic comprehending several forms of the same lexeme and its semantic relatives and, as described, several versions of the same word due to OCR errors. How can the topic model be programmed to take such occurrences into account in the search for matching passages? In order to meet this challenge, a «hyperlexeme-sensitive» algorithm has been created(6). This means that the topic model is parameterized to count the lexeme frequency in a passage. It will also rank the scores as a function of the occurrences of the hyperlexeme, and not treat occurrences of different forms of one lexeme equally to those of more semantically heterogeneous word units in the topic. Furthermore, and this is the point to be stressed, this algorithm is programmed to treat misspellings of words due to OCR errors as if they were different versions of the same hyperlexeme.
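A minimal sketch of a hyperlexeme-sensitive count, grouping inflected forms and known OCR misspellings under one head word; the variant lists combine examples from the abstract with invented ones and are not the project's full mapping:

```python
# Sketch: count occurrences of a "hyperlexeme", i.e. a head word together
# with its inflected forms and known OCR misspellings.
from collections import Counter

HYPERLEXEMES = {
    "kjærlighed": {"kjærlighed", "kjaerlighed", "kjcerlighed", "kjcrrlighed"},
    "kvinde": {"kvinde", "kvinder", "kvindelig", "kvindelighed"},
}

def hyperlexeme_counts(passage_tokens):
    counts = Counter()
    for token in passage_tokens:
        for head, variants in HYPERLEXEMES.items():
            if token in variants:
                counts[head] += 1
    return counts

tokens = "kvinder og kjcerlighed i hjemmet kvindelighed og kjaerlighed".split()
print(hyperlexeme_counts(tokens))  # Counter({'kvinde': 2, 'kjærlighed': 2})
```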

The adjustments of the value of the Jaccard similarity and the hyperlexeme parameterization are thus measures undertaken in order to compensate for the mentioned inconveniences and to improve and refine the topic model. I will show examples comparing results from before and after these parameters were used, in order to discuss how much closer we have come to being able to establish actual links between the sub-corpus and the passages the topics have captured in the target corpus. All the technical concepts will be defined and briefly explained as I get to them in the presentation. The genesis of these measures, tools and ideas at crucial moments in the process, taking place as a result of unexpected findings and interdisciplinary collaboration, will be elaborated on in my presentation, as well as the potential this might offer for new research.

Notes:

(1) My description of the STM process, with the use of tropes such as «trawl line» is inspired by Peter Leonard and Timothy R. Tangherlini (2013): “Trawling in the Sea of the Great Unread: Sub-corpus topic modeling and Humanities research” in Poetics. 41, 725-749

(2) The Jaccard index is taken into account in the ranging of the scores. The best hit passage for a topic, the one with highest score, will be the one with highest relative similarity to the other captured passages, in terms of concentration of topic words in the passage. The parameterized value of the required Jaccard similarity defines the score a passage must receive in order to be included in the list of captured passages from the «great unread».

(3) Some related challenges were described by Kimmo Kettunen and Teemu Ruokolainen in their presentation, «Tagging Named Entities in 19th century Finnish Newspaper Material with a Variety of Tools» at DHN2017.

(4) Franco Moretti (2000) (drawing on Margaret Cohen) refers to the enormous number of works that exist in the world as «the great unread» (limited to Bokhylla’s content in the context of my project) in: «Conjectures on World Literature» in New Left Review. 1, 54-68.

(5) As an alternative to including in the topic all detected spelling variations of the topic words due to OCR errors, we will experiment with taking into account the Levenshtein distance when programming the «trawl line». In that case it is not identity between a topic word and a word in a passage in the great unread that matters, but the distance between the two words, i.e. the minimum number of single-character edits required to change one word into the other, for instance «kuinde» -> «kvinde» (see the sketch after these notes).

(6) By the term «hyperlexeme» we understand a collection of graphemic occurences of a lexeme, including spelling errors and semantically related forms.
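A minimal sketch of the Levenshtein-based alternative mentioned in note (5): a passage word counts as a match for a topic word if the edit distance between them is small; the distance threshold of 1 is an assumption for illustration:

```python
# Sketch: fuzzy matching of passage words to topic words by edit distance,
# instead of listing every OCR variant explicitly.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def matches_topic(word, topic_words, max_distance=1):
    return any(edit_distance(word, t) <= max_distance for t in topic_words)

print(matches_topic("kuinde", {"kvinde", "kjærlighed"}))  # True: one edit away
```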