Conference Agenda

Overview and details of the sessions of this conference.

Session Overview
Date: Wednesday, 07/Mar/2018
10:00am - 12:30pm Registration
Lobby, Porthania, Yliopistonkatu 3
12:30pm - 2:00pm Lunch
Main Building of the University, entrance from the Senate Square side
2:00pm - 2:15pm Welcome
2:15pm - 3:30pm Plenary 1: Alan Liu
Session Chair: Mikko Tolonen
Open and Reproducible Workflows for the Digital Humanities – A 10,000 Meter Elevation View
3:30pm - 4:00pm Coffee break
Lobby, Porthania
4:00pm - 5:30pm W-PI-1: New Media
Session Chair: Bente Maegaard
4:00pm - 4:30pm
Long Paper (20+10min) [publication ready]

Skin Tone Emoji and Sentiment on Twitter

Steven Coats

University of Oulu

In 2015, the Unicode Consortium introduced five skin tone emoji that can be used in combination with emoji representing human figures and body parts. In this study, use of the skin tone emoji is analyzed geographically in a large sample of data from Twitter. It can be shown that values for the skin tone emoji by country correspond approximately to the skin tone of the resident populations, and that a negative correlation exists between tweet sentiment and darker skin tone at the global level. In an era of large-scale migrations and continued sensitivity to questions of skin color and race, understanding how new language elements such as skin tone emoji are used can help frame our understanding of how people represent themselves and others in terms of a salient personal appearance attribute.
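The Unicode mechanism underlying this study can be illustrated directly: a skin tone (Fitzpatrick) modifier, U+1F3FB through U+1F3FF, placed immediately after a human-figure emoji yields a tone-modified emoji sequence. A minimal Python sketch (the base emoji chosen here is an arbitrary example, not one drawn from the paper's data):

```python
# Unicode 8.0 (2015) defines five skin tone modifiers, U+1F3FB..U+1F3FF,
# corresponding to Fitzpatrick types 1-2 through 6. Appending one directly
# after a human-figure emoji produces an emoji modifier sequence.
MODIFIERS = {
    "light": "\U0001F3FB",
    "medium-light": "\U0001F3FC",
    "medium": "\U0001F3FD",
    "medium-dark": "\U0001F3FE",
    "dark": "\U0001F3FF",
}

def apply_tone(base: str, tone: str) -> str:
    """Return the emoji modifier sequence: base emoji + skin tone modifier."""
    return base + MODIFIERS[tone]

wave = "\U0001F44B"  # WAVING HAND SIGN
toned = apply_tone(wave, "medium")
# Two code points, but rendered as a single tone-modified glyph.
print(len(toned), [hex(ord(c)) for c in toned])
```

A corpus study like the one described can count such two-code-point sequences to estimate per-country modifier usage.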

Coats-Skin Tone Emoji and Sentiment on Twitter-238_a.pdf
Coats-Skin Tone Emoji and Sentiment on Twitter-238_c.pdf

4:30pm - 4:45pm
Distinguished Short Paper (10+5min) [abstract]

“Memes” as a Cultural Software in the Context of the (Fake) Wall between the US and Mexico

Martin Camps

University of the Pacific

Memes function as “digital graffiti” in the streets of social media, a cultural electronic product that satirizes those in power. The success of a meme is measured by “going viral,” reproducing like a germ. I am interested in analyzing these ekphrastic texts in the context of the construction of the wall between the US and Mexico. I examine popular memes in Mexico and in the US from both sides of the political spectrum. I believe these “political haikus” work as an escape valve for the tensions generated in the culture wars that consume American politics. The border is an “open wound” (as Mexican writer Carlos Fuentes said) that was opened after the War of 1847, which resulted in Mexico losing half of its territory. Now the wall functions as a political membrane to keep the “expelled citizens” of the Global South out of the economic benefits of the North. Memes help to expunge the gravity of a two-thousand-mile concrete wall in a region that shares cultural traits, languages, and an environment that cannot be domesticated with monuments to hatred. Memes are rhetorical devices that convey the absurdity of a situation, as in one example of a border wall with an enormous “piñata” that infantilizes the State-funded project of a fence. The meme’s iconoclastography sets in motion a discussion of the real issues at hand: global economic disparities and the human right to migrate on this small planet of ours.

Camps-“Memes” as a Cultural Software in the Context-168_a.pdf
Camps-“Memes” as a Cultural Software in the Context-168_c.pdf

4:45pm - 5:00pm
Short Paper (10+5min) [publication ready]

A Mixed Methods Analysis of Local Facebook Groups in Helsinki

Matti Autio


In Helsinki, the largest city of Finland, local Facebook groups have become increasingly popular. Communities in Helsinki have developed a virtual existence as a part of everyday life. More than 50 discussion groups exist that are tied to a certain residential district. Local flea market groups are even more common. The membership of local Facebook groups totals at least 300 000 in a city of little more than half a million. The content of the discussion groups was studied using a mixed methods approach. The qualitative results give a typology of local Facebook groups and an insight into the recurring topics of posts. The quantitative study reveals significant differences in the amount of local social control exerted through the Facebook groups. The discussion groups are used for social control more prominently in areas with a lot of detached housing. In high-rise districts the new networks are used mainly for socially cohesive cooperation. The social cohesion and control of local Facebook groups strengthen the community’s collective efficacy.

Autio-A Mixed Methods Analysis of Local Facebook Groups-257_a.pdf

5:00pm - 5:15pm
Short Paper (10+5min) [publication ready]

Medicine Radar – Discovering How People Discuss Their Health

Krista Lagus1, Minna Ruckenstein2, Atte Juvonen3, Chang Rajani3

1Faculty of Social Sciences, University of Helsinki, Finland; 2Consumer Society Research Centre, University of Helsinki, Finland; 3Futurice Oy

In order to understand how people discuss their health and their use of medicines online, we studied health discussions on Suomi24, a large Finnish online discussion forum. Analyzing 19 million comments containing 200 million words, we derived a concept vocabulary of medicines and symptoms from the colloquial discussions, utilizing a mixture of unsupervised learning, human input and linguistic analyses. We present the method and a tool for browsing the health discussions and the relations between medicines, symptoms and dosages.

Lagus-Medicine Radar – Discovering How People Discuss Their Health-273_a.pdf

5:15pm - 5:30pm
Short Paper (10+5min) [publication ready]

(Re)Branching Narrativity: Virtual Space Experience in Twitch

Ilgin Kizilgunesler

University of Manitoba, Canada

Twitch, as an online platform for gamers, has been analyzed in terms of its commercial benefits for the increase of game sales and its role in bringing fame to streamers. By focusing on Twitch’s interactive capacity, this paper compares the platform to narrative games, playable stories, and mobile narratives in terms of the role of the user(s) and their virtual space experience. Drawing on theories by Marie-Laure Ryan in “From Narrative Games to Playable Stories: Toward a Poetics of Interactive Narrative” and Rita Raley in “Walk This Way: Mobile Narrative as Composed Experience”, the paper argues that Twitch assigns authorial roles to the users (i.e., the streamers and the subscribers), who branch the existing narrative of the game by determining the path of the setting collectively. In doing so, the paper proposes Twitch as a space that extends the immersion discussed around such interactive forms (i.e., narrative games, playable stories, and mobile narratives).

Kizilgunesler-(Re)Branching Narrativity-149_a.pdf
Kizilgunesler-(Re)Branching Narrativity-149_c.pdf
4:00pm - 5:30pm W-PII-1: Historical Texts
Session Chair: Asko Nivala
4:00pm - 4:30pm
Long Paper (20+10min) [abstract]

Diplomatarium Fennicum and the digital research infrastructures for medieval studies

Seppo Eskola, Lauri Leinonen

National Archives of Finland

Digital infrastructures for medieval studies have advanced in great strides in Finland over the last few years. Most literary sources concerning medieval Finland – the Diocese of Åbo – are now available online in one form or another: Diplomatarium Fennicum encompasses nearly 7 000 documentary sources, the Codices Fennici project recently digitized over 200 mostly well-preserved pre-17th-century codices and placed them online, and Fragmenta Membranea contains digital images of 9 300 manuscript leaves belonging to over 1 500 fragmentary manuscripts. In terms of the availability of sources, the preconditions for research have never been better. So, what’s next?

This presentation discusses the current state of digital infrastructures for medieval studies and their future possibilities. For the past two and a half years the presenters have been working on the Diplomatarium Fennicum web service, published in November 2017, and the topic is approached from this background. Digital infrastructures are being developed on many fronts in Finland: several memory institutions are actively engaged (the three above-mentioned web services are developed and hosted by the National Archives, the Finnish Literature Society, and the National Library respectively) and many universities have active medieval studies programs with an interest in digital humanities. Furthermore, interest in Finnish digital infrastructures is not restricted to Finland, as Finnish sources are closely linked to those of other Nordic countries and the Baltic Sea region in general. In our presentation, we will compare the different Finnish projects, highlight opportunities for international co-operation, and discuss choices (e.g. selecting metadata models) that could best support collaboration between different services and projects.

Eskola-Diplomatarium Fennicum and the digital research infrastructures-251_a.pdf
Eskola-Diplomatarium Fennicum and the digital research infrastructures-251_c.pdf

4:30pm - 4:45pm
Short Paper (10+5min) [publication ready]

The HistCorp Collection of Historical Corpora and Resources

Eva Pettersson, Beáta Megyesi

Uppsala University

We present the HistCorp collection, a freely available open platform aiming at the distribution of a wide range of historical corpora and other useful resources and tools for researchers and scholars interested in the study of historical texts. The platform contains a monitoring corpus of historical texts from various time periods and genres for 14 European languages. The collection is taken from well-documented historical corpora, and distributed in a uniform, standardised format. The texts are downloadable as plaintext, and in a tokenised format. Furthermore, some texts are normalised with regard to spelling, and some are annotated with part-of-speech and syntactic structure. In addition, preconfigured language models and spelling normalisation tools are provided to allow the study of historical languages.

Pettersson-The HistCorp Collection of Historical Corpora and Resources-135_a.pdf
Pettersson-The HistCorp Collection of Historical Corpora and Resources-135_c.pdf

4:45pm - 5:00pm
Short Paper (10+5min) [publication ready]

Semantic National Biography of Finland

Eero Hyvönen1,2, Petri Leskinen1, Minna Tamper1,2, Jouni Tuominen2,1, Kirsi Keravuori3

1Aalto University; 2University of Helsinki (HELDIG); 3Finnish Literature Society (SKS)

This paper presents the idea and project of transforming and using the textual biographies of the National Biography of Finland, published by the Finnish Literature Society, as Linked (Open) Data. The idea is to publish the lives as semantic, i.e., machine-“understandable” metadata in a SPARQL endpoint in the Linked Data Finland service, on top of which various Digital Humanities applications are built. The applications include searching and studying individual personal histories as well as historical research on groups of persons using methods of prosopography. The basic biographical data is enriched by extracting events from unstructured texts and by linking entities internally and to external data sources. A faceted semantic search engine is provided for filtering groups of people from the data for research in Digital Humanities. An extension of the event-based CIDOC CRM ontology is used as the underlying data model, where lives are seen as chains of interlinked events populated from the data of the biographies and additional sources, such as museum collections, library databases, and archives.

Hyvönen-Semantic National Biography of Finland-203_a.pdf
Hyvönen-Semantic National Biography of Finland-203_c.pdf

5:00pm - 5:15pm
Short Paper (10+5min) [abstract]

Creating a corpus of communal court minute books: a challenge for digital humanities

Maarja-Liisa Pilvik1, Gerth Jaanimäe1, Liina Lindström1, Kadri Muischnek1, Kersti Lust2

1University of Tartu, Estonia; 2The National Archives of Estonia, Estonia

This paper presents the work of a digital humanities project concerned with the digitization of Estonian communal court minute books. The local communal courts in Estonia came into being through the peasant laws of the early 19th century and were first-instance, class-specific courts that tried peasants. Rather than being merely judicial institutions, the communal courts were at first institutions for the self-government of peasants, since they also dealt with police and administrative matters. After the municipal reform of 1866, however, the communal courts were emancipated from noble tutelage and the court became a strictly judicial institution that tried peasants for their minor offences and solved their civil disputes, claims and family matters. The communal courts in their earlier form ceased to exist in 1918, when Estonia became independent from Russian rule.

The National Archives of Estonia holds almost 400 archives of communal courts from the pre-independence period. They have been preserved very unevenly and not all of them include minute books. The minute books themselves are also written in an inconsistent manner: the earlier minute books are often written in German, and the writing is strongly dependent on the skills and will of the parish clerk. However, the materials from the period starting with the year 1866, when the creation of the minute books became more systematic, are a massive and rich source shedding light on the everyday lives of the peasantry. Still, at the moment, users of the minute books face serious difficulties in finding relevant information, since there are no indexes and one has to go through all the materials manually. The minute books are also a fascinating resource for linguists, both dialectologists and computational linguists: the books contain regional varieties tied to a specific genre and early time period (making it possible to detect linguistic expressions which are rare in atlases, for example, and also in the dialect corpus, which represents language from about 100 years later) while also being a written resource reflecting the writing traditions of the old spelling system. This is also what makes these texts complex and challenging for automatic analysis methods, which are otherwise quite well established in contemporary corpus linguistics.

In our talk we present a project dealing with the digitization and analysis of the minute books from the period between 1866 and 1890. The texts were first digitized in the 2000s and preserved on a server in HTML format, which is good for viewing but not as good for automatic processing. After the server crashed, the texts were rescued via web archives, and the structure of the minute books was used to convert the documents automatically into a more functional format using XML markup, separating the body text with tags referring to information about the titles, dates, indexes, participants, content and topical keywords, which indicate the purview of the communal courts in that period.

We discuss the workflow of creating a digital resource in a standardized and maximally functional format, as well as challenges such as automatic text processing for cleaning and annotating the corpus in order to distinguish the relevant layers of information. In order to enable queries with different degrees of specificity in the corpus, the texts also need to be linguistically analyzed. For both named entity recognition (NER), which enables network analysis and links the events described in the materials to geospatial locations, and morphological annotation, which makes it possible to perform queries based on lemmas or grammatical information, we have applied the Estnltk library in Python, which was developed for contemporary written standard Estonian. For NER, its performance was satisfactory, i.e. it recognized names well, even though it systematically over-recognized organization names. The most complicated issue so far is the morphological analysis and disambiguation of word forms. Tools developed for Estonian morphological analysis, such as Estnltk or Vabamorf, are trained on contemporary written standard Estonian. Communal court minute books, however, include language variants which are a mixture of dialectal language, inconsistent spelling and the old spelling system. In the presentation, we introduce the results of our first attempts to apply Estnltk tools to the materials of communal court minute books and the problems that we have run into, and provide solutions for overcoming these problems.
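As an illustration of the old-spelling problem, a pre-normalization step could map old-orthography forms toward contemporary spelling before running a morphological analyzer. The sketch below is not the project's actual pipeline, and its rules are deliberately simplified illustrations (the old Estonian orthography did, for instance, write ⟨w⟩ for ⟨v⟩):

```python
# Hypothetical pre-normalization for old Estonian orthography ("vana kirjaviis").
# The rule list is an illustrative simplification, not the project's real rules.
RULES = [
    ("w", "v"),   # old orthography used <w> for the /v/ sound
    ("W", "V"),
]

def normalize(token: str) -> str:
    """Map an old-orthography token toward contemporary spelling."""
    for old, new in RULES:
        token = token.replace(old, new)
    return token

def normalize_text(text: str) -> str:
    return " ".join(normalize(t) for t in text.split())

# A contemporary analyzer (e.g. Estnltk) could then be run on the output.
print(normalize_text("wald mõistis kohut"))
```

A real system would need many more rules (or a trained normalization model) to handle inconsistent spelling and dialectal variation, but the shape of the step is the same.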

The final aim of the project is to create a multifunctional source which could be of interest for researchers of different fields within the humanities. As the National Archives hold a considerable number of communal court minute books which are thus far only in scanned form, the digitized minute book collection is planned to be expanded using crowdsourcing opportunities.


Estnltk. Open source tools for Estonian natural language processing.

Vabamorf. Eesti keele morfanalüsaator [‘The morphological analyzer of Estonian’].

Pilvik-Creating a corpus of communal court minute books-247_a.pdf

5:15pm - 5:30pm
Distinguished Short Paper (10+5min) [publication ready]

FSvReader – Exploring Old Swedish Cultural Heritage Texts

Yvonne Adesam, Malin Ahlberg, Gerlof Bouma

University of Gothenburg

This paper describes FSvReader, a tool for easier access to Old Swedish (13th–16th century) texts. Through automatic fuzzy linking of words in a text to a dictionary describing the language of the time, the reader has direct access to dictionary pop-up definitions, in spite of the large amount of graphical and morphological variation. The linked dictionary entries can also be used for simple searches in the text, highlighting possible further instances of the same entry.
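Fuzzy linking of graphically varying word forms to dictionary entries can be approximated with generic approximate string matching. The sketch below is not FSvReader's actual algorithm, and the dictionary entries are invented for illustration; it uses Python's standard difflib:

```python
import difflib

# Toy dictionary of headwords with glosses; in FSvReader's setting this would
# be a dictionary of the Old Swedish period (entries invented for illustration).
HEADWORDS = {
    "konunger": "king",
    "biskoper": "bishop",
    "kirkia": "church",
}

def link(token: str, cutoff: float = 0.6):
    """Return (headword, gloss) for the closest dictionary entry, or None."""
    matches = difflib.get_close_matches(token, HEADWORDS, n=1, cutoff=cutoff)
    if not matches:
        return None
    head = matches[0]
    return head, HEADWORDS[head]

# Graphical variation: "konungr" is fuzzily linked to the entry "konunger",
# so a pop-up definition could be shown despite the spelling difference.
print(link("konungr"))
```

The cutoff controls the precision/recall trade-off: a lower value links more variant spellings but risks linking unrelated words.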

Adesam-FSvReader – Exploring Old Swedish Cultural Heritage Texts-199_a.pdf
4:00pm - 5:30pm W-PIII-1: Computational Linguistics 1
Session Chair: Lars Borin
4:00pm - 4:30pm
Long Paper (20+10min) [abstract]

Dialects of Discord. Using word embeddings to analyze preferred vocabularies in a political debate: nuclear weapons in the Netherlands 1970-1990

Ralf Futselaar, Milan van Lange

NIOD, Institute for War-, Holocaust-, and Genocide Studies

We analyze the debate about the placement of nuclear-enabled cruise missiles in the Netherlands during the 1970s and 1980s. The NATO “double-track decision” of 1979 envisioned the placement of these weapons in the Netherlands, to which the Dutch government eventually agreed in 1985. In the early 1980s, the controversy regarding placement or non-placement of these missiles led to the greatest popular protests in Dutch history and to a long and often bitter political controversy. After 1985, due to declining tensions between the Soviet Bloc and NATO, the cruise missiles were never stationed in the Netherlands. Much older nuclear warheads, in the country since the early 1960s, remain there to this day.

We are using word embeddings to analyze this particularly bipolar debate in the proceedings of the Dutch lower and upper house of Parliament. The official political positions, as expressed in party manifestos and voting behavior inside parliament, were stable throughout this period. We demonstrate that in spite of this apparent stability, the vocabularies used by representatives of different political parties changed significantly through time.

Using the word2vec algorithm, we have created a combined vector including all synonyms and near-synonyms of “nuclear weapon” used in the proceedings of both houses of parliament during the period under scrutiny. Based on this combined vector, and again using word2vec, we have identified the nearest neighbors of words used to describe nuclear weapons. These terms have been manually classified, insofar as relevant, into terms associated with a pro-proliferation or anti-proliferation viewpoint, for example “defense” and “disarmament” respectively.

Obviously, representatives of all Dutch political parties used words from both categories in parliamentary debates. We demonstrate, however, that at any given time different political parties had clear preferences in terms of vocabulary. In the “discursive space” created by the binary opposition between pro- and contra-proliferation words, political parties can be shown to have specific and distinct ways of discussing nuclear weapons.

Using this framework, we have analyzed the changing vocabularies of different political parties. This allows us to show that, while stated policy positions and voting behavior remained unchanged, the language used to discuss nuclear weapons shifted strongly towards anti-proliferation terminology. We have also been able to show that this change happened at different times for different political parties. We speculate that these changes resulted from perceived changes of opinion among the target electorates of different parties, as well as the changing geopolitical climate of the mid-to-late 1980s, where nuclear non-proliferation became a more widely shared policy objective.

In the conclusion of this paper, we show that word embedding models offer a methodology for investigating shifting political attitudes outside of, and in addition to, stated opinions and voting patterns.
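The combined-vector step described above (averaging the vectors of near-synonyms of “nuclear weapon”, then ranking the vocabulary by cosine similarity to that average) can be sketched independently of a trained model. The vectors and Dutch terms below are toy stand-ins, not the paper's word2vec embeddings:

```python
import math

# Toy 3-dimensional "embeddings"; the real study uses word2vec vectors
# trained on the parliamentary proceedings.
VEC = {
    "kernwapen":   [0.9, 0.1, 0.0],   # "nuclear weapon"
    "atoomwapen":  [0.8, 0.2, 0.1],   # near-synonym
    "ontwapening": [0.7, 0.3, 0.2],   # "disarmament"
    "defensie":    [0.5, 0.5, 0.4],   # "defense"
    "landbouw":    [0.0, 0.9, 0.8],   # "agriculture" (unrelated)
}

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(x) / len(vectors) for x in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Combined vector for the concept "nuclear weapon" from its (near-)synonyms.
combined = mean([VEC["kernwapen"], VEC["atoomwapen"]])

# Rank the remaining vocabulary by cosine similarity to the combined vector.
others = [w for w in VEC if w not in ("kernwapen", "atoomwapen")]
ranked = sorted(others, key=lambda w: cosine(combined, VEC[w]), reverse=True)
print(ranked)  # nearest neighbors first
```

The ranked neighbors are what the authors then classify manually as pro- or anti-proliferation vocabulary.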

Futselaar-Dialects of Discord Using word embeddings to analyze preferred vocabularies-112_a.pdf
Futselaar-Dialects of Discord Using word embeddings to analyze preferred vocabularies-112_c.pdf

4:30pm - 4:45pm
Distinguished Short Paper (10+5min) [publication ready]

Emerging Language Spaces Learned From Massively Multilingual Corpora

Jörg Tiedemann

University of Helsinki

Translations capture important information about languages that can be used as implicit supervision in learning linguistic properties and semantic representations. Translated texts are semantic mirrors of the original text, and the significant variation that we can observe across languages can be used to disambiguate the meaning of a given expression using the linguistic signal that is grounded in translation. Parallel corpora consisting of massive amounts of human translations with large linguistic variation can be used to increase abstraction, and we propose the use of highly multilingual machine translation models to find language-independent meaning representations. Our initial experiments show that neural machine translation models can indeed learn in such a setup, and we can show that the learning algorithm picks up information about the relation between languages in order to optimize transfer learning with shared parameters. The model creates a continuous language space that represents relationships in terms of geometric distances, which we can visualize to illustrate how languages cluster according to language families and groups. With this, we can see a development in the direction of data-driven typology, a promising approach to empirical cross-linguistic research in the future.

Tiedemann-Emerging Language Spaces Learned From Massively Multilingual Corpora-176_a.pdf
Tiedemann-Emerging Language Spaces Learned From Massively Multilingual Corpora-176_c.pdf

4:45pm - 5:15pm
Long Paper (20+10min) [publication ready]

Digital cultural heritage and revitalization of endangered Finno-Ugric languages

Anisia Katinskaia, Roman Yangarber

University of Helsinki, Department of Computer Science

Preservation of linguistic diversity has long been recognized as a crucial, integral part of supporting our cultural heritage. Yet many “minority” languages – lacking state official status – are in decline, many severely endangered. We present a prototype system aimed at “heritage” speakers of endangered Finno-Ugric languages. Heritage speakers are people who have heard the language used by the older generations while they were growing up, and possess a considerable passive competency (well beyond the “beginner” level), but are lacking in active fluency. Our system is based on natural language processing and artificial intelligence. It assists the learners by allowing them to use arbitrary texts of their choice, and by creating exercises that require them to engage in active production of language, rather than in passive memorization of material. Continuous automatic assessment helps guide the learner toward improved fluency. We believe that providing such AI-based tools will help bring these languages to the forefront of the modern digital age, raise prestige, and encourage the younger generations to become involved in reversing the decline.

Katinskaia-Digital cultural heritage and revitalization of endangered Finno-Ugric languages-228_a.pdf

5:15pm - 5:30pm
Short Paper (10+5min) [publication ready]

The Fractal Structure of Language: Digital Automatic Phonetic Analysis

William A Kretzschmar Jr

University of Georgia

In previous studies of the Linguistic Atlas data from the Middle and South Atlantic States (e.g. Kretzschmar 2009, 2015), it has been shown that the frequency profiles of variant lexical responses to the same cue are all patterned in nonlinear A-curves. Moreover, these frequency profiles are scale-free, in that the same A-curve patterns occur at every level of scale. In this paper, I will present results from a new study of Southern American English that, when completed, will include over one million vowel measurements from interviews with a sample of sixty-four speakers across the South. Our digital methods, an adaptation of the DARLA and FAVE tools for forced alignment and automatic formant extraction, show that speech outside of the laboratory or controlled settings can be processed by automatic means on a large scale. Measurements in F1/F2 space are analyzed using point-pattern analysis, a technique for spatial data, which allows for the creation and comparison of results without assumptions of central tendency. This Big Data resource allows us to see the fractal structure of language more completely. Not only do A-curve patterns describe the frequency profiles of lexical and IPA tokens, but they also describe the distribution of measurements of vowels in F1/F2 space, for groups of speakers, for individual speakers, and even for separate environments in which vowels occur. These findings are highly significant for how linguists make generalizations about phonetic data. They challenge the boundaries that linguists have traditionally drawn, whether geographic, social, or phonological, and demand that we use a new model for understanding language variation.
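An A-curve frequency profile of the kind described here amounts to ranking variant responses by descending frequency. A minimal sketch with invented counts (the variants echo classic Atlas lexical items, but the numbers are illustrative only):

```python
from collections import Counter

# Invented variant responses to a single Atlas cue, for illustration only.
responses = (["pail"] * 50 + ["bucket"] * 20 + ["piggin"] * 8
             + ["can"] * 3 + ["kettle"] * 1)

counts = Counter(responses)
# Rank variants by descending frequency: a few top-ranked variants account
# for most tokens, and frequency falls off nonlinearly (the "A-curve").
profile = counts.most_common()
print(profile)

total = sum(counts.values())
top_share = profile[0][1] / total
print(f"top variant covers {top_share:.0%} of tokens")
```

The scale-free claim is that the same curve shape appears whether the counts come from one speaker, one region, or the whole survey.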

Kretzschmar Jr-The Fractal Structure of Language-108_a.pdf
Kretzschmar Jr-The Fractal Structure of Language-108_c.pdf
4:00pm - 5:30pm W-PIV-1: Infrastructure and Support
Session Chair: Tanja Säily
4:00pm - 4:30pm
Long Paper (20+10min) [publication ready]

Towards an Open Science Infrastructure for the Digital Humanities: The Case of CLARIN

Koenraad De Smedt1, Franciska De Jong2, Bente Maegaard3, Darja Fišer4, Dieter Van Uytvanck2

1University of Bergen, Norway; 2CLARIN ERIC, The Netherlands; 3University of Copenhagen, Denmark; 4University of Ljubljana and Jožef Stefan Institute, Slovenia

CLARIN is the European research infrastructure for language resources. It is a sustainable home for digital research data in the humanities and it also offers tools and services for annotation, analysis and modeling. The scope and structure of CLARIN enable a wide range of studies and approaches, including comparative studies across regions, periods, languages and cultures. CLARIN does not see itself as a stand-alone facility, but rather as a player in making the vision underlying the emerging European policies towards Open Science a reality, interconnecting researchers across national and discipline borders by offering seamless access to data and services in line with the FAIR data principles. CLARIN also aims to contribute to responsible data science through the design as well as the governance of its infrastructure, and to achieve an appropriate and transparent division of responsibilities between data providers, technical centres, and end users. CLARIN offers training towards digital scholarship for humanities scholars and aims at increased uptake from this audience.

De Smedt-Towards an Open Science Infrastructure for the Digital Humanities-249_a.pdf
De Smedt-Towards an Open Science Infrastructure for the Digital Humanities-249_c.pdf

4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

The big challenge of data! Managing digital resources and infrastructures for digital humanities researchers

Isto Huvila

Uppsala University

Digital humanities research is dependent on the development and seizing of appropriate digital methods and technologies, the collection and digitisation of data, and the development of relevant and practicable research questions. In the long run, the potential of the field to sustain itself as a significant social and intellectual movement (or, in Kuhnian terms, a paradigm) is, however, conditional on the sustainability of the scholarly practices in the field. Digital humanities research has already moved from early methodological experiments to the systematic development of research infrastructures. These efforts are based both on the explicit need to develop new resources for digital humanities research and on the strategic initiatives of the keepers of relevant existing collections and datasets to open up their holdings for users. Harmonisation and interoperability of the evolving infrastructures are at different stages of development both nationally and internationally, but in spite of the large number of practical difficulties, the various national, European (e.g. DARIAH, CLARIN and ARIADNE) and international initiatives are making progress in this respect. The sustainability of digital infrastructures is another issue that has been scrutinised and addressed both in theory and practice under the auspices of national data archives, specialist organisations like the British Digital Curation Centre and international discussions, for instance within the iPRES conference community. However, an aspect of the management of the infrastructures that has received relatively little attention so far is management for use. We lack a comprehensive understanding of how the emerging digital data and infrastructures are used, could be used and, consequently, how the emanating resources should be managed to be useful for digital humanities research, not only in the context within which they were developed but also for other researchers and, in many cases, users outside of academia.

This paper discusses the processes and competences for the management of digital humanities resources and infrastructures for (theoretically) maximising their current and future usefulness for the purposes of research. On the basis of empirical work on archaeological research data in the context of the Swedish Archaeological Information in the Digital Society (ARKDIS) research project (Huvila, 2014) and a comparative study of selected digital infrastructures in other branches of humanities research, a model of use-oriented management of research data with central processes and competences is presented. The suggested approach complements existing digital curation and management models by opening up the user-side processes of digital humanities data resources and their implications for the functioning, development and management of appropriate research infrastructures. Theoretically, the approach draws on the records continuum theory (as formulated by Upward and colleagues (e.g. Upward, 1996, 1997, 2000; McKemmish, 2001)) and Pickering’s notion of the mangle of practice (Pickering, 1995) developed in the context of the social studies of science. The model demonstrates the significance of being sensitive to the explicit wants and needs of the researchers (users) but also to the implicit, often tacit requirements that emerge from their practical research work. Simultaneously, the findings emphasise the need for a meta-competence to manage the data and provide appropriate services for its users.


Huvila, I. (Ed.) (2014). Perspectives to Archaeological Information in the Digital Society. Uppsala: Department of ALM, Uppsala University.


McKemmish, S. (2001). Placing Records Continuum Theory and Practice. Archival Science, 1(4), 333–359.


Pickering, A. (1995). The Mangle of Practice: Time, Agency, and Science. Chicago: University of Chicago Press.

Upward, F. (1996). Structuring the Records Continuum Part One: Postcustodial Principles and Properties. Archives and Manuscripts, 24(2), 268– 285.

Upward, F. (1997). Structuring the Records Continuum, Part Two: Structuration Theory and Recordkeeping. Archives and Manuscripts, 25(1), 10–35.

Upward, F. (2000). Modelling the continuum as paradigm shift in recordkeeping and archiving processes, and beyond–a personal reflection. Records Management Journal, 10(3), 115–139.

Huvila-The big challenge of data! Managing digital resources and infrastructures-104_a.pdf
Huvila-The big challenge of data! Managing digital resources and infrastructures-104_c.pdf

4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

Research in Nordic literary collections: What is possible and what is relevant?

Mads Rosendahl Thomsen1, Kristoffer Laigaard Nielbo2, Mats Malm3

1Aarhus University; 2University of Southern Denmark; 3University of Gothenburg

There are a growing number of digital literary collections in the Nordic countries that make the literary heritage accessible and have great potential for research that takes advantage of machine-readable texts. These collections range from very large collections, such as the Norwegian Bokhylla, through medium-sized collections, such as the Swedish Litteraturbanken and the Danish Arkiv for Dansk Litteratur, to one-author collections, e.g. the collected works of N.F.S. Grundtvig. In this presentation we will discuss some of the obstacles to a more widespread use of these collections by literary scholars and present outcomes of a series of seminars – UCLA 2015, Aarhus 2016, UCLA 2017 – sponsored by the Fondation Maison des sciences de l’homme courtesy of a grant from the Andrew W. Mellon Foundation.

We find that there are two important thresholds in the use of collections:

1) The technical obstacles to collecting the right corpora and applying the appropriate tools for analysis are too high for the majority of researchers in literary studies. While much has been done to improve access to works, differences in formats and metadata make it difficult to work across the collections. Our project has addressed this issue by creating a Nordic GitHub repository for literary texts, CLEAR, which provides cleaned versions of Nordic literary works, as well as a suite of tools in Python.

2) The capacity to combine traditional hermeneutical approaches to literary studies with computational approaches is still in its infancy, despite numerous good studies in recent years, e.g. by the Stanford Literary Lab, Leonard and Tangherlini, and Ted Underwood. In our series of seminars we have worked to bring together scholars with great technical prowess and more traditionally trained literary scholars, in order to generate projects that are both technically feasible and scholarly relevant. Expanding the methodological vocabulary of literary studies is complicated: it requires significant domain expertise to verify the outcomes of computational analyses and, conversely, openness to working with results that cannot be verified by close reading. In this presentation we will show how thematic variation and readability can provide new perspectives on Swedish and Danish modernist literature, and discuss how this relates to more general visions of literary studies in an age of computation (Heise, Thomsen).


Algee-Hewitt, Mark et al. 2016. “Canon/Archive. Large-scale Dynamics in the Literary Field.” Stanford Literary Lab Pamphlet 11.

Heise, Ursula. 2017. “Comparative literature and computational criticism: A conversation with Franco Moretti.” Futures of Comparative Literature: ACLA State of the Discipline Report. London: Routledge.

Leonard, Peter and Timothy R. Tangherlini. 2013. “Trawling in the Sea of the Great Unread: Sub-Corpus Topic Modeling and Humanities Research”. Poetics 41(6): 725-749.

Thomsen, Mads Rosendahl et al. 2015. “No Future without Humanities.” Humanities 1.

Underwood, Ted. 2013. Why Literary Periods Mattered. Stanford: Stanford University Press.

Thomsen-Research in Nordic literary collections-133_a.pdf

5:00pm - 5:30pm
Long Paper (20+10min) [publication ready]

Reassembling the Republic of Letters - A Linked Data Approach

Jouni Tuominen1,2, Eetu Mäkelä1,2, Eero Hyvönen1,2, Arno Bosse3, Miranda Lewis3, Howard Hotson3

1Aalto University, Semantic Computing Research Group (SeCo); 2University of Helsinki, HELDIG – Helsinki Centre for Digital Humanities; 3University of Oxford, Faculty of History

Between 1500 and 1800, a revolution in postal communication allowed ordinary men and women to scatter letters across and beyond Europe. This exchange helped knit together what contemporaries called the respublica litteraria, Republic of Letters, a knowledge-based civil society, crucial to that era’s intellectual breakthroughs, and formative of many modern European values and institutions. To enable effective Digital Humanities research on the epistolary data distributed in different countries and collections, metadata about the letters have been aggregated, harmonised, and provided for the research community through the Early Modern Letters Online (EMLO) service. This paper discusses the idea and benefits of using Linked Data as a basis for the next digital framework of EMLO, and presents experiences of a first demonstrational implementation of such a system.

Tuominen-Reassembling the Republic of Letters-207_a.pdf
Tuominen-Reassembling the Republic of Letters-207_c.pdf
6:00pm - 8:00pmJoint reception with Nordic Challenges conference
Main Building of the University, entrance from the Senate Square side
Date: Thursday, 08/Mar/2018
8:00am - 9:00amBreakfast
Lobby, Porthania
9:00am - 10:30amPlenary 2: Kathryn Eccles
Session Chair: Eero Hyvönen
Finding the Human in Data: What can Digital Humanities learn from digital transformations in cultural heritage?
10:30am - 11:00amCoffee break
Lobby, Porthania
11:00am - 12:30pmT-PII-1: Our Digital World
Session Chair: Leo Lahti
11:00am - 11:15am
Short Paper (10+5min) [publication ready]

The unchallenged persuasions of mobile media technology: The pre-domestication of Google Glass in the Finnish press

Minna Saariketo

Aalto University

In recent years, networked devices have taken an ever tighter hold of people’s everyday lives. Tech companies are frantically competing to grab people’s attention and secure a place in their daily routines. In this short paper, I elaborate further on a key finding from an analysis of Finnish press coverage of Google Glass between 2012 and 2015. The concept of pre-domestication is used to discuss the ways in which we are invited and persuaded by media discourse to integrate ourselves into the carefully orchestrated digital environment. It is shown how the news coverage deprives potential new users of digital technology of a chance to evaluate the underpinnings of the device, its attachments to data harvesting, and its practices of hooking attention. The paper reflects on the implications of contemporary computational imaginaries as (re)produced and circulated in the mainstream media, thereby shedding light on and opening possibilities to criticise the politics of mediated pre-domestication.

Saariketo-The unchallenged persuasions of mobile media technology-262_a.pdf

11:15am - 11:30am
Distinguished Short Paper (10+5min) [publication ready]

Research of Reading Practices and ’the Digital’

Anna Kaisa Kajander

University of Helsinki

Books and reading habits are among the areas of everyday life that have been strongly affected by digitalisation. The subject has been raised repeatedly in public discussion in the Finnish mainstream media, and the typical discourse focuses on e-books and printed books, sometimes still in a manner that juxtaposes the formats. Another aspect of reading that has gained publicity recently concerns the decreasing interest in books in general. The acceptance of e-books and the status of printed books in contemporary reading have raised questions, but it has also been recognised that the recent changes are connected with digitalisation in a wider cultural context. Digitalisation has enabled new forms of reading and related habits that benefit readers and book culture, but it has also affected free-time activities that do not support interest in books.

In this paper, my aim is to discuss the research of books and reading as a socio-cultural practice and to ask whether this field could benefit from co-operation with digital humanities scholars. The idea of combining digital humanities with book research is not new; collaboration has been welcomed especially in research that focuses on new book technologies and the use of digitised historical records, such as bibliographies. However, I would like to open a discussion on how digital humanities could benefit the research of (new) reading practices and the ordinary reader. Defining ‘the digital’ would be essential, as would knowledge of relevant methodologies, tools and data. I will first introduce my ongoing PhD project and present some questions that have arisen during the process. Then, based on these questions, I would like to discuss what kind of co-operation between digital humanities and reading research could help us gain knowledge of the change in book reading, related habits and contemporary readership.

PhD project Life as a Reader

In my ongoing dissertation project, I am focusing on attitudes and expectations towards printed and e-books and on new reading practices. The research material consists of approximately 540 writings that were sent to the Finnish Literature Society in a public collection called “Life as a reader” in 2014. This collection was organised by the Society’s Literary and Traditional Archives in co-operation with the Finnish Bookhistorical Society, and the aim was to focus on reading as a socio-cultural practice. The organisers wanted people to write in their own words about reading memories. They also included questions in the collection call, which addressed, for example, childhood and learning to read, reading as a private or shared practice, places and situations of reading, and experiences of recent changes, such as opinions about e-books and virtual book groups. Book-historical interests were visible in the project, as all of the questions mentioned above have also been prominent in other book history research: an interest in ordinary readers and their everyday lives, the ways readers find and consume texts, and readership in the digital age.

In the dissertation I will focus on the writings, and especially on those writers who liked to read books for pleasure as a free-time activity. The point is to emphasise the readers’ point of view on the recent changes. I argue that if we want to understand attitudes towards reading or the possible futures of reading habits, we need to understand the different practices which readers themselves attach to their readership. The main focus is on attitudes and expectations towards books as objects, but it is equally important to scrutinise other digitally related changes that have affected reading practices. I am analysing these writings with particular attention to the different roles books as objects play in readers’ lives and to attitudes towards digitalisation as a cultural change. The research questions are grounded in my background as an ethnologist interested in material culture studies. I believe the concept of materiality and the study of reading as a sensory experience are important for understanding attitudes towards different book formats, readers’ choices, and wishes for the development of books.

Aspects of readership

The research material turned out to be rich in different viewpoints on reading. As I expected, people wrote about their feelings about the different book formats and their reasons for choosing them. However, during the analysis it became clear that, to answer questions about the meanings of the materialities of books, I also needed knowledge of the different aspects of reading habits that the writers themselves connected to their identities as readers. This meant focusing on writings about practices that reached beyond book formats and reading moments. I am now analysing the ways in which people, for example, searched for and found books, collected them, and discussed literature with other readers. These activities were often connected to social media, digital book stores and libraries, to the value of (e-)books as objects, and to the effects of different formats on these practices. It also became clear that other free-time activities and media use affected the amount of time spent reading, even for those writers who were interested in books and liked to read.

As the material was collected at a time when smartphones and tablets, which are generally considered to have had a significant impact on reading habits, had only quite recently become popular and well-known objects, the writings often focused on the change and on the uncertain future of books. The practices mentioned were connected to concepts such as ownership, visibility and representation. As digital texts had changed the ways these aspects were understood, they also seemed to have caused negative associations with the digitalisation of books, especially among readers who saw the different aspects of print culture as positive parts of their readership. However, there were also friends of printed books who saw digital services as something very positive: as things that supported their reading habits. Writings about, for example, finding books to read or discussing literature with other readers online, writing and publishing book reviews in blogs, or being active on the GoodReads or BookCrossing websites were all seen as welcome aspects of a “new” readership. A small minority of the writers also wrote about fan fiction and electronic literature.

Comparing the time of material collection with the present day, digital book services, such as e-book and audiobook services, have been gaining popularity, but the situation is not radically different from 2014. E-books have perhaps become better known since then, but they are still marginal in comparison with printed books; they have not gained the popularity that was expected in previous years. To my knowledge, the other aspects of new reading practices, such as the meanings of social media or interest in electronic literature, have not yet been studied much in the Finnish context. These observations lead to questions about the possible benefits of digital humanities for book and reading research.

Collaborating with digital humanists?

The changes in books and reading cause worries but also hopes and interest in future reading habits. To gain and produce knowledge about the change, we need to define ‘the digital’ and ‘digitalisation’, which are so often referred to in different contexts without any specific definition. The problem is that they can mean and include various things attached to both the technological and the cultural sides of the phenomenon. For those interested in reading research, it would be important to theorise digitalisation from the perspective of readers and to view the changes in reading from the socio-cultural side: as concrete changes in the material environment and as new possibilities to act as a reader. This field of research would benefit from collaboration with digital humanists who have knowledge of ‘the digital’ and of the possibilities of reading-related software and devices.

Secondly, we could benefit from discussions about the possibilities to collect, use and preserve the data that readers now leave behind as they practise readership in digital environments. Digital book stores, library services and social media sites would be useful sources, but more knowledge is still needed about the nature of these kinds of data: which aspects affect the data, how to obtain the data, which tools to use, and so on. Questions about collecting and preserving data also raise important questions of research ethics that should be further discussed in book research: which data should be open and free to use, who owns the data, and which permissions would be required to study certain websites? Changes in free-time activities in general have also raised questions about data that could be used to compare the time spent on different activities and, on the other hand, on reading habits.

Thirdly, collaboration is needed when reading-related databases are being developed. Some steps have already been taken, for example in the Finnish Reading Experience Database project, but these kinds of projects could be developed further. Again, collecting digital data, but also opening it up and using it for different kinds of research questions, is needed. At its best, multidisciplinary collaboration could help to build new perspectives and research questions about contemporary readership, and therefore all discussion and ideas that could benefit the field of books and reading are welcome.

Kajander-Research of Reading Practices and ’the Digital’-216_a.pdf

11:30am - 11:45am
Short Paper (10+5min) [publication ready]

Exploring Library Loan Data for Modelling the Reading Culture: project LibDat

Mats Neovius1, Kati Launis2, Olli Nurmi3

1Åbo Akademi University; 2University of Eastern Finland; 3VTT Technical Research Centre of Finland

Reading is evidently a part of the cultural heritage. In nourishing it, Finland is exceptional in that it has a unique library system, publicly funded and free of charge, used regularly by 80% of the population. Against this background, the consortium “LibDat: Towards a More Advanced Loaning and Reading Culture and its Information Service” (2017-2021, Academy of Finland) set out to explore the loaning and reading culture and its information service, so that the project’s results would help officials develop Finnish public library services. The project is part of the constantly growing field of Digital Humanities and aims to show how large born-digital material, new computational methods and literary-sociological research questions can be integrated into the study of contemporary literary culture. The project’s collaborator, Vantaa City Library, collects daily loan data. This loan data is objective, crisp, and big. In this position paper, the main contribution is a discussion of the limitations the data poses and of the literary questions on which computational means may shed light. For this, we describe the data structure of a loan event and outline the dimensions along which to interpret the data. Finally, we outline the milestones of the project.

Neovius-Exploring Library Loan Data for Modelling the Reading Culture-208_a.pdf

11:45am - 12:00pm
Short Paper (10+5min) [publication ready]

Virtual Museums and Cultural Heritage: Challenges and Solutions

Nadezhda Povroznik

Perm State National Research University, Center for Digital Humanities

The paper demonstrates the significance of studying virtual museums and defines the term “virtual museum” and its content more precisely. It discusses the problems of existing virtual museums and the complexities they present for the study of cultural heritage, as well as the problems of using virtual museum content in classical research, which are connected with the specificity of virtual museums as information resources. It also demonstrates possible solutions to these problems, surveying the ways in which Cultural Heritage can be used most effectively in humanities research. The study pays attention to the main problems related to the preservation, documentation, representation and use of CH associated with virtual museums. It provides a basis for solving these problems through the subsequent development of an information system for the study of virtual museums and their wider use.

Povroznik-Virtual Museums and Cultural Heritage-214_a.pdf
Povroznik-Virtual Museums and Cultural Heritage-214_c.pdf

12:00pm - 12:15pm
Short Paper (10+5min) [abstract]

The Future of Narrative Theory in the Digital Age?

Hanna-Riikka Roine

University of Helsinki

As has often been noted, digital humanities are to be understood in the plural. It seems, however, that quite as often they are understood as the practice of introducing digital methods into the humanities, or as a way to analyse “the digital” within a humanist framework. This presentation takes a slightly different approach: its aim is to challenge some of the traditional theoretical concepts of a humanist field, narrative theory, through the properties of today’s computational environment.

Narrative theory originated in literary criticism and has based its concepts and its understanding of narrative in media on printed works. While a few trends with a more broadly defined base are emerging (e.g. the project of “transmedial narratology”), the analysis of verbal narrative structures and strategies from the perspective of literary theory remains the primary concern of the field (see Kuhn & Thon 2017). Furthermore, the focus of current research is mostly medium-specific, even though various phenomena studied by narratology (e.g. narrativity, worldbuilding) are agreed to be medium-independent.

My presentation starts from the fact that the ancient technology of storytelling has become enmeshed in a software-driven environment which not only has the potential to simulate or “transmediate” all artistic media, but also differs fundamentally from verbal language in its structure and strategies. This development or “digital turn” has so far mostly escaped the attention of narratologists, although it has had profound effects on the affordances and environments of storytelling.

In my presentation, I take up the properties of computational media that challenge the print-based bias of current narrative theory. As a starting point, I suggest that the scope of narrative theory should be extended to the machines of digital media instead of looking at their surface (cf. Wardrip-Fruin 2009). As software-driven, conditional, and process-based, storytelling in computational environments is not so much about disseminating a single story, but rather about multiplication of narrative, centering upon the underlying patterns on which varied instantiations can be based. Furthermore, they challenge the previous theoretical emphasis on fixed media content and author-controlled model of transmission. (See e.g. Murray 1997 and 2011; Bogost 2007, Hayles 2012, Manovich 2013, Jenkins et al. 2013.)

Because computational environments represent “a new norm” compared to the prototypical narrative developed in the study of literary fiction, Brian McHale has recently predicted that narrative theory “might become divergent and various, multiple narratologies instead of one – a separate narratology for each medium and intermedium” (2016, original emphasis). In my view, such a future fragmentation of the field would only diminish the potential of narrative theory. Instead, the various theories could converge or hybridize in a similar way to contemporary media – especially the study of today’s transmedia, which is hybridizing both in the sense of content being spread across media and in the sense of media being incorporated by the computer and thus acquiring the properties of computational environments.

The consequences of recognising media convergence or hybridization in narrative theory are not only (meta)theoretical. The primary emphasis on media content is still clearly visible in the division of the modern academic study of culture into disciplines – literary studies focus on literature, for example. The least narrative theory can do is expand “potential areas of cross-pollination” (Kuhn & Thon 2017) with media studies, for example, and challenge the print-based assumptions behind concepts such as narrativity or storyworld; but there may also be a need to effect changes in the working methods of narratologists. Creating multidisciplinary research groups focusing on narrative and storytelling in current computational media is one solution (still somewhat unusual in the “traditional” humanities, which are focused on single-authored articles and monographs); critically reviewing academic curricula is another. N. Katherine Hayles (2012), for example, has proposed a “Comparative Media Studies approach” to describe a transformed disciplinary coherence that literary studies might embrace.

In my view, narrative theory can truly be “transmedial” and contribute to the study of storytelling practices and strategies in contemporary computational media, but various print- and content-based biases underlying its toolkit must be genuinely addressed first. The need for this is urgent not only because “narratives are everywhere”, but also because the old traditional online/offline distinction has begun to disappear.


Bogost, Ian. 2007. Persuasive Games: The Expressive Power of Videogames. Cambridge, Ma: The MIT Press.

Hayles, N. Katherine. 2012. How We Think: Digital Media and Contemporary Technogenesis. Chicago: Univ. of Chicago Press.

Jenkins, Henry, Sam Ford, and Joshua Green. 2013. Spreadable Media: Creating Value and Meaning in a Networked Culture. New York: New York Univ. Press.

Kuhn, Markus and Jan-Noël Thon. 2017. “Guest Editors’ Column. Transmedial Narratology: Current Approaches.” NARRATIVE 25:3: 253–255.

Manovich, Lev. 2013. Software Takes Command: Extending the Language of New Media. New York and London: Bloomsbury.

McHale, Brian. 2016. “Afterword: A New Normal?” In Narrative Theory, Literature, and New Media: Narrative Minds and Virtual Worlds, edited by Mari Hatavara, Matti Hyvärinen, Maria Mäkelä, and Frans Mäyrä, 295–304. London: Routledge.

Murray, Janet. 1997. Hamlet on the Holodeck: The Future of Narrative in Cyberspace. New York: The Free Press.

―――. 2011. Inventing the Medium. Principles of Interaction Design as a Cultural Practice. Cambridge, Ma: The MIT Press.

Wardrip-Fruin, Noah. 2009. Expressive Processing: Digital Fictions, Computer Games, and Software Studies. Cambridge, Ma. and London: The MIT Press.

Roine-The Future of Narrative Theory in the Digital Age-137_a.pdf
Roine-The Future of Narrative Theory in the Digital Age-137_c.pdf

12:15pm - 12:30pm
Short Paper (10+5min) [abstract]

Broken data and repair work

Minna Ruckenstein

Consumer Society Research Centre, University of Helsinki, Finland

Recent research introduces a concept-metaphor of “broken data”, suggesting that digital data might be broken and fail to perform, or be in need of repair (Pink et al, forthcoming). Concept-metaphors, anthropologist Henrietta Moore (1999, 16; see also Moore 2004) argues, are domain terms that “open up spaces in which their meanings – in daily practice, in local discourses and in academic theorizing – can be interrogated”. By doing so, concept-metaphors become defined in practice and in context; they are not meant to be foundational concepts, but they work as partial and perspectival framing devices. The aim of a concept-metaphor is to arrange and provoke ideas and act as a domain within which facts, connections and relationships are presented and imagined.

In this paper, the concept-metaphor of broken data is discussed in relation to the open data initiative, Citizen Mindscapes, an interdisciplinary project that contextualizes and explores a Finnish-language social media data set (‘Suomi24’, or Finland24 in English), consisting of tens of millions of messages and covering social media over a time span of 15 years (see, Lagus et al 2016). The role of the broken data metaphor in this discussion is to examine the implications of breakages and consequent repair work in data-driven initiatives that take advantage of secondary data. Moreover, the concept-metaphor can sensitize us to consider the less secure and ambivalent aspects of data worlds. By focusing on how data might be broken, we can highlight misalignments between people, devices and data infrastructures, or bring to the fore the failures to align data sources or uses with the everyday.

As Pink et al. (forthcoming) suggest, this metaphorical understanding of digital data, underlining aspects of data brokenness, brings together various strands of scholarly work and highlights important continuities with earlier research. Studies of material culture explore practices of breakage and repair in relation to the materiality of objects, for instance by focusing on art restoration (Dominguez Rubio 2016) or car repair (Dant 2010). Drawing attention to the fragility of objects and temporal decay, these studies underline that objects break and have to be mended and restored. When these insights are brought into the field of data studies, the materiality of platforms and software and the subsequent data arrangements, including material restrictions and breakages, become a concern (Dourish 2016; Tanweer et al 2016), emphasizing aspects of brokenness and following repair work in relation to digital data (Pink et al, forthcoming).

In science and technology studies (STS), on the other hand, breakages have been studied in relation to infrastructures, demonstrating that it is through instances of breakdown that structures and objects which have become invisible to us in the everyday gain a new kind of visibility. The STS scholar Steven Jackson expands the notion of brokenness to more everyday situations and asks ‘what happens when we take erosion, breakdown, and decay, rather than novelty, growth, and progress, as our starting points in thinking through the nature, use, and effects of information technology and new media?’ (2014: 174). Instances of data breakage can then be seen in light of mundane data arrangements, as a recurring feature of data work rather than an exceptional event (Pink et al, forthcoming; Tanweer et al 2016).

In order to concretize further the usefulness of the concept-metaphor of broken data, I will detail instances of breakage and repair in the data work of the Citizen Mindscapes initiative, emphasizing the efforts needed to overcome various challenges in working with large digital data. This approach treats the obstacles and barriers that slow or derail the data science process as an important resource for knowledge production and innovation (Tanweer et al 2016). In the collaborative Citizen Mindscapes initiative, discussing the gaps and possible anomalies in the data led to conversations about the production of the data, deepening our understanding of the human and material factors at play in processes of data generation.

Identifying data breakages

The Suomi24 data was generated by a media company, Aller. The data set grew on the company’s servers for over a decade, gaining a new life and purpose when the company decided to open the proprietary data for research purposes. A new infrastructure was needed for hosting and distributing the data. One such data infrastructure was already in place: the Language Bank of Finland, maintained by CSC (IT Centre for Science) and developed for acquiring, storing, offering and maintaining linguistic resources, tools and data sets for academic researchers. The Language Bank gave a material structure to the Suomi24 data: it was repurposed as research data for linguistics.

The Korp tool, developed for analysing data sets stored in the Language Bank, allowed word searches in relation to individual sentences, retaining the Suomi24 data as a resource for linguistic research. Yet these material arrangements constrained other possible uses of the data that were of interest to the Citizen Mindscapes research collective, which aims to work the data into a form that accommodates the social science focus on topical patterns and on the emotional waves and rhythms characteristic of social media. In the past two years, the research collective, particularly those members experienced in working with large data sets, has been repairing and cleaning the data to make it ready for additional computational approaches. The goal is to build a methodological toolbox from which researchers who do not possess computational skills, but are interested in using digital methods in social scientific inquiry, can benefit. This entails, for instance, developing user interfaces that narrow down the huge data set and allow access to the data from topic-led perspectives.

The ongoing work has alerted us to breakages of data, raising more general questions about the origins and nature of data. Social media data, such as the Suomi24 data, is never an accurate or complete representation of society. From the societal perspective, the data is broken, offering discontinuous, partial and interrupted views of individual, social and societal aims. Preparing data for research that takes societal brokenness seriously underlines the importance of understanding the limitations and biases in the production of the data, including insights into how the data might be broken. The first step towards this aim was a research report (Lagus et al. 2016) that evaluated and contextualized the Suomi24 data in a wide variety of ways. We paid attention to the writers of the social media community as producers of the data, and we described the moderation practices of the company to demonstrate how they shape the data set by filtering certain swearwords and racist terms, or certain kinds of messages, for instance advertisements or messages containing personal information.

The yearly volume and daily rhythms of the data were calculated based on timestamps, and the topical hierarchies of the data were uncovered by attending to the conversational structure of the social media forum. When our work identified gaps, errors and anomalies in the data, it revealed that data might be broken and discontinuous due to human or technological forces: infrastructure failures, trolling, or automated spam bots. With this information about gaps in the data, we opened a conversation with the social media company's employees and learned that nobody could tell us about the 2004-2005 gap in the data. A crack in the organizational memory was revealed, reminding us of the links between the temporality of data and human memory. In contrast, the anomaly in the data volume in July 2009, which we first suspected marked a day when something dramatic had created turmoil on the social media forum, turned out to be a spam bot, remembered very well in the company.
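The volume-based detection of gaps and anomalies described here can be sketched in a few lines of Python. This is an illustrative reconstruction, not the Citizen Mindscapes pipeline; the ISO timestamp format and the spike threshold are assumptions:

```python
from collections import Counter
from datetime import datetime

def monthly_volumes(timestamps):
    """Count messages per (year, month) from ISO-format timestamps."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        counts[(dt.year, dt.month)] += 1
    return counts

def find_gaps_and_spikes(counts, spike_factor=3.0):
    """Flag months with no data (gaps) and months whose volume exceeds
    spike_factor times the mean (possible spam bots or genuine turmoil)."""
    if not counts:
        return [], []
    months = sorted(counts)
    # Enumerate every calendar month between the first and last observed.
    (y, m), (ly, lm) = months[0], months[-1]
    all_months = []
    while (y, m) <= (ly, lm):
        all_months.append((y, m))
        m += 1
        if m > 12:
            y, m = y + 1, 1
    gaps = [mo for mo in all_months if counts.get(mo, 0) == 0]
    mean = sum(counts.values()) / len(counts)
    spikes = [mo for mo in months if counts[mo] > spike_factor * mean]
    return gaps, spikes
```

On such a view, months with zero messages (like the 2004-2005 gap) surface as gaps, while a month inflated by a spam bot (like July 2009) would exceed the spike threshold; both then become prompts for conversations about data production rather than mere errors.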

In the field of statistics, for instance, research might require intimate knowledge of all possible anomalies in the data. What appears incomplete, inconsistent and broken to some practitioners might be irrelevant to others, or a research opportunity. The role of the concept-metaphor of broken data is to open a space for discussion about these differences, maintaining them rather than resolving them. One option is to highlight how data is seen as broken in different contexts, compare the breakages, follow what happens after them, and focus on the work of repair and cleaning.

Concluding remarks

The purpose of this paper has been to introduce the broken data metaphor, which calls for paying more attention to the incomplete and fractured character of digital data. Acknowledging the incomplete nature of data is in itself of course nothing new; researchers are well aware that their data lacks perfection. With the growing use of secondary data, however, the ways in which data is broken might not be known beforehand, underlining the need to pay more careful attention to brokenness and the consequent work of repair. In the case of the Suomi24 data, the data breakages suggest that we need to actively question data production and the diverse ways in which data are adapted for different ends by practitioners. As described above, the repurposed data requires an infrastructure, servers and cloud storage; the software and analytics tools enable certain perspectives and operations and disable others. Data is always inferred and interpreted in infrastructure and database design and by professionals, who see the data, and its possibilities, differently depending on their training. As Genevieve Bell (2015: 16) argues, the work of coding data and writing algorithms determines 'what kind of relationships there should be between data sets' and, by doing so, data work promotes judgments about what data should speak to what other data. As our Citizen Mindscapes collaboration suggests, making 'data talk' to other data sets, or to interpreters of data, is permeated by moments of breakdown and repair that call for a richer understanding of everyday data practices. The intent of this paper has been to suggest that a focus on data breakages is an opportunity to learn about everyday data worlds, and to account for how data breakages challenge the linear, solutionist, and triumphant stories of big data.


Bell, G. (2015). 'The secret life of big data'. In T. Boellstorff and B. Maurer (eds), Data, Now Bigger and Better! Prickly Paradigm Press, 7–26.

Dant, T. (2010). 'The work of repair: Gesture, emotion and sensual knowledge'. Sociological Research Online, 15(3), 7.

Domínguez Rubio, F. (2016). 'On the discrepancy between objects and things: An ecological approach'. Journal of Material Culture, 21(1), 59–86.

Jackson, S. J. (2014). 'Rethinking repair'. In T. Gillespie, P. Boczkowski and K. Foot (eds), Media Technologies: Essays on Communication, Materiality and Society. Cambridge, MA: MIT Press.

Lagus, K., Pantzar, M., Ruckenstein, M. and Ylisiurua, M. (2016). Suomi24: Muodonantoa aineistolle. Helsinki: Consumer Society Research Centre, Faculty of Social Sciences, University of Helsinki.

Moore, H. (1999). 'Anthropological theory at the turn of the century'. In H. Moore (ed.), Anthropological Theory Today. Cambridge: Polity Press, 1–23.

Moore, H. L. (2004). 'Global anxieties: Concept-metaphors and pre-theoretical commitments in anthropology'. Anthropological Theory, 4(1), 71–88.

Pink, S., et al. (forthcoming). 'Broken data: Data metaphors for an emerging world'. Big Data & Society.

Tanweer, A., Fiore-Gartland, B. and Aragon, C. (2016). 'Impediment to insight to innovation: Understanding data assemblages through the breakdown–repair process'. Information, Communication & Society, 19(6), 736–752.

Ruckenstein-Broken data and repair work-204_a.pdf
11:00am - 12:30pmT-PIII-1: Open and Closed
Session Chair: Olga Holownia
11:00am - 11:30am
Long Paper (20+10min) [abstract]

When Open becomes Closed: Findings of the Knowledge Complexity (KPLEX) Project.

Jennifer Edmond, Georgina Nugent Folan, Vicky Garnett

Trinity College Dublin, Ireland

The future of cultural heritage seems to be all about “data.” A Google search on the term ‘data’ returns over 5.5 billion hits, but the fact that the term is so well embedded in modern discourse does not necessarily mean that there is a consensus as to what it is or should be. The lack of consensus regarding what data are on a small scale acquires greater significance and gravity when we consider that one of the major terminological forces driving ICT development today is that of "big data." While the phrase may sound inclusive and integrative, "big data" approaches are highly selective, excluding any input that cannot be effectively structured, represented, or, indeed, digitised. The future of DH, of any approach to understanding complex phenomena or sources such as those held in cultural heritage institutions, indeed the future of our increasingly datafied society, depends on how we address the significant epistemological fissures in our data discourse. For example, how can researchers claim that "when we speak about data, we make no assumptions about veracity" while one of the requisites of "big data" is "veracity"? On the other hand, how can we expect humanities researchers to share their data on open platforms such as the European Open Science Cloud (EOSC) when we, as a community, resist the homogenisation implied and required by the very term “data”, and share our ownership of it with both the institutions that preserve it and the individuals that created it? How can we strengthen European identities and transnational understanding through the use of ICT systems when these very systems incorporate and obscure historical biases between languages, regions and power elites? In short, are we facing a future when the mirage of technical “openness” actually closes off our access to the perspectives, insight and information we need as scholars and as citizens? Furthermore, how might this dystopic vision be avoided?

These are the kinds of questions and issues under investigation by the European Horizon 2020 funded Knowledge Complexity (KPLEX) project, which applies strategies developed by humanities researchers for dealing with complex, messy cultural data: the very kind of data that resists datafication and poses the biggest challenges to knowledge creation in large data corpora environments. Arising out of the findings of the KPLEX project, this paper will present the synthesised findings of an integrated set of research questions and challenges addressed by a diverse team led by Trinity College Dublin (Ireland) and encompassing researchers at Freie Universität Berlin (Germany), DANS-KNAW (The Hague) and TILDE (Latvia). We have adopted a comparative, multidisciplinary, and multi-sectoral approach to addressing the issue of bias in big data, focussing on the following four key challenges to the knowledge creation capacity of big data approaches:

1. Redefining what data is and the terms we use to speak of it (TCD);

2. The manner in which data that are not digitised or shared become "hidden" from aggregation systems (DANS-KNAW);

3. The fact that data is human created, and lacks the objectivity often ascribed to the term (FUB);

4. The subtle ways in which data that are complex almost always become simplified before they can be aggregated (TILDE).

The paper will present a synthesised version of these integrated research questions, and discuss the overall findings and recommendations of the project, which completes its work at the end of March 2018. What follows gives a flavour of the work ongoing at the time of writing this abstract, and of the issues that will be raised in the DHN paper.

1. Redefining what data is and the terms we use to speak of it. Many definitions of data, even thoughtful scholarly ones, associate the term with a factual or objective stance, as if data were a naturally occurring phenomenon. But data is not fact, nor is it objective, nor can it be honestly aligned with terms such as ‘signal’ or ‘stimulus,’ or the quite visceral (but misleading) ‘raw data.’ To become data, phenomena must be captured in some form, by some agent; signal must be separated from noise, like must be organised against like, transformations occur. These organisational processes are human determined or human led, and therefore cannot be seen as wholly objective, irrespective of how effective a (human built) algorithm may be. The core concern of this facet of the project was to expand the understanding of the heterogeneity of definitions of data, and of the implications of this state of understanding. Our primary ambition under this theme was to establish a clear taxonomy of existing theories of data, to underpin a more applied comparison of humanistic versus technical applications of the term. We did this by identifying the key terms (and how they are used differently), key points of bifurcation, and key priorities under each conceptualisation of data. As such, this facet of the project supported the integrated advancement of the three other project themes, while itself developing new perspectives on the rhetorical stakes and action implications of differing concepts of the term ‘data’ and how these will impact the future not only of DH but of society at large.

2. Dealing with ‘hidden’ data. According to the 2013 ENUMERATE Core 2 survey, only 17% of the analogue collections of European heritage institutions had at that time been digitised. This number actually represents a decrease over the findings of their 2012 survey (almost 20%). The survey also reached only a limited number of respondents: 1400 institutions over 29 countries, which surely captures the major national institutions but not local or specialised ones. Although the ENUMERATE Core 2 report does not break down these results by country, one has to imagine that there would be large gaps in the availability of data from some countries over others. Because so much of this data has not been digitised, it remains ‘hidden’ from potential users. This may have always been the case, as there have always been inaccessible collections, but in a digital world, the stakes and the perceptions are changing. The fact that so much other material is available on-line, and that an increasing proportion of the most well-used and well-financed cultural collections are as well, means that the reasonable assumption of the non-expert user of these collections is that what cannot be found does not exist (whereas in the analogue age, collections would be physically contextualised with their complements, leaving the more likely assumption to be that more information existed, but could not be accessed). The threat that our narratives of histories and national identities might thin out to become based on only the most visible sources, places and narratives is high. This facet of the project explored the manner in which data that are not digitised or shared become "hidden" from aggregation systems.

3. Knowledge organisation and epistemics of data. The nature of humanities data is such that even within the digital humanities, where research processes are better optimised toward the sharing of digital data, sharing of "raw data" remains the exception rather than the norm. The ‘instrumentation’ of the humanities researcher consists of a dense web of primary, secondary and methodological or theoretical inputs, which the researcher traverses and recombines to create knowledge. This synthetic approach makes the nature of the data, even at its ‘raw’ stage, quite hybrid, and already marked by the curatorial impulse that is preparing it to contribute to insight. This aspect may be more pronounced in the humanities than in other fields, but the subjective element is present in any human-triggered process leading to the production or gathering of data. Another element of this is the emotional dimension. Emotions are motivators for action and interaction that relate to social, cultural, economic and physiological needs and wants. Emotions are crucial factors in relating or disconnecting people from each other. They help researchers to experientially assess their environments, but this aspect of the research process is considered taboo, as noise that obscures the true ‘factual signal’, and as less ‘scientific’ (seen in terms of strictly Western colonialist paradigms of knowledge creation) than other possible contributors to scientific observation and analysis. Our primary ambition here was to explore the data creation processes of the humanities and related research fields to understand how they combine pools of information and other forms of intellectual processing to create data that resists datafication and ‘like-with-like’ federation with similar results. The insights gained will make visible many of the barriers to the inclusion of all aspects of science under current Open Science trajectories, and reveal further central elements of social and cultural knowledge that cannot be accommodated under current conceptualisations of ‘data’ and the systems designed to use them.

4. Cultural data and representations of system limitations. Cultural signals are ambiguous, polysemic, often conflicting and contradictory. In order to transform culture into data, its elements – as all phenomena that are being reduced to data – have to be classified, divided, and filed into taxonomies and ontologies. This process of datafication robs them of their polysemy, or at least reduces it. One of the greatest challenges for so-called Big Data is the analysis and processing of multilingual content. This challenge is particularly acute for unstructured texts, which make up a large portion of the Big Data landscape. How do we deal with multilingualism in Big Data analysis? What are the techniques by which we can analyze unstructured texts in multiple languages, extracting knowledge from multilingual Big Data? Will new computational techniques such as AI deep learning improve or merely alter the challenges? The current method for analyzing multilingual Big Data is to leverage language technologies such as machine translation, terminology services, automated speech recognition, and content analytics tools. In recent years, the quality and accuracy of these key enabling technologies for Big Data have improved substantially, making them indispensable tools for high-demand applications with a global reach. However, just as not all languages are alike, the development of these technologies differs for each language. Languages with large speaker populations have robust digital resources, the result of large-scale digitization projects in a variety of domains, including cultural heritage information. Smaller languages have resources that are much more scant. Those resources that do exist may be underpinned by far less robust algorithms and far smaller bases for statistical modelling, leading to less reliable results, a fact that in large-scale, multilingual environments (like Google Translate) is often not made transparent to the user. The KPLEX project is exploring and describing the nature and potential of ‘data’ within these clearly defined sets of actors and practices at the margins of what can currently be approached holistically using computational methods. It is also envisioning approaches to the integration of hybrid data forms within and around digital platforms, leading not so much to the virtualisation of information generation as to its augmentation.

Edmond-When Open becomes Closed-175_a.pdf

11:30am - 11:45am
Short Paper (10+5min) [publication ready]

Open, Extended, Closed or Hidden Data of Cultural Heritage

Tuula Pääkkönen1, Juha Rautiainen1, Toni Ryynänen2, Eeva Uusitalo2

1National Library of Finland, Finland; 2Ruralia Institute, University of Helsinki, Finland

The National Library of Finland (NLF) agreed on an “Open National Library” policy in 2016[1]. The policy contains eight principles, divided into the themes of accessibility, openness in actions, and collaboration. Accessibility at the NLF means that access must exist both to the metadata and to the content, while respecting the rights of the rights holders. Openness in operations means that our actions and decision models are transparent and clear, and that the materials are accessible to researchers and other users. These are ways in which the NLF puts the findable, accessible, interoperable, re-usable (FAIR) data principles [2] into practice.

The purpose of this paper is to examine how the policy has impacted our work and how findability and accessibility have been implemented, in particular from the aspects of open, extended, closed and hidden data. In addition, our aim is to specify the characteristics of existing and potential forms of data produced by the NLF from the research and development perspectives. A continuous challenge is the availability of the digital resources: gaining access to the digitised material for both researchers and the general public, since there are constant requests for access to newer materials outside the legal deposit libraries’ workstations.

Pääkkönen-Open, Extended, Closed or Hidden Data of Cultural Heritage-224_a.pdf
Pääkkönen-Open, Extended, Closed or Hidden Data of Cultural Heritage-224_c.pdf

11:45am - 12:00pm
Distinguished Short Paper (10+5min) [publication ready]

Aalto Observatory for Digital Valuation Systems

Jenni Huttunen1, Maria Joutsenvirta2, Pekka Nikander1

1Aalto University, Department of Communications and Networking; 2Aalto University, Department of Management Studies

Money is a recognised factor in creating sustainable, affluent societies. Yet the neoclassical orthodoxy that prevails in our economic thinking remains contested, its supporters claiming their results to be objectively true while many heterodox economists claim the whole system stands on feet of clay. Of late, the increased activity around complementary currencies suggests that the fiat money zeitgeist might be giving way to more variety in our monetary system. Rather than emphasizing what money does, as mainstream economists do, other fields of science allow us to approach money as an integral part of the hierarchies and networks of exchange through which it circulates. This paper suggests that a broad understanding of money and more variety in the monetary system have great potential to further a more egalitarian and sustainable economy. They can drive the extension of society to more inclusive levels and transform people’s economic roles and identities in the process. New technologies, including blockchain and smart ledger technology, are able to support decentralized money creation through the use of shared and “open” peer-to-peer rewarding and IOU systems. Alongside specialists’ and decision makers’ capabilities, our project most pressingly calls for engaging citizens in the process early on. Multidisciplinary competencies are needed to take relevant action to investigate, envision and foster novel ways of value creation. For this, we are forming the Aalto Observatory on Digital Valuation Systems to gain deeper understandings of sustainable value creation structures enabled by new technology.

Huttunen-Aalto Observatory for Digital Valuation Systems-191_a.pdf
Huttunen-Aalto Observatory for Digital Valuation Systems-191_c.pdf

12:00pm - 12:15pm
Short Paper (10+5min) [publication ready]

Challenges and perspectives on the use of open cultural heritage data across four different user types: Researchers, students, app developers and hackers

Ditte Laursen1, Henriette Roued-Cunliffe2, Stig Svennigsen1

1Royal Danish Library; 2University of Copenhagen

In this paper, we analyse and discuss from a user perspective and from an organisational perspective the challenges and perspectives of the use of open cultural heritage data. We base our study on empirical evidence gathered through four cases where we have interacted with four different user groups: 1) researchers, 2) students, 3) app developers and 4) hackers. Our own role in these cases was to engage with these users as teachers, organizers and/or data providers. The cultural heritage data we provided were accessible as curated data sets or through APIs. Our findings show that successful use of open heritage data is highly dependent on organisations' ability to calibrate and curate the data differently according to contexts and settings. More specifically, we show what different needs and motivations different user types have for using open cultural heritage data, and we discuss how these can be met by teachers, organizers and data providers.

Laursen-Challenges and perspectives on the use of open cultural heritage data across four different user_a.pdf

12:15pm - 12:30pm
Short Paper (10+5min) [abstract]

Synergy of contexts in the light of digital humanities: a pilot study

Monika Porwoł

State University of Applied Sciences in Racibórz

The present paper describes a pilot study pertaining to the linguistic analysis of meaning with regard to the word ladder[EN]/drabina[PL], taking into account views from digital humanities. WordnetLoom mapping is introduced as one of the existing research tools proposed by the CLARIN ERIC research and technology infrastructure. The explicated material comprises retrospective remarks and interpretations provided by 74 respondents who took part in a survey. A detailed classification of the word’s multiple meanings is presented in tabular form (showing the number of contexts in which participants accentuate the word ladder/drabina), along with some comments and opinions. Undoubtedly, the results suggest that apart from the general domain of the word offered for consideration, most of its senses can usually be attributed to linguistic recognitions. Moreover, some perspectives on the continuation of future research and critical afterthoughts are made prominent in the last part of this paper.

Porwoł-Synergy of contexts in the light of digital humanities-106_a.doc
Porwoł-Synergy of contexts in the light of digital humanities-106_c.pdf
11:00am - 12:30pmT-PIV-1: Newspapers
Session Chair: Mats Malm
11:00am - 11:30am
Long Paper (20+10min) [publication ready]

A Study on Word2Vec on a Historical Swedish Newspaper Corpus

Nina Tahmasebi

Göteborgs Universitet,

Detecting word sense changes can be of great interest in the field of digital humanities. Thus far, most investigations and automatic methods have been developed and carried out on English text, and most recent methods make use of word embeddings. This paper presents a study on using Word2Vec, a neural word embedding method, on a Swedish historical newspaper collection. Our study includes a set of 11 words, and our focus is the quality and stability of the word vectors over time. We investigate the extent to which a word embedding method like Word2Vec can be used effectively on texts where the volume and quality are limited.
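One common way to probe the stability of word vectors over time, in the spirit of the study described above, is to compare a word's nearest neighbours across embeddings trained on different time slices. The sketch below is a generic illustration, not the paper's actual method; the toy vectors and the Jaccard overlap measure are assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_neighbours(word, embeddings, k=3):
    """The k most similar words to `word` within one time slice's embeddings."""
    target = embeddings[word]
    sims = [(cosine(target, vec), other)
            for other, vec in embeddings.items() if other != word]
    return {w for _, w in sorted(sims, reverse=True)[:k]}

def neighbour_overlap(word, slice_a, slice_b, k=3):
    """Jaccard overlap of a word's neighbourhoods in two time slices:
    a rough proxy for vector stability (1.0 = identical neighbours)."""
    na = nearest_neighbours(word, slice_a, k)
    nb = nearest_neighbours(word, slice_b, k)
    return len(na & nb) / len(na | nb)
```

A low overlap for a word across decades may indicate either genuine sense change or unstable vectors due to limited volume and OCR quality, which is precisely why stability needs to be assessed before drawing semantic conclusions.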

Tahmasebi-A Study on Word2Vec on a Historical Swedish Newspaper Corpus-157_a.pdf

11:30am - 11:45am
Short Paper (10+5min) [abstract]

A newspaper atlas: Named entity recognition and geographic horizons of 19th century Swedish newspapers

Erik Edoff

Umeå University

What was the outside world for 19th century newspaper readers? That is the overarching problem investigated in this paper. One way of approaching this issue is to investigate which geographical places were mentioned in the newspaper, and how frequently. To be sure, newspapers were not the only medium that contributed to 19th century readers’ notion of the outside world. Public meetings, novels, sermons, edicts, travelers, photography, and chapbooks are other forms of media that people encountered with growing regularity during the century; however, newspapers often covered the sermons, printed lists of travelers and attracted readers with serial novels. This means, at least to some extent, that these are covered in the newspaper columns. And after all, newspapers were easier to collect and archive than a public meeting, which makes them an accessible source for the historian.

Two newspapers, digitized by the National Library of Sweden, are analyzed: Tidning för Wenersborgs stad och län (TW) and Aftonbladet (AB). They were chosen based on their publishing places’ different geographical and demographical conditions as well as the papers’ size and circulation. TW was founded in 1848 in the town of Vänersborg, located on the western shore of lake Vänern, which was connected with the west coast port, Göteborg, by the Trollhätte channel, established in 1800. The newspaper was published in about 500 copies once a week (twice a week from 1858) and addressed a local and regional readership. AB was a daily paper founded in Stockholm in 1830 and was soon to become the leading liberal paper of the Swedish capital, with a great impact on national political discourse. For its time, it was widely circulated (between 5,000 and 10,000 copies) in both Stockholm and the country as a whole. Stockholm was an important seaport on the eastern coast. These geographic distinctions probably mean interesting differences in the papers’ respective outlooks. The steamboats revolutionized travelling during the first half of the century, but their glory days had passed by around 1870, when they were replaced by railways as the most prominent way of transporting people.

This paper focuses on comparing the geographies of the two newspapers by analyzing the places mentioned in the periods 1848–1859 and 1890–1898. The main railroads of Sweden were constructed during the 1860s, and the selected years therefore cover newspaper geographies before and after the railroads.

The main questions the paper addresses relate to media history and the history of media infrastructure. During the second half of the 19th century several infrastructural technologies were introduced and developed (the electric telegraph, the postal system, newsletter corporations, railways, and telephony, among others). The hypothesis is that these technologies had an impact on the newspapers’ geographies. The media technologies enabled information to travel great distances in short timespans, which could have had homogenizing effects on newspaper content, as suggested by much traditional research (Terdiman 1999). On the other hand, digital historical research has shown that the development of railroads changed the geography of Houston newspapers, increasing the importance of the near region rather than concentrating geographic information in national centers (Blevins 2014).

The goal of the study is, in other words, to investigate what the infrastructural novelties introduced during the course of the 19th century, as well as the different geographic and demographic conditions, meant for the view of the outside world, or the imagined geographies, provided by newspapers. The aim of the paper is therefore twofold: (1) to investigate a historical-geographical problem relating to newspaper coverage and infrastructural change, and (2) to try out the use of Named Entity Recognition on Swedish historical newspaper data.

Named Entity Recognition (NER) software is designed to locate and tag entities, such as persons, locations, and organizations. This paper uses SweNER to mine the data for locations mentioned in the text (Kokkinakis et al. 2014). Earlier research has emphasized the problems with bad OCR scanning of historical newspapers. A picture of a newspaper page is read by OCR software and converted into a text file; the result contains many misinterpretations and therefore a considerable amount of noise (Jarlbrink & Snickars 2017). This is a big obstacle when working with digital tools on historical newspapers. Some earlier research has used and evaluated the performance of different NER tools on digitized historical newspapers, also underlining OCR errors as the main problem with using NER on such data (Kettunen et al. 2017). SweNER has also been evaluated for tagging named entities in historical Swedish novels, where the OCR problems are negligible (Borin et al. 2007). This paper, however, does not evaluate the software’s results in a systematic way, even though some important biases have been identified by manually going through the tagging of some newspaper copies. Some important geographic entities are not tagged by SweNER at all (e.g. Paris, Wien [Vienna], Borås and Norge [Norway]). SweNER is able to pick up some OCR reading mistakes, although many recurring ones (e.g. Lübeck read as Liibeck, Liibcck, Ltjbeck, Ltlbeck) are not tagged. These problems can be handled, at least to some degree, by using “leftovers” from the data (wrongly spelled words) that were not matched in a comparison corpus. I have manually scanned the 50,000 most frequent words that were not matched in the comparative corpus, looking for wrongly spelled place names. I ended up with a list of around 1,000 places and some 2,000 spelling variations (e.g. over 100 ways of spelling Stockholm). This manually constructed list can be used as a gazetteer, complementing the NER results and giving a more accurate picture of 19th century newspaper geographies.
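A gazetteer of OCR spelling variants of this kind reduces, in essence, to a lookup table from observed forms to canonical place names. The sketch below is a hypothetical illustration of that idea (the function names are invented; the variant spellings are the examples given in the abstract):

```python
def build_gazetteer(canonical_to_variants):
    """Invert a {canonical place: [OCR spelling variants]} mapping into
    a lookup table from any observed form to the canonical name."""
    lookup = {}
    for place, variants in canonical_to_variants.items():
        lookup[place.lower()] = place
        for v in variants:
            lookup[v.lower()] = place
    return lookup

def tag_places(tokens, lookup):
    """Return (token, canonical place) pairs for tokens found in the
    gazetteer, complementing entities the NER tool missed."""
    return [(t, lookup[t.lower()]) for t in tokens if t.lower() in lookup]
```

Counting the canonical names returned by such a lookup, rather than the raw OCR tokens, is what allows the roughly 2,000 spelling variations to collapse into about 1,000 places.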


Blevins, C. (2014), ”Space, nation, and the triumph of region: A view on the world from Houston”, Journal of American History, Vol. 101, no 1, pp. 122–147.

Borin, L., Kokkinakis, D., and Olsson, L-G. (2007), “Naming the past: Named entity and animacy recognition in 19th century Swedish literature”, Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pp. 1–8, available at: (accessed October 31 2017).

Jarlbrink, J. and Snickars, P. (2017), “Cultural heritage as digital noise: Nineteenth century newspapers in the digital archive”, Journal of Documentation, Vol. 73, no 6, pp. 1228–1243.

Kettunen, K., Mäkelä, E., Ruokolainen, T., Kuokkala, J., and Löfberg, L. (2017), “Old content and modern tools: Searching named entities in a Finnish OCRed historical newspaper collection 1771–1910”, Digital Humanities Quarterly, (preview) Vol. 11, no 3.

Kokkinakis, D., Niemi, J., Hardwick, S., Lindén, K., and Borin, L. (2014), “HFST-SweNER – A new NER resource for Swedish”, Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC), Reykjavik 26–31 May 2014, pp. 2537–2543.

Terdiman, R. (1999) “Afterword: Reading the news”, Making the news: Modernity & the mass press in nineteenth-century France, Dean de la Motte & Jeannene M. Przyblyski (eds.), Amherst: University of Massachusetts Press.

Edoff-A newspaper atlas-177_a.pdf
Edoff-A newspaper atlas-177_c.pdf

11:45am - 12:00pm
Short Paper (10+5min) [abstract]

Digitised newspapers and the geography of the nineteenth-century “lingonberry rush” in Finland

Matti La Mela

History of Industrialization & Innovation group, Aalto University,

This paper uses digitised newspaper data to analyse practices of nature use. In the late nineteenth century, a “lingonberry rush” developed in Sweden and Finland owing to growing foreign demand and exports of lingonberries. The Finnish newspapers closely followed events in neighbouring Sweden and reported on the export tonnages and the economic potential this red gold could hold for Finland as well. The paper is interested in the geography of this “lingonberry rush”, and explores how imprecise geographic information about berry-picking can be gathered and structured from the digitised newspapers (metadata and NER). The paper distinguishes between unique and original news and longer chains of news reuse, and makes use of open tools for named-entity recognition and text reuse detection. This geospatial analysis adds to the reinterpretation of the history of the Nordic allemansrätten, the tradition of public access to nature that allows everyone to pick wild berries today; the circulation of commercial news on lingonberries in the nineteenth century reinforced the idea of berries as a commodity, and ultimately helped portray wild berries as an openly accessible resource.

La Mela-Digitised newspapers and the geography of the nineteenth-century “lingonberry rush”-255_a.pdf
La Mela-Digitised newspapers and the geography of the nineteenth-century “lingonberry rush”-255_c.pdf

12:00pm - 12:15pm
Short Paper (10+5min) [abstract]

Sculpting Time: Temporality in the Language of Finnish Socialism, 1895–1917

Risto Turunen

University of Tampere,

The Grand Duchy of Finland had, in relative terms, the largest socialist party in Europe in 1907. The breakthrough of Finnish socialism has not yet been analyzed from the perspective of ‘temporality’, that is, the way human beings experience time. Socialists constructed their own versions of the past, present and future that differed from competing Christian and nationalist perceptions of time. This paper examines socialist experiences and expectations primarily through a quantitative analysis of Finnish handwritten and printed newspapers. Three main questions will be addressed by combining traditional conceptual-historical approaches with the corpus-linguistic methods of collocation, keyness and key collocation. First, what is the relation of the past, present and future in the language of socialism? Second, what are the key differences between socialist temporality and the non-socialist temporalities of the time? Third, how did the actual revolutionary moments of 1905 and 1917 affect socialist temporality and vice versa: did the revolution of time in the consciousness of the people lead to the time of revolution in Finland? The hypothesis is that identifying the changes in temporality will improve our historical understanding of the political ruptures in Finland in the early twentieth century. The results will be compared to Reinhart Koselleck’s theory of the ‘temporalization of concepts’ – in modernity, expectations towards the future supersede experiences of the past – and to Benedict Anderson’s theory of ‘imagined communities’, which suggests that the advance of print capitalism transformed temporality from vertical to horizontal. The paper forms part of my ongoing dissertation project, which merges close reading of archival sources with computational distant reading of digital materials, thus producing a macro-scale picture of the political language of Finnish socialism.

Turunen-Sculpting Time-198_a.pdf
Turunen-Sculpting Time-198_c.pdf

12:15pm - 12:30pm
Short Paper (10+5min) [abstract]

Two cases of meaning change in Finnish newspapers, 1820-1910

Antti Kanner

University of Helsinki,

In Finland, the nineteenth century saw the formation of a number of state institutions that came to define the political life of the Grand Duchy and of the subsequent independent republic. Alongside legal, political, economic and social institutions and organisations, Modern Finnish, as an institutionally standardised language, can be seen in this context as one of these institutions. As the majority of the residents of Finland were native speakers of Finnish dialects, adopting Finnish was necessary for the state’s purposes in extending its influence within the borders of the autonomous Grand Duchy. The widening domains of use of Finnish obviously also played an important role in the development of Finnish national identity. In the last quarter of the nineteenth century, Finnish started to gain ground as the language of administrative, legal and political discourse alongside Swedish. It is in this period that we find the crucial conceptual processes that shaped Finnish political history well into the twentieth century.

In this paper I will present two related case studies from my doctoral research, in which I seek to understand, in terms of linguistic semantics, the semantic similarity scores of so-called Semantic Vector Spaces obtained from large historical corpora. As historical corpora are collections of past speech acts, the view they provide of changing word meanings is influenced as much by pragmatic factors and writers’ intentions as by synchronic semantics. Understanding and explicating the historical context of observed processes is essential when studying the temporal dynamics of semantic change. To this end, I will try to reflect on the theoretical side of my work in the light of these cases of historical meaning change. My research falls under the heading of Finnish Language, but is closely related to history and computational linguistics.

The main data for my research comes from the National Library of Finland’s Newspaper Collection, which I use via the KORP service API provided by the Language Bank of Finland. The collection accessible via the API contains nearly all newspapers and periodicals published in Finland from 1771 to 1910. The collection is, however, very heterogeneous, as the press and other forms of printed public discourse in Finnish only developed in Finland during the nineteenth century. Historical variation in the conventions of typesetting, editing and orthography, as well as in the paper quality used for printing, makes it very difficult for OCR systems to recognize characters with 100 percent accuracy. Kettunen et al. (2014) estimated that OCR accuracy is actually somewhere between 60 and 80 percent. However, not all problems in the automatic recognition of the data stem from OCR errors or even historical spelling variation. Much is also due to linguistic factors: the nineteenth century saw large-scale dialectal, orthographical and lexical variation in written Finnish. To exemplify the scale of this variation: when a morphological analyser for Modern Finnish (OMORFI, Pirinen 2015) was used, it could only parse around 60 percent of the wordlist of the Corpus of Early Modern Finnish (CEMF).

Because of the unreliability of the automated parser’s results and the temporal heterogeneity inherent in the data, conducting the study with a methodology robust to these kinds of problems poses a challenge. The approach chosen was to use a number of analyses and to see whether their results could be combined to produce a coherent view of historical change in word use. In addition, simpler and more robust analyses were chosen over more advanced and elaborate ones. For example, an analysis similar to topic modelling was conducted using second-order collocations (Bertels & Speelman 2014; Heylen, Wielfaerts, Speelman & Geeraerts 2014) instead of algorithms such as LDA (Blei, Ng & Jordan 2003) that are widely used for this purpose. This was because the data contains a highly inflated count of individual types and lemmas resulting from the problems with OCR and morphological analysis. It seemed that, in this specific case at least, LDA was not able to produce sensible topics because the number of hapax legomena per text was so high. The analysis based on second-order collocations aimed not at producing a model of a system of topics, as LDA does, but simply at clustering the studied word’s collocates on the basis of their respective similarities. Likewise, when tracking changes in words’ syntactic positioning tendencies, simple morphological case distributions were used instead of resource-intensive syntactic parsing, which is also sensitive to errors in the data. When the task is to track signals of change, morphological case distributions can serve as sufficient proxies for dependency distributions, on the grounds that case selection in Finnish is mostly governed by syntax: case is used to express syntactic relations between, for example, the constituents of nominal phrases or a predicate verb and its arguments (Vilkuna 1989).
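The second-order collocation clustering described above can be sketched in rough outline: each collocate of the target word is represented by its own first-order collocate profile, and collocates with similar profiles are grouped together. The following is a minimal stdlib sketch under invented toy data and an arbitrary similarity threshold; it is not the author's actual implementation.

```python
import math
from collections import Counter, defaultdict

def collocates(corpus, window=2):
    """First-order collocate counts for every word in a tokenised corpus."""
    profiles = defaultdict(Counter)
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    profiles[w][sent[j]] += 1
    return profiles

def cosine(a, b):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_collocates(target, corpus, threshold=0.3):
    """Greedy single-link grouping of the target's collocates by the
    similarity of their own collocate profiles (second-order similarity)."""
    profiles = collocates(corpus)
    clusters = []
    for w in sorted(profiles[target]):
        for c in clusters:
            if any(cosine(profiles[w], profiles[v]) >= threshold for v in c):
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters
```

Because the profiles are compared rather than the surface forms themselves, two collocates can land in the same cluster even if they never co-occur directly, which is the property that makes the approach usable on sparse, noisy OCR data.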

The first of my case studies focuses on the Finnish word maaseutu. Maaseutu denotes a rural area, but in Modern Finnish it is mostly used as a collective singular referring to the rural as a whole. It is most commonly used as the opposite of the urban, which is often lexicalised as kaupunki, the city, used in a similar collective meaning. After its introduction to Finnish in the 1830s, however, maaseutu was used in a variety of related meanings, mostly referring to specific rural areas or communities, until the turn of the century, by which time the collective singular sense had become dominant. Starting roughly from the 1870s, there seems to have been a period of contesting uses: at that time we find a number of vague cases where the generic or collective and the specific meanings overlap.

Combining information from my analyses with newspaper metadata yields an image of a dynamic situation. The emergence of the collective singular stands out clearly and is connected to an accompanying discourse negotiating urban–rural relations on a national rather than a regional level. This change can be pinpointed quite precisely to the 1870s and to newspapers with a geographically wider circulation and a more national identity.

The second word of interest is vaivainen, an adjective referring to a person or thing either being of wretched or inadequate quality or suffering from a physical or mental ailment. When used as a noun, it refers to a person of very low and excluded social status and extreme poverty. In this meaning the word appears in Modern Finnish mostly in poetically archaic or historical contexts, but it had already disappeared from the vocabulary of social policy and social legislation by the early twentieth century. The word has a biblical background, being used in older Finnish Bible translations, for example in the Sermon on the Mount (as the equivalent of poor in Matt. 5:3, “blessed are the poor in spirit”), and as such it was a natural choice for naming the recipients of church charities. When the state poverty relief system started to take shape in the mid nineteenth century, it was built on top of earlier church organizations (Von Aerschot 1996), and the church terminology was carried over to the state institutions.

When the contexts of the word are tracked over the nineteenth century using context-word clusters based on second-order collocations, two clear discursive trends appear. First, the poverty relief discourse, already pronounced in the data in the 1860s, disperses into a complex network of different topics and discursive patterns: as the state-run poverty relief institutions become more complex and more efficiently administered, the moral foundations of the whole enterprise are discussed alongside reports of the everyday comings and goings of individual institutions or, indeed, tales of individual relief recipients’ fortunes. The other trend involves the presence of religious or spiritual discourse which, contrary to preliminary assumptions, does not wane into the background but experiences a strong surge in the 1870s and 1880s. This can be explained in part by the growth of revivalist Christian publications in the National Library Corpus, but also by the intrusion of Christian connotations into the political discussion of the poverty relief system. It is as if the word vaivainen functioned as a kind of lightning rod, conducting Christian morality into the public poverty relief discourse.

While the methodological contributions of this paper are not highly ambitious in terms of the language technology or computational algorithms used, the selection of analyses presents an innovative approach to Digital Humanities. The aim has been to combine not just one but an array of simple and robust methods from computational linguistics with theoretical background and analytical concepts from lexical semantics. I argue that the robustness and simplicity of the methods make the overall workflow more transparent, and this transparency makes it easier to interpret the results in a wider historical context. This allows one to ask questions whose relevance is not confined to computational linguistics or lexical semantics, but expands to wider areas of Humanities scholarship. This shared relevance of questions and answers, to my understanding, lies at the core of Digital Humanities.


Bertels, A. & Speelman, D. (2014). “Clustering for semantic purposes. Exploration of semantic similarity in a technical corpus.” Terminology 20:2, pp. 279–303. John Benjamins Publishing Company.

Blei, D., Ng, A. Y. & Jordan, M. I. (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (4–5), pp. 993–1022.

CEMF, Corpus of Early Modern Finnish. Centre for Languages in Finland.

Heylen, K., Peirsman, Y., Geeraerts, D. & Speelman, D. (2008). “Modelling Word Similarity: An Evaluation of Automatic Synonymy Extraction Algorithms.” Proceedings of LREC 2008.

Huhtala, H. (1971). Suomen varhaispietistien ja rukoilevaisten sanankäytöstä: semanttis-aatehistoriallinen tutkimus. [On the vocabulary of the early Finnish pietist and revivalist movements]. Suomen Teologinen Kirjallisuusseura.

Kettunen, K., Honkela, T., Lindén, K., Kauppinen, P., Pääkkönen, T. & Kervinen, J. (2014). “Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods”. In IFLA World Library and Information Congress Proceedings: 80th IFLA General Conference and Assembly. Lyon, France.

Pirinen, T. (2015). “Omorfi—Free and open source morphological lexical database for Finnish”. In Proceedings of the 20th Nordic Conference of Computational Linguistics NODALIDA 2015.

Vilkuna, M. (1989). Free word order in Finnish: Its syntax and discourse functions. Suomalaisen Kirjallisuuden Seura.

Von Aerschot, P. (1996). Köyhät ja laki: toimeentulotukilainsäädännön kehitys oikeudellistumisprosessien valossa. [The poor and the law: the development of Finnish welfare legislation in light of juridification processes.] Suomalainen Lakimiesyhdistys.

Kanner-Two cases of meaning change in Finnish newspapers, 1820-1910-245_a.pdf
11:00am - 12:30pmT-P674-1: Place
Session Chair: Christian-Emil Smith Ore
11:00am - 11:30am
Long Paper (20+10min) [publication ready]

SDHK meets NER: Linking place names with medieval charters and historical maps

Olof Karsvall2, Lars Borin1

1University of Gothenburg; 2Swedish National Archives

Mass digitization of historical text sources opens new avenues for research in the humanities and social sciences, but also presents a host of new methodological challenges. Historical text collections become more accessible, but new research tools must also be put in place in order to fully exploit the new research possibilities emerging from having access to vast document collections in digital format. This paper highlights some of the conditions to consider when place names in an older source material, in this case medieval charters, are to be matched to geographical data. The Swedish National Archives make some 43,000 medieval letters available in digital form through an online search facility. The volume of the material is such that manual markup of names will not be feasible. In this paper, we present the material, discuss the promises for research of linking, e.g., place names to other digital databases, and report on an experiment where an off-the-shelf named-entity recognition system for modern Swedish is applied to this material.

Karsvall-SDHK meets NER-167_a.pdf

11:30am - 11:45am
Distinguished Short Paper (10+5min) [publication ready]

On Modelling a Typology of Geographic Places for the Collaborative Open Data Platform histHub

Manuela Weibel, Tobias Roth

Schweizerisches Idiotikon

HistHub will be a platform for Historical Sciences providing authority records for interlinking and referencing basic entities such as persons, organisations, concepts and geographic places within an ontological framework. For the case of geographic places, a draft of a place typology is presented here. Such a typology will be needed for semantic modelling in an ontology. We propose a hierarchical two-step model of geographic place types: a more generic type remaining stable over time that will ultimately be incorporated into the ontology as the essence of the identity of a place, and a more specific type closer to the nature of the place the way it is actually perceived by humans.

Our second approach on the way to a place typology is decidedly bottom-up. We try to standardise the place types in our database of heterogeneous toponymic data, using the place types already present as well as textual descriptions and name matches with typed external data sources. The types used in this standardisation process are basic conceptual units that are most likely to play a role in any place typology yet to be established. Standardisation at this early stage yields comprehensive and deep knowledge of our data, which helps us develop a good place typology.

Weibel-On Modelling a Typology of Geographic Places for the Collaborative Open Data Platform histHub-138_a.pdf
Weibel-On Modelling a Typology of Geographic Places for the Collaborative Open Data Platform histHub-138_c.pdf

11:45am - 12:00pm
Distinguished Short Paper (10+5min) [publication ready]

Geocoding, Publishing, and Using Historical Places and Old Maps in Linked Data Applications

Esko Ikkala1, Eero Hyvönen1,2, Jouni Tuominen1,2

1Aalto University, Semantic Computing Research Group (SeCo); 2University of Helsinki, HELDIG – Helsinki Centre for Digital Humanities

This paper presents a Linked Open Data brokering service prototype for using and maintaining historical place gazetteers and maps based on distributed SPARQL endpoints. The service introduces several novelties: First, the service facilitates collaborative maintenance of geo-ontologies and maps in real time as a side effect of annotating contents in legacy cataloging systems. The idea is to support a collaborative ecosystem of curators that creates and maintains data about historical places and maps in a sustainable way. Second, in order to foster understanding of historical places, the places can be provided on both modern and historical maps, and with additional contextual Linked Data attached. Third, since data about historical places is typically maintained by different authorities and in different countries, the service can be used and extended in a federated fashion, by including new distributed SPARQL endpoints (or other web services with a suitable API) into the system.

Ikkala-Geocoding, Publishing, and Using Historical Places and Old Maps-215_a.pdf
Ikkala-Geocoding, Publishing, and Using Historical Places and Old Maps-215_c.pdf

12:00pm - 12:15pm
Short Paper (10+5min) [abstract]

Using ArcGIS Online and Story Maps to visualise spatial history: The case of Vyborg

Antti Härkönen

University of Eastern Finland

Historical GIS (HGIS) or spatially oriented history is a field that uses geoinformatics to look at historical phenomena from a spatial perspective. GIS tools are used to visualize, manage and analyze geographical data. However, the use of GIS tools requires some technical expertise and ready-made historical spatial data is almost non-existent, which significantly reduces the reach of HGIS. New tools should make spatially oriented history more accessible.

Esri’s ArcGIS Online (AGOL) allows web visualisations to be made of maps and map layers created with Esri’s more traditional desktop GIS program, ArcMap. In addition, the Story Maps tool allows the creation of more visually pleasing presentations using maps, text and multimedia resources. I will demonstrate the use of Story Maps to represent spatial change in the case of the city of Vyborg.

The city of Vyborg lies in Russia near the Finnish border. A small town grew up near the castle founded by the Swedes in 1293. Vyborg was granted town privileges in 1403, and later in the fifteenth century it became one of the very few walled towns in the Kingdom of Sweden. The town was located on a hilly peninsula near the castle. Until the seventeenth century the town space was ‘medieval’, i.e. irregular; the town was regulated to conform to a rectangular street layout in the 1640s. I show the similarities between the old and new town plans by superimposing them on a map.

The Swedish period ended when the Russians conquered Vyborg in 1710. Vyborg became a provincial garrison town and administrative center. Later, when Russia conquered the rest of Finland in 1809, the province of Vyborg (aka ‘Old Finland’) was added to the autonomous Grand Duchy of Finland, a part of the Russian empire. During the nineteenth century Vyborg became an increasingly important trade and industrial center, and its population grew rapidly. I map the expanding urban areas using old town plans and population statistics.

Another perspective on the changing town space is the growth of fortifications around Vyborg. As the range of artillery grew, the fortifications were pushed further and further outside the original town. I use Story Maps to show the positions of fortifications of different eras by placing them in the context of the terrain. I also employ viewshed analyses to show how the fortifications dominate the terrain around them.

Härkönen-Using ArcGIS Online and Story Maps to visualise spatial history-136_a.pdf

12:15pm - 12:30pm
Short Paper (10+5min) [abstract]


Kimmo Elo1, Virpi Kivioja2

1University of Helsinki; 2University of Turku

This paper is based on an ongoing PhD project entitled “An international triangle drama?”, which studies the depictions of West Germany and East Germany in Finnish geography textbooks, and the depictions of Finland in West German and East German geography textbooks, in the Cold War era. The primary source material consists of Finnish, West German, and East German geography textbooks published between 1946 and 1999.

Contrary to the traditional methods of close reading thus far applied in schoolbook analysis, this paper presents an exploratory approach based on computational analysis of a large book corpus. The corpus consists of geography schoolbooks used in the Federal Republic of Germany between 1946 and 1999, and in the German Democratic Republic between 1946 and 1990. The corpus has been created by digitising all the books, applying OCR technologies to the scanned page images, and has been post-processed by correcting OCR errors and adding metadata.

The main aim of the paper is to extract and analyse conceptual geocollocations. Such an analysis focuses on how concepts are embedded geospatially, on the one hand, and how geographical entities (cities, regions, etc.) are conceptually embedded, on the other. Regarding the former, the main aim is to examine and explain the geospatial distribution of terms and concepts. Regarding the latter, the main focus is on the analysis of concept collocations surrounding geographical entities.

The analysis presented in the paper consists of four steps. First, standard methods of text mining are used to identify geographical concepts (names of regions, cities, etc.). Second, concepts and terms in the close neighbourhood of the geographical terms are tagged with geocodes. Third, network analysis is applied to create concept networks around geographical entities. Fourth, both the geotagged and the network data are enriched with bibliographical metadata, allowing comparisons over time and between countries.
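The first three steps above might be sketched roughly as follows, with a hand-made place list standing in for the text-mining step and a plain dictionary of counters as the network structure. All names and data here are invented for illustration; the actual project would presumably use proper NER and a dedicated graph library.

```python
from collections import Counter, defaultdict

# Hypothetical set of geographic entities (step 1 would normally come
# from text mining / NER rather than a hand-made list).
PLACES = {"Helsinki", "Berlin", "Leipzig"}

def concept_network(corpus, window=3):
    """Steps 2-3: collect concepts in the close neighbourhood of each
    geographic term and build a place -> concept co-occurrence network."""
    network = defaultdict(Counter)
    for sentence in corpus:
        for i, token in enumerate(sentence):
            if token in PLACES:
                lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
                for neighbour in sentence[lo:i] + sentence[i + 1:hi]:
                    if neighbour not in PLACES:
                        network[token][neighbour] += 1
    return network

corpus = [
    ["Berlin", "ist", "ein", "Industriezentrum"],
    ["Helsinki", "ist", "eine", "Hafenstadt"],
]
net = concept_network(corpus)
print(net["Berlin"]["Industriezentrum"])  # 1
```

Step four would then attach each edge count to the book's metadata (country, publication year), so the same network can be sliced for comparison over time and between countries.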

The paper adopts several methods to visualise the analytical results. Geospatial plots are used to visualise the geographical distribution of a concept and its changes over time. Network graphs are used to visualise collocation structures and their dynamics. An important function of the graphs, however, is to exemplify how graphical visualisations can be used to represent historical knowledge and how they can help us tackle change and continuity from a comparative perspective.

Concerning historical research from a more general perspective, one of the main objectives of this paper is to exemplify and discuss how computational methods could be applied to tackle research questions typical of the social sciences and historical research. The paper is motivated by the big challenge of moving away from computational history guided and limited by the tools and methods of the computational sciences, towards an understanding that computational history requires computational tools developed to answer questions typical of and crucial for historical research. All tools, data and methods developed during this research project will later be made available for scholars interested in similar topics, thus helping them to take advantage of this project.

12:30pm - 2:00pmLunch
Main Building of the University, entrance from the Senate Square side
2:00pm - 3:30pmT-PII-2: Cultural Heritage and Art
Session Chair: Bente Maegaard
2:00pm - 2:30pm
Long Paper (20+10min) [publication ready]

Cultural Heritage `In-The-Wild': Considering Digital Access to Cultural Heritage in Everyday Life

David McGookin, Koray Tahiroglu, Tuomas Vaittinen, Mikko Kyto, Beatrice Monastero, Juan Carlos Vasquez

Aalto University,

As digital cultural heritage applications begin to be deployed outwith `traditional' heritage sites, such as museums, there is an increased need to consider their use amongst individuals who are open to learning about the heritage of a site, but where that is a clearly secondary purpose for their visit. Parks, recreational areas and the everyday built environment represent places that although rich in heritage, are often not visited primarily for that heritage. We present the results of a study of a mobile application to support accessing heritage on a Finnish recreational island. Evaluation with 45 participants, who were not there primarily to access the heritage, provided insight into how digital heritage applications can be developed for this user group. Our results showed how low immersion and lightweight interaction support individuals to integrate cultural heritage around their primary visit purpose, and although participants were willing to include heritage as part of their visit, they were not willing to be directed by it.

McGookin-Cultural Heritage `In-The-Wild-200_a.pdf

2:30pm - 2:45pm
Short Paper (10+5min) [publication ready]

Negative to That of Others, But Negligent of One’s Own? On Patterns in National Statistics on Cultural Heritage in Sweden

Daniel Brodén

Gothenburg University, Sweden,

In 2015–2016 the Centre for Critical Heritage Studies conducted an interdisciplinary pilot project in collaboration with the SOM-institute at the University of Gothenburg. A key ambition was to demonstrate the usefulness of combining an analysis rooted in the field of critical heritage studies and a statistical perspective. The study was based on a critical discussion of the concept of cultural heritage and collected data from the nationwide SOM-surveys.

The abstract will highlight some significant patterns in the SOM data from 2015 concerning differences between people regarding activities that are traditionally associated with national cultural heritage and cultural heritage institutions: 1) women are more active than men when it comes to activities related to national cultural heritage; 2) class and education are also significant factors in this context. Since these patterns have been shown in prior research, perhaps the most interesting finding is that 3) people who are negative towards immigration from ‘other’ cultures participate to a lesser extent in activities associated with their ‘own’ cultural heritage.

Brodén-Negative to That of Others, But Negligent of One’s Own-114_a.doc

2:45pm - 3:00pm
Distinguished Short Paper (10+5min) [publication ready]

Engaging Collections and Communities: Technology and Interactivity in Museums

Paul Arthur

Edith Cowan University,

Museum computing is a field with a long history, dating from at least the 1950s, that has made a substantial impact on humanities computing, now called ‘digital humanities.’ Community access, public engagement, and participation are central to the charter of most museums, and interactive displays are one strategy used to help fulfil that goal. Over the past two decades, interactive elements have been developed to offer more immersive, realistic and engaging possibilities by incorporating motion-sensing spaces, speech recognition, networked installations, eye tracking, and multitouch tables and surfaces. As museums began to experiment with digital technologies, there was an accompanying change of emphasis and policy: museums aimed to connect themselves more consciously with popular culture by experimenting with the presentation of their collections in ways that would result in increased public appreciation and accessibility. In this paper these shifts are investigated in relation to interactive exhibits, virtual museums, and the profound influence of the database, and in terms of a wider breaking down of institutional barriers and hierarchies, resulting in trends towards increasing collaboration.

Arthur-Engaging Collections and Communities-270_a.pdf
Arthur-Engaging Collections and Communities-270_c.pdf

3:00pm - 3:15pm
Short Paper (10+5min) [abstract]

Art of the Digital Natives and Predecessors of Post-Internet Art

Raivo Kelomees

Estonian Academy of Arts, Estonia,

The new normal, the digital environment surrounding us, has in recent years surprised us, at least in the fine arts, with the internet's content returning to physical space. Whether this is due to pressure from the galleries or something else, it is in any case clearer than ever that the audience is not separable from habitual space; there is a huge and primal demand for physical or material art.

Christiane Paul, in her article "Digital Art Now: The Evolution of the Post-Digital Age" in the "ARS17: Hello World!" exhibition catalogue, is critical of the exhibition. Her main message is that all this has been done before. In itself the statement lacks originality, but in the context of post-internet apologists declaring the birth of a new mentality and the arrival of a new "after experiencing the internet" and "post-digital" generation, it becomes clear that the criticism is rather like shooting fish in a barrel, because art that is critical of the digital and the interactive has existed since the 1990s, as have works concerned with the physicalisation of the digital experience.

The background to the exhibition is the discussion over "digitally created" art and the generation related to it. The notion of "digital natives" is related to the post-digital and post-internet generation and the notion of "post-contemporary" (i.e. art is not concerned with the contemporary but with the universal human condition). Apparently for the digital natives, the internet is not a way out of the world anymore, but an original experience in which the majority of their time is spent. At the same time, however, the internet is a natural information environment for people of all ages whose work involves data collection and intellectual work. Communication, thinking, information gathering and creation – all of these realms are related to the digital environment. These new digital nomads travel from place to place and work in a "post-studio" environment.

While digital or new media art was created, stored and shared via digital means, post-digital art addresses the digital without being stored by those same means. In other words, this kind of art exists more in physical space.

Considerable reference is also made to James Bridle's "new aesthetic" concept from 2012. In short, this refers to the convergence and conjoining of the virtual and physical worlds. It manifests itself clearly even in the "pixelated" design of consumer goods, or in the oeuvre of sculptors and painters whose work has emerged from something digital. For example, the art objects by Shawn Smith and Douglas Coupland are made using pixel blocks (the sculpture by the latter is indeed reminiscent of a low-resolution digital image). Analogous works induce confusion, not to say a surprising experience, in the minds of the audience, for they bring the virtual quality of the computerised environment into physical surroundings. This makes the artworks appear odd and surreal, like some sort of mistake: errors, images and objects out of place.

The so-called postinternet generation artists are certainly not the only ones making this kind of art. As an example of this, there is a reference to the abstract stained glass collage of 11,500 pixels by Gerhard Richter in the Cologne Cathedral. It is supposed to be a reference to his 1974 painting "4096 Farben" (4096 colours), which indeed is quite similar. It is said that Richter did not accept a fee; however, the material costs were covered by donations. And yet the cardinal did not come to the opening of the glasswork, preferring depictions of Christian martyrs over abstract windows, which instead reminded him of mosques.

One could name other such examples inspired by the digital world or by schisms between the digital and physical worlds: Helmut Smits' "Dead Pixel in Google Earth" (2008); Aram Bartholl's "Map" (2006); the projects by Eva and Franco Mattes, especially the printouts of Second Life avatars from 2006; and Achim Mohné's and Uta Kopp's project "Remotewords" (2007–2011), computer-based instructions printed on rooftops to be seen from Google Maps, satellites or planes. There are countless examples where it is hard to discern whether the artist is deliberately and critically minded towards digital art, or is rather a representative of the post-digital generation who is not aware of, and wishes not to be part of, the history of digital art.

From the point of view of researchers of digital culture, the so-called media-archaeological direction could be added to this as an inspirational source for artists today. Media archaeology, the examination of previous art and cultural experience in relation to contemporary media machines and practices, signifies the exploration of earlier non-digital cultural devices, equipment, means of communication, and so on, that can be regarded as the pre-history of today's digital culture and digital devices. From this point of view, the "media-archaeological" artworks of Toshio Iwai or Bernie Lubell belong together: they have taken an earlier "media machine", or a scientific or technical device, and created a modern work on the basis of it.

Then there was the "Ars Electronica" festival (2006) that focused on the umbrella topic "Simplicity", which in a way turned its back on the "complexity" of digital art and returned to the physical space.

Therefore, in the context of digital media based art trends, the last couple of decades have seen many expressions – works, events and exhibitions – of "turning away" from the digital environment that would outwardly qualify as post-digital and post-internet art.

Kelomees-Art of the Digital Natives and Predecessors of Post-Internet Art-212_a.pdf
Kelomees-Art of the Digital Natives and Predecessors of Post-Internet Art-212_c.pdf

3:15pm - 3:30pm
Short Paper (10+5min) [abstract]

The Stanley Rhetoric: A Procedural Analysis of VR Interactions in 3D Spatial Environments of Stanley Park, BC

Raluca Fratiloiu

Okanagan College

In a seminal text on the language of new media, Manovich (2002) argued:

Traditionally, texts encoded human knowledge and memory, instructed, inspired, convinced, and seduced their readers to adopt new ideas, new ways of interpreting the world, new ideologies. In short, the printed word was linked to the art of rhetoric. While it is probably possible to invent a new rhetoric of hypermedia […] the sheer existence and popularity of hyperlinking exemplifies the continuing decline of the field of rhetoric. (Manovich, 2002)

Depending on the context of each “rhetorical situation” (Bitzer, 1968), it may be both good and bad news to think that interactivity and rhetoric might not always go hand in hand. However, despite the decline of rhetoric anticipated by Manovich (2002), in this paper we propose a closer examination of what constitutes a rhetorically effective discourse in new media in general and in virtual reality (VR) in particular. The reason we need to examine it more closely is that VR, especially when it has an educational goal, needs to be rhetorically effective to succeed with audiences. A consideration of the rhetorical impact of VR’s affordances may enhance the potential of meaningful interactions with students and users.

In addition to a very long disciplinary history, rhetoric has been investigated in relation to new media mainly through Bogost’s (2007) concept of “procedural rhetoric”. He argued that although “rhetoric was understood as the art of oratory”, “videogames open a new domain for persuasion, thanks to their core representational mode, procedurality” (Bogost, 2007). This has implications, according to Bogost (2007), in three areas: politics, advertising and learning. Several of these implications have already been investigated. Procedural rhetorical analysis of videogames has since become a core methodological approach. Particular attention has also been paid to how new media open new possibilities through play, and how this in turn creates a renewed interest in digital rhetoric (Daniel-Wariya, 2016). At the same time, procedural rhetoric has also been investigated at length in connection with learning through games (Gee, 2007). Learning has also been central in a few studies of VR in education (Dalgarno, 2010). However, specific assessments of the procedural rhetoric outcomes of particular VR educational projects are non-existent.

In this paper, we will focus on analysing procedural interactions in a VR project developed by the University of British Columbia’s Emerging Media Lab. This project, funded via an open education grant, led to the creation of a 3D spatial environment of Stanley Park in Vancouver, British Columbia (BCCampus, 2017). The project treats Stanley Park, one of the most iconic Canadian destinations, as an experiential field trip, using educational content and 3D spatial environment models of Prospect Point, Beaver Lake, Lumberman’s Arch, and the Hollow Tree. Students will have opportunities to visit these locations in the park virtually, interact with the environment, and interact remotely with other learners. In addition, VR provides opportunities to explore the complex history of this impressive location that was once home to the Burrard, Musqueam and Squamish First Nations people (City of Vancouver, 2017).

This case analysis may open up new possibilities for investigating how students and users derive meaning from interacting in these environments, and continue a dialogue between several connected areas: education and VR, games and pedagogy, games and procedural rhetoric. We also hope to contribute this feedback to the project as it continues to evolve, and to share its results with the wider open education community.


BCCampus. (2017, May 10). Virtual reality and augmented reality field trips funded by OER grants. Retrieved from BCCampus:

Bitzer, L. (1968). The rhetorical situation. Philosophy and Rhetoric, 1, 1-14.

Bogost, I. (2007). Persuasive Games: The Expressive Power of Videogames. Cambridge, MA: MIT Press.

City of Vancouver. (2017). The History of Stanley Park. Retrieved from City of Vancouver:

Dalgarno, B. (2010). What are the learning affordances of 3-D virtual environments? British Journal of Educational Technology, 41(1), 10-32.

Daniel-Wariya, J. (2016). A language of play: New media’s possibility spaces. Computers and Composition, 40, 32-47.

Gee, J. P. (2007). What Video Games Have to Teach Us About Learning and Literacy (2nd ed.). New York: St. Martin's Griffin.

Manovich, L. (2002). The Language of New Media. Cambridge, MA: MIT Press.

Fratiloiu-The Stanley Rhetoric-122_a.pdf
Fratiloiu-The Stanley Rhetoric-122_c.pdf
2:00pm - 3:30pmT-PIII-2: Language Resources
Session Chair: Kaius Sinnemäki
2:00pm - 2:30pm
Long Paper (20+10min) [publication ready]

Sentimentator: Gamifying Fine-grained Sentiment Annotation

Emily Sofi Öhman, Kaisla Kajava

University of Helsinki

We introduce Sentimentator, a publicly available gamified web-based annotation platform for fine-grained sentiment annotation at the sentence level. Sentimentator is unique in that it moves beyond binary classification: we use a ten-dimensional model which allows for the annotation of over 50 unique sentiments and emotions. The platform is heavily gamified, with a scoring system designed to reward players for high-quality annotations. Sentimentator introduces several features that have previously been unavailable, or at best very limited, for sentiment annotation. In particular, it provides streamlined multi-dimensional annotation optimized for sentence-level annotation of movie subtitles. The resulting dataset will allow new avenues to be explored, particularly in the field of digital humanities, but also in knowledge-based sentiment analysis in general. Because both the dataset and the platform will be made publicly available, they will benefit anyone interested in fine-grained sentiment analysis and emotion detection, as well as in the annotation of other datasets.


2:30pm - 2:45pm
Distinguished Short Paper (10+5min) [publication ready]

Defining a Gold Standard for a Swedish Sentiment Lexicon: Towards Higher-Yield Text Mining in the Digital Humanities

Jacobo Rouces, Lars Borin, Nina Tahmasebi, Stian Rødven Eide

University of Gothenburg

There is an increasing demand for multilingual sentiment analysis, and most work on sentiment lexicons is still carried out based on English lexicons like WordNet. In addition, many of the non-English sentiment lexicons that do exist have been compiled by (machine) translation from English resources, thereby arguably obscuring possible language-specific characteristics of sentiment-loaded vocabulary.

In this paper we describe the creation of a gold standard for the sentiment annotation of Swedish terms as a first step towards the creation of a full-fledged sentiment lexicon for Swedish, i.e., a lexicon containing information about the prior sentiment (also called polarity) values of lexical items (words or disambiguated word senses) along a negative–positive scale. We create a gold standard for sentiment annotation of Swedish terms, using the freely available SALDO lexicon and the Gigaword corpus. For this purpose, we employ a multi-stage approach combining corpus-based frequency sampling and two stages of human annotation: direct score annotation followed by Best-Worst Scaling. In addition to obtaining a gold standard, we analyze the data from our process and draw conclusions about the optimal sentiment model.
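The Best-Worst Scaling stage mentioned above can be illustrated with a small counting sketch. The trial data below is hypothetical, and the scoring is the standard best-minus-worst counting score for BWS, not necessarily the authors' exact implementation:

```python
from collections import Counter

def bws_scores(trials):
    """Compute Best-Worst Scaling scores.

    Each trial is (items, best, worst): the tuple of terms shown to an
    annotator, the term judged most positive, and the term judged most
    negative.  A term's score is (#best - #worst) / #appearances,
    yielding a value in [-1, 1].
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in trials:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {term: (best[term] - worst[term]) / seen[term] for term in seen}

# Hypothetical annotation trials over four Swedish terms
trials = [
    (("glad", "arg", "hus", "rolig"), "glad", "arg"),
    (("glad", "arg", "hus", "rolig"), "rolig", "arg"),
]
scores = bws_scores(trials)
```

With these toy trials, "arg" (chosen worst in both) scores -1.0, while the neutral "hus" scores 0.0, giving the kind of negative-to-positive scale the lexicon needs.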

Rouces-Defining a Gold Standard for a Swedish Sentiment Lexicon-209_a.pdf
Rouces-Defining a Gold Standard for a Swedish Sentiment Lexicon-209_c.pdf

2:45pm - 3:00pm
Short Paper (10+5min) [publication ready]

The Nordic Tweet Stream: A dynamic real-time monitor corpus of big and rich language data

Mikko Laitinen1, Jonas Lundberg2, Magnus Levin2, Rafael Martins2

1University of Eastern Finland; 2Linnaeus University

This article presents the Nordic Tweet Stream (NTS), a cross-disciplinary corpus project of computer scientists and a group of sociolinguists interested in language variability and in the global spread of English. Our research integrates two types of empirical data: we not only rely on traditional structured corpus data but also use unstructured data sources that are often big and rich in metadata, such as Twitter streams. The NTS downloads tweets and associated metadata from Denmark, Finland, Iceland, Norway and Sweden. We first introduce some technical aspects of creating a dynamic real-time monitor corpus, and the following case study illustrates how the corpus can be used as empirical evidence in sociolinguistic studies focusing on the global spread of English to multilingual settings. The results show that English is the most frequently used language, accounting for almost a third of all tweets. These results can be used to assess how widespread English use is in the Nordic region and offer a big data perspective that complements previous small-scale studies. Future objectives include annotating the material, making it available to the scholarly community, and expanding the geographic scope of the data stream outside the Nordic region.
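The headline figure (English as the most frequent language, at almost a third of tweets) is the kind of distributional summary that falls out of the stream's metadata. A minimal sketch, assuming each downloaded tweet carries a `lang` code as in the public Twitter API (the sample data is invented):

```python
from collections import Counter

def language_shares(tweets):
    """Return each language's share of the total tweet count,
    based on the `lang` metadata field attached to every tweet."""
    counts = Counter(t["lang"] for t in tweets)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# Hypothetical sample of tweet metadata from the Nordic stream
sample = [{"lang": "en"}, {"lang": "sv"}, {"lang": "en"},
          {"lang": "fi"}, {"lang": "no"}, {"lang": "en"}]
shares = language_shares(sample)
```

Run over the full monitor corpus rather than this toy sample, the same computation yields the per-language proportions reported in the abstract.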

Laitinen-The Nordic Tweet Stream-201_a.pdf
Laitinen-The Nordic Tweet Stream-201_c.pdf

3:00pm - 3:15pm
Short Paper (10+5min) [publication ready]

Best practice for digitising small-scale Digital Humanities projects

Peggy Bockwinkel, Dîlan Cakir

University of Stuttgart, Germany

Digital Humanities (DH) are growing rapidly; the necessary infrastructure is being built up gradually and slowly. For smaller DH projects, e.g. for testing methods, as preliminary work for submitting applications, or for use in teaching, a corpus often has to be digitised. These small-scale projects make an important contribution to safeguarding and making available cultural heritage, as they make it possible to machine-read those resources that are of little or no interest to large projects because they are too special or too limited in scope. They close the gap between the large scanning projects of archives, libraries or research projects and projects that move beyond the canonised paths.

Yet these small projects can fail at this first step of digitisation, because it is often a hurdle for (Digital) Humanists at universities to get the desired texts digitised: either because the digitisation infrastructure in libraries/archives is not available (yet), or because it is a paid service. Also, researchers are often not digitising experts, and a suitable infrastructure at the university is missing.

In order to promote small DH projects for teaching purposes, a digitising infrastructure was set up at the University of Stuttgart as part of a teaching project. It should enable teachers to digitise smaller corpora autonomously.

This article presents a study that was carried out as part of this teaching project. It suggests how to implement best practices and which aspects of the digitisation workflow need special attention.

The target group of this article are (Digital) Humanists who want to digitise a smaller corpus. Even with no expertise in scanning and OCR and no possibility to outsource the digitisation, they would still like to obtain the best possible machine-readable files.

Bockwinkel-Best practice for digitising small-scale Digital Humanities projects-254_a.pdf
Bockwinkel-Best practice for digitising small-scale Digital Humanities projects-254_c.pdf

3:15pm - 3:30pm
Distinguished Short Paper (10+5min) [publication ready]

Creating and using ground truth OCR sample data for Finnish historical newspapers and journals

Kimmo Kettunen, Jukka Kervinen, Mika Koistinen

University of Helsinki, Finland

The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12 million pages, mainly in Finnish and Swedish. Out of these, about 5.1 million pages are freely available online. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1920. The last ten years, 1911–1920, were opened in February 2017.

This paper presents the ground truth Optical Character Recognition data of about 500 000 Finnish words that has been compiled at NLF for development of a new OCR process for the collection. We discuss compilation of the data and show basic results of the new OCR process in comparison to current OCR using the ground truth data.
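Comparing OCR output against ground truth of this kind is typically done with an edit-distance-based error rate. A minimal word error rate sketch (illustrative only, not the NLF evaluation code; the sample strings are invented):

```python
def word_error_rate(truth, ocr):
    """Levenshtein distance between the word sequences of a ground-truth
    line and an OCR line, divided by the number of ground-truth words."""
    t, o = truth.split(), ocr.split()
    prev = list(range(len(o) + 1))          # DP row: distance to empty prefix
    for i, tw in enumerate(t, 1):
        cur = [i]
        for j, ow in enumerate(o, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (tw != ow))) # substitution / match
        prev = cur
    return prev[-1] / len(t)

# One OCR word error ("kansalliskirjasta") in five ground-truth words
rate = word_error_rate("suomen kansalliskirjasto on digitoinut sanomalehtia",
                       "suomen kansalliskirjasta on digitoinut sanomalehtia")
```

The same comparison at the character level gives a character error rate, which is often reported alongside word-level figures when evaluating OCR of historical print.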

Kettunen-Creating and using ground truth OCR sample data for Finnish historical newspapers and journals-115_a.pdf
Kettunen-Creating and using ground truth OCR sample data for Finnish historical newspapers and journals-115_c.pdf
2:00pm - 3:30pmT-PIV-2: Authorship
Session Chair: Jani Marjanen
2:00pm - 2:30pm
Long Paper (20+10min) [abstract]

Extracting script features from a large corpus of handwritten documents

Lasse Mårtensson1, Anders Hast2, Ekta Vats2

1Högskolan i Gävle, Sweden; 2Uppsala universitet, Sweden

Before the advent of the printing press, the only way to create a new piece of text was to produce it by hand. The medieval text culture was almost exclusively a handwritten one, even though printing began towards the very end of the Middle Ages. As a consequence of this, the medieval text production is very much characterized by variation of various kinds: regarding language forms, regarding spelling and regarding the shape of the script. In the current presentation, the shape of the script is in focus, an area referred to as palaeography. The introduction of computers has changed this discipline radically, as computers can handle very large amounts of data and furthermore measure features that are difficult to deal with for a human researcher.

In the current presentation, we will demonstrate two investigations within digital palaeography, carried out on the medieval Swedish charter corpus in its entirety, to the extent that this has been digitized. The script in approximately 14 000 charters has been measured and accounted for, regarding aspects described below. The charters are primarily in Latin and Old Swedish, but there are also a few in Middle Low German. The overall purpose for the investigations is to search for script features that may be significant from the perspective of separating one scribe from another, i.e. scribal attribution. As the investigations have been done on the entire available charter corpus, it is possible to visualize how each separate charter relates to all the others, and furthermore to see how the charters may divide themselves into clusters on the basis of similarity regarding the investigated features.

The two investigations both focus on aspects that have been looked upon as significant from the perspective of scribal attribution, but that are very difficult to measure, at least with any degree of precision, without the aid of computers. One of the investigations belongs to a set of methods often referred to as Quill Features. This method focuses, as the name states, on how the scribe has moved the pen over the script surface (parchment or paper). The medieval pen, the quill, consisted of a feather that had been hardened, truncated and split at the top. This construction created variation in width in the strokes constituting the script, mainly depending on the direction in which the pen was moved, and also depending on the angle in which the scribe had held the pen. This is what this method measures: the variation between thick and thin strokes, in relation to the angle of the pen. This method has been used on medieval Swedish material before, namely a medieval Swedish manuscript (Cod. Ups. C 61, 1104 pages), but the current investigation accounts for ten times the size of the previous investigation, and furthermore, we employ a new type of evaluation (see below) of the results that to our knowledge has not been done before.

The second investigation focuses on the relations between script elements of different heights, and the proportions between these. For instance, three different formations can be discerned among the vertical script elements: minims (e.g. in ‘i’, ‘n’ and ‘m’), ascenders (e.g. in ‘b’, ‘h’ and ‘k’) and descenders (e.g. in ‘p’ and ‘q’). The ascender can extend to varying degrees above the minim, and the descender can extend to varying degrees below the minim, creating different proportions between the components. These measures have also been extracted from the entire available medieval Swedish charter corpus, and display very interesting information from the perspective of scribal identity. It should be noted that the first line of a charter is often divergent from the rest of the charter in this respect, as the ascenders here often extend higher than elsewhere. In a similar way, the descenders of the last line of a charter often extend further down below the line than in the rest of the charter. In order for a representative measure to be gained from a charter, these two lines must be disregarded.
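The proportion measure described above reduces, in essence, to ratios between three vertical extents. A toy sketch, under the assumption that each vertical stroke has already been segmented into a (top, bottom) extent relative to the writing line (all names and numbers here are hypothetical, not the project's actual pipeline):

```python
def script_proportions(strokes, x_height):
    """Classify vertical strokes by their extent relative to the minim
    body (0 = baseline, x_height = top of a minim) and return the mean
    ascender and descender heights as multiples of the x-height."""
    ascenders = [top - x_height for top, bottom in strokes if top > x_height]
    descenders = [-bottom for top, bottom in strokes if bottom < 0]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {
        "ascender_ratio": mean(ascenders) / x_height,
        "descender_ratio": mean(descenders) / x_height,
    }

# Hypothetical stroke extents measured from one charter line
strokes = [(10, 0), (10, 0),   # minims (e.g. 'i', 'n')
           (22, 0), (18, 0),   # ascenders (e.g. 'b', 'h')
           (10, -8)]           # descender (e.g. 'p')
props = script_proportions(strokes, x_height=10)
```

A per-charter feature vector of such ratios (computed after discarding the first and last lines, as the abstract notes) is the kind of quantity that can then be clustered for scribal attribution.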

One of the problems when investigating individual scribal habits in medieval documents is that we rarely know for certain who produced them, which makes evaluation difficult. In most cases, the scribe of a given document is identified through a process of scribal attribution, usually based on palaeographical and linguistic evidence. In an investigation of individual scribal features, it is not desirable to evaluate the results on the basis of previous attributions. Ideally, the evaluation should be done on charters where the identity of the scribe can be established from external features, where his/her identity is in fact known. For this purpose, we have identified a set of charters where this is actually the case, namely where the scribe himself/herself explicitly states that he/she held the pen (in our corpus, there are only male scribes). These charters contain a so-called scribal note, containing the formula ego X scripsi (‘I X wrote’), accompanied by a symbol unique to this specific scribe. One such scribe is Peter Tidikesson, who produced 13 charters with such a scribal note in the period 1432–1452, and another is Peter Svensson, who produced six charters in the period 1433–1453. This selection of charters is the means by which the otherwise big-data-focused, computer-aided methods can be evaluated from a qualitative perspective. This step of evaluation is crucial in order for the results to become accessible and useful for the users of the information gained.

Mårtensson-Extracting script features from a large corpus of handwritten documents-250_a.pdf

2:30pm - 3:00pm
Long Paper (20+10min) [abstract]

Text Reuse and Eighteenth-Century Histories of England

Ville Vaara1, Aleksi Vesanto2, Mikko Tolonen1

1University of Helsinki; 2University of Turku



What kind of history is Hume’s History of England? Is it an impartial account or is it part of a political project? To what extent was it influenced by seventeenth-century Royalist authors? These questions have been asked since the first Stuart volumes were published in the 1750s. The consensus is that Hume’s use of Royalist sources left a crucial mark on his historical project. However, as Mark Spencer notes, Hume did not only copy from Royalists or Tories. One aim of this paper is to weigh these claims against our evidence about Hume’s use of historical sources. To do this we qualified, clustered and compared 129,646 instances of text reuse in Hume’s History. Additionally, we are able to compare Hume’s History of England to other similar undertakings in the eighteenth century and get an accurate view of their composition. We aim to extend the discussion on Hume's History in the direction of applying computational methods to understanding the writing of histories of England in the eighteenth century as a genre.

This paper contributes to the overall development of Digital Humanities by demonstrating how digital methods can help develop and move forward the discussion in an existing research case. We do not limit ourselves to general method development, but rather contribute to the specific discussions on Hume’s History and the study of eighteenth-century histories.

Methods and sources


This paper advances our understanding of the composition of Hume’s History by examining the direct quotes in it based on data in Eighteenth-Century Collections Online (ECCO). It should be noted that ECCO also includes central seventeenth-century histories and other important documents reprinted later. Thus, we do not only include eighteenth-century sources but, for example, Clarendon, Rushworth and other notable seventeenth-century historians. We compare the phenomenon of text reuse in Hume’s History to that in the works of Rapin, Guthrie and Carte, all prominent historians at the time. To our knowledge, this kind of text mining effort has not previously been done in the field of historiography.

Our base-text for Hume is the 1778 edition of History of England. For Paul de Rapin we used the 1726-32 edition of his History of England. For Thomas Carte the source was the 1747-1755 edition of his General History of England. And for William Guthrie we used the 1744-1751 edition of his History of Great Britain.

As a starting point for our analysis, we used a dataset of linked text-reuse fragments found in ECCO. The basic idea was to create a dataset that identifies similar sequences of characters (from circa 150 to more than 2000 characters each) instead of trying to match individual characters or tokens/words. This helped with the optical character recognition problems that plague ECCO. The methodology has previously been used in matching DNA sequences, where the problem of noisy data is likewise present. We further enriched the results with bibliographical metadata from the English Short Title Catalogue (ESTC). This enrichment allows us to compare publication chronology and locations, and to create rough estimates of first-edition publication dates.
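The character-sequence matching idea (finding long shared substrings rather than aligning individual tokens, so that OCR noise around a fragment matters less) can be illustrated with Python's `difflib`. This is only a small analogue of the approach, not the DNA-sequencing-derived software the project actually used, and the sample texts are invented:

```python
from difflib import SequenceMatcher

def shared_fragments(a, b, min_len=30):
    """Return substrings of at least `min_len` characters that two
    texts share, as (offset_a, offset_b, text) triples."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    return [(m.a, m.b, a[m.a:m.a + m.size])
            for m in sm.get_matching_blocks() if m.size >= min_len]

# Toy example: a 'history' quoting a 'source' with differing context around it
source = "the king summoned the parliament to westminster in great haste and anger"
history = ("as rushworth relates, the king summoned the parliament "
           "to westminster in great haste, whereupon")
frags = shared_fragments(source, history)
```

A production pipeline would additionally tolerate small character-level mismatches inside a fragment (as sequence-alignment tools do), which is what makes the method robust to ECCO's OCR errors.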

There is no ready-to-use gold standard for text reuse cluster detection. Therefore, we compared our clusters against the critical edition of the Enquiry concerning Human Understanding (EHU) to see whether the text reuse cases of Hume’s Treatise in the EHU are also identified by our method. The results show that we were able to identify all cases included in the EHU except those in footnotes. Because some of the changes that Hume made from the Treatise to the EHU are not evident, this is very promising.



To give a general overview of Hume’s History in relation to the other works considered, we compared their respective volumes of source text reuse (figure 1). The comparison reveals some fundamental stylistic and structural differences. Hume’s and Carte’s Histories are composed quite differently from Rapin’s and Guthrie’s, which have roughly three times more reused fragments: Rapin typically opens a chapter with a long quote from a source document and moves on to discuss the related historical events. Guthrie writes similarly, quoting long passages from sources of his choice. Hume is different: his quotes are more evenly spread, and a greater proportion of the text seems to be his own original formulations.

[Figure 1.]

Change in text reuse in the Histories


All the histories of England considered in our analysis are massive works, comprising multiple separate volumes. The number of reused text fragments found in these volumes differs significantly, but the trends are roughly similar. The common overall feature is a rise in the frequency of direct quotes in later volumes.

The increase in text reuse peaks in the volumes covering the reign of Charles I and the events of the English Civil War, but with respect to both Hume and Rapin (figures 2 & 3), the highest peak is not at the end of Charles’s reign but in the lead-up to the confrontation with the parliament. In Guthrie and Carte (figures 4 & 5) the peaks are located in the final volume. Except for Guthrie, all the other historical works considered here have their highest reuse rates located around the reign of Charles I, a period that was an intensely debated topic among Hume’s contemporaries.

[Figure 2.]

[Figure 3.]

[Figures 4, 5.]

We can further break down the sources of reused text fragments by the political affiliation of their authors (figure 6). A significant portion of the detected text reuse cases in Hume link to authors with no strong political leaning in the wider Whig-Tory context. It is obvious that serious antiquary work that is politically neutral forms the main body of seventeenth-century historiography in England. In the later volumes, the number of text reuse cases tracing back to authors with a political affiliation increases, as might be expected with more heavily politically loaded topics.

[Figure 6.]

Taking an overview of the authors of the text reuse fragments in Hume’s History (figure 7), we note that the statistics are dominated by a handful of writers, with a long “tail” of others whose use is limited to a few fragments. Both groups, the Whig and Tory authors, feature a few “main sources” for Hume. John Rushworth (1612-1690) emerges as the most influential source, followed closely by Edward Hyde Clarendon (1609-1674). Both Rushworth and Clarendon had reached a position of prominence as historians and were among the best known and respected sources available when Hume was writing his own work. We might even question if their use was politically colored at all, as practically everyone was using their works, regardless of political stance.

[Figure 7.]

Charles I execution and Hume’s impartiality


A relatively limited list of authors is responsible for the majority of the text fragments in Hume's History. As one might intuitively expect, the use of particular authors is concentrated in particular chapters. In general, the unevenness in the use of quotes can be seen as more of a norm than an exception.

However, there is at least one central chapter in Hume’s Stuart history that breaks this pattern: Chapter LIX, perhaps the most famous chapter in the whole work, covering the execution of Charles I. Nineteenth-century Whig commentators argued, with great enthusiasm, that Hume’s use of sources, especially in this particular chapter and in his description of Charles’s execution, followed Royalist sources and the Jacobite Thomas Carte in particular. Our data, however, show a more carefully balanced use of sources in this particular chapter, revealing a clear intention to be (or to appear to be) impartial on this specific topic (figure 8).

Of course, there is John Stuart Mill’s claim that Hume only uses Whigs when they support his Royalist bias. In the light of our data, this seems unlikely. If we compare Hume's use of Royalist sources in his treatment of the execution of Charles I to Carte's, Carte’s use of Royalists is, statistically, off the chart, whereas Hume’s is aligned with his use of Tory sources elsewhere in the volume.

[Figure 8.]

Hume’s influence on later Histories

- ----

A final area of interest in terms of text reuse is what it can tell us about an author’s influence on later writers. The reuse totals of Hume’s History in works following its publication are surprisingly evenly spread out over all the volumes (figure 9), and in this respect differ from the other historians considered here (figures 10 - 12). The only exception is the last volume where a drop in the amount of detected reuse fragments can be considered significant.

Of all the authors, only Hume shows significant reuse arising from the volumes discussing the Civil War. The reception of Hume’s first Stuart volume, the first published volume of his History, is well known. It is notable that the next volumes published, that is the following Stuart volumes, possibly written with the angry reception of the first Stuart volume in mind, are the ones that seem to have given rise to the least discussion.

[Figure 9.]

[Figure 10.]

[Figures 11 & 12.]


- ----

Original sources

- ----

Eighteenth-century Collections Online (GALE)

English Short-Title Catalogue (British Library)

Thomas Carte, General History of England, 4 vols., 1747-1755.

William Guthrie, History of Great Britain, 3 vols., 1744-1751.

David Hume, History of England, 8 vols., 1778.

David Hume, Enquiry concerning Human Understanding, ed. Tom L. Beauchamp, OUP, 2000.

Paul de Rapin, History of England, 15 vols., 1726-32.

Secondary sources

- ----

Herbert Butterfield, The Englishman and his history, 1944.

John Burrow, Whigs and Liberals: Continuity and Change in English Political Thought, 1988.

Duncan Forbes, Hume’s Philosophical Politics, Cambridge, 1975.

James Harris, Hume. An intellectual biography, 2015.

Colin Kidd, Subverting Scotland's Past. Scottish Whig Historians and the Creation of an Anglo-British Identity 1689–1830, Cambridge, 1993.

Royce MacGillivray, ‘Hume's "Toryism" and the Sources for his Narrative of the Great Rebellion’, Dalhousie Review, 56, 1987, pp. 682-6.

John Stuart Mill, ‘Brodie’s History of the British Empire’, in Robson et al. (eds), Collected Works, vol. 6, pp. 3-58.

Ernest Mossner, ‘Was Hume a Tory Historian?’, Journal of the History of Ideas, 2, 1941, pp. 225-236.

Karen O’Brien, Narratives of Enlightenment: Cosmopolitan History from Voltaire to Gibbon, CUP, 1997.

Laird Okie, ‘Ideology and Partiality in David Hume's History of England’, Hume Studies, vol. 11, 1985, pp. 1-32.

Francis Palgrave, ‘Hume and his Influence upon History’, in vol. 9 of Collected Historical Works, ed. R. H. Inglis Palgrave, 10 vols., CUP, 1919-22.

John Pocock, Barbarism and religion, vols. 1-2.

B. A. Ring, ‘David Hume: Historian or Tory Hack?’, North Dakota Quarterly, 1968, pp. 50-59.

Claudia Schmidt, Reason in history, 2010.

Mark Spencer, ‘David Hume, Philosophical Historian: “contemptible Thief” or “honest and industrious Manufacturer”?’, Hume Conference, Brown, 2017.

Vesanto, Nivala, Salakoski, Salmi & Ginter: A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora. Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22–24 May 2017, Gothenburg, Sweden.

Vaara-Text Reuse and Eighteenth-Century Histories of England-222_a.pdf
Vaara-Text Reuse and Eighteenth-Century Histories of England-222_c.pdf

3:00pm - 3:30pm
Long Paper (20+10min) [abstract]

Refutatio errorum – authorship attribution on a late-medieval antiheretical treatise

Reima Välimäki

University of Turku, Cultural history

Since Peter Biller’s attribution of the Cum dormirent homines (1395) to Petrus Zwicker, perhaps the most important late medieval inquisitor prosecuting Waldensians, the treatise has become a standard source on late medieval German Waldensianism. There is, however, another treatise, known as the Refutatio errorum, which has gained far less attention. In my dissertation (2016) I proposed that the similarities in style, contents, manuscript tradition and composition between the Refutatio errorum and the Cum dormirent homines are so remarkable that Petrus Zwicker can be confirmed as the author of both texts. The Refutatio exists in four different redactions. However, the redaction edited by J. Gretser in the 17th century, and consequently used by modern scholars, does not correspond to the earlier and more popular redaction found in the majority of preserved manuscripts.

In the proposed paper I will add a new element of verification to Zwicker’s authorship: machine-learning-based computational authorship attribution, applied in the digital humanities consortium Profiling Premodern Authors (University of Turku, 2016–2019). In its simplest form, authorship attribution is a binary classification task based on textual features (word uni-/bi-grams, character n-grams). In our case, the classes are “Petrus Zwicker” (based on features from his known treatise) and “not-Zwicker”, based on features from a background corpus consisting of medieval Latin polemical treatises, sermons and other theological works. The test cases are the four redactions of the Refutatio errorum. The classifiers used include a linear Support Vector Machine and a more complex Convolutional Neural Network. Researchers from the Turku NLP group (Aleksi Vesanto, Filip Ginter, Sampo Pyysalo) are responsible for the computational analysis.
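
As a toy illustration of this binary set-up (not the project’s actual pipeline: the Latin snippets, the nearest-profile decision rule and the cosine measure below are illustrative assumptions, standing in for the SVM and CNN classifiers named above), character n-gram profiles can separate a known-author sample from a background corpus:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Count overlapping character n-grams after whitespace normalization."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def attribute(test_text, author_profile, background_profile, n=3):
    """Binary attribution: assign the test text to the closer n-gram profile."""
    t = char_ngrams(test_text, n)
    return ("author" if cosine(t, author_profile) >= cosine(t, background_profile)
            else "background")
```

In the real study the feature vectors would be fed to trained classifiers with proper cross-validation; the sketch only shows how character n-grams turn the attribution question into a measurable comparison.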

The paper contributes to the conference theme History. It aims to bridge the gap between authorship attribution based on qualitative analysis (e.g. contents, manuscript tradition, codicological features, palaeography) and computational stylometry. Computational methods are treated as one tool that contributes to the difficult task of recognising authorship in a medieval text. The study of author profiles of four different redactions of a single work contributes to the discussions on scribes, secretaries and compilers as authors of medieval texts (e.g. Reiter 1996, Minnis 2006, Connolly 2011, Kwakkel 2012, De Gussem 2017).


Biller, Peter. “The Anti-Waldensian Treatise Cum Dormirent Homines of 1395 and its Author.” In The Waldenses, 1170-1530: Between a Religious Order and a Church, 237–69. Variorum Collected Studies Series. Aldershot: Ashgate, 2001.

Connolly, Margaret. “Compiling the Book.” In The Production of Books in England 1350-1500, edited by Alexandra Gillespie and Daniel Wakelin, 129–49. Cambridge Studies in Palaeography and Codicology 14. Cambridge ; New York: Cambridge University Press, 2011.

De Gussem, Jeroen. “Bernard of Clairvaux and Nicholas of Montiéramey: Tracing the Secretarial Trail with Computational Stylistics.” Speculum 92, no. S1 (2017): S190–225.

Kwakkel, Erik. “Late Medieval Text Collections. A Codicological Typology Based on Single-Author Manuscripts.” In Author Reader Book: Medieval Authorship in Theory and Practice, edited by Stephen Partridge and Erik Kwakkel, 56–79. Toronto: University of Toronto Press, 2012.

Reiter, Eric H. “The Reader as Author of the User-Produced Manuscript: Reading and Rewriting Popular Latin Theology in the Late Middle Ages.” Viator 27, no. 1 (1996): 151–70.

Minnis, A. J. “Nolens Auctor Sed Compilator Reputari: The Late-Medieval Discourse of Compilation.” In La Méthode Critique Au Moyen Âge, edited by Mireille Chazan and Gilbert Dahan, 47–63. Bibliothèque d’histoire Culturelle Du Moyen âge 3. Turnhout: Brepols, 2006.

Välimäki, Reima. “The Awakener of Sleeping Men. Inquisitor Petrus Zwicker, the Waldenses, and the Retheologisation of Heresy in Late Medieval Germany.” PhD Thesis, University of Turku, 2016.

Välimäki-Refutatio errorum – authorship attribution on a late-medieval antiheretical treatise-195_a.pdf
2:00pm - 3:30pmT-P674-2: Crowdsourcing and Collaboration
Session Chair: Hannu Salmi
2:00pm - 2:30pm
Long Paper (20+10min) [abstract]

From crowdsourcing cultural heritage to citizen science: how the Danish National Archives’ 25-year-old transcription project is meeting digital historians

Barbara Revuelta-Eugercios1,2, Nanna Floor Clausen1, Katrine Tovgaard-Olsen1

1Rigsarkivet (Danish National Archives); 2Saxo Institute, University of Copenhagen

The Danish National Archives have the oldest crowdsourcing project in Denmark, with more than 25 million records transcribed that illuminate the lives and deaths of Danes since the early 18th century. Until now, the main groups interested in creating and using these resources have been amateur historians and genealogists. However, it has become clear that the material also holds immense value for historians armed with new digital methods. The rise of citizen science projects likewise shows an alternative way, with clear research purposes, of using the crowdsourcing of cultural heritage material. How can the traditional crowd-centered approach of the existing projects, to the extent that we can talk about co-creation, be reconciled with the narrowly defined research questions and methodological decisions that researchers require? How can the use of these materials by digital historians be increased without losing the projects’ core users?

This article articulates how the Danish National Archives are answering these questions. In the first section, we discuss the tensions and problems of combining crowdsourcing digital heritage and citizen science; in the second, the implications of the crowd-centered nature of the project in the incorporation of research interests; and in the third one, we present the obstacles and solutions put in place to successfully attract digital historians to work on this material.

Crowdsourcing cultural heritage: for the public and for the humanists

In recent decades, GLAMs (galleries, libraries, archives and museums) have embarked on digitization projects to broaden the access to, dissemination and appeal of their collections, as well as to enrich them in different ways (tagging, transcribing, etc.), as part of their institutional missions. Many of these efforts have included audience or community participation, which can be loosely defined as crowdsourcing, or as activities that predate or conform to the standard definition of crowdsourcing, taking Howe’s (2006) business-related definition as “the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call” (Ridge 2014). However, the key feature that differentiates these crowdsourcing cultural heritage projects is that the work the crowd performs has never been undertaken by employees. Instead, volunteers co-create new ways for the collections to be made available, disseminated, interpreted, enriched and enjoyed that could never have been paid for within institutional budgets.

These projects often feature “the crowd” at both ends of the process: volunteers contribute to improving access to and availability of the collections – transcribing records, letters and menus, tagging images, digitizing new material, etc. – which in turn benefits the general public from which the volunteers are drawn. In the process, access to digital cultural heritage material is democratized and facilitated. As a knock-on effect, the research community can also benefit, as the new materials open up possibilities for researchers in the digital humanities: generally financially limited humanities projects could never achieve the transcription of millions of records on their own.

At the same time, there has been a strand of academic applications of crowdsourcing in humanities projects (Dunn and Hedges 2014). These initiatives fall within so-called citizen science projects, which are driven by researchers and narrowly defined to answer a research question, so that the tasks performed by the volunteers are aligned with a research purpose. Citizen science, or public participation in scientific research, which emerged out of natural sciences projects in the mid-1990s (Bonney et al. 2009), has branched out to meet the humanities, building on a similar utilization of the crowd, i.e. institutional digitization projects of cultural heritage material. In particular, archival material has been a rich source for such endeavours: weather observations from ship logs in Old Weather (Blaser 2014), Bentham’s works in Transcribe Bentham (Causer & Terras 2014) or restaurant menus in What’s on the Menu (2014). While some of these have been carried out in cooperation with the GLAMs responsible for the collections, the new opportunities opened up for the digital humanities allow such projects to be carried out by researchers independently of the institutions that host the collections – missing a great opportunity to combine interests and avoid duplicating work.

Successfully bringing a given project to contribute both to crowdsourcing cultural heritage material and to citizen science faces many challenges. First, a collaboration needs to be established across at least two institutional settings – a GLAM and a research institution – that have very different institutional aims, funding, cultures and legal frameworks. GLAMs’ foundational missions often relate to serving the public in general first, the research community being only a tiny percentage of their users. Any institutional research they undertake on the collections is restricted to particular areas or aspects of the collections and institutional interest, which, on the other hand, is less dependent on external funding. The world of academia, on the other hand, has a freer approach to formulating research questions but is often burdened with short-term positions and projects, time constraints, a need for immediacy of publication and the ever-present demand to prove originality and innovation.

Additionally, when moving from cultural heritage dissemination to research applications, a wide set of issues also comes into view in these crowdsourcing works that can determine their development and success: the boundaries between professional and lay expertise, the balance of power in the collaboration between the public, institutions and researchers, ethical concerns in relation to data quality and data ownership, etc. (Riesch and Potter 2014, Shirk et al. 2012).

The Danish National Archives’ crowd-centered, 25-year-old crowdsourcing approach

In this context, the Danish National Archives are dealing with the challenge of how to incorporate a more citizen-science-oriented approach and attract historians (and digital humanists) to work with the existing digitized sources while maintaining their commitment to the volunteers. This challenge is particularly difficult in this case because not only must the interests of the archives and researchers align, but also those of the “crowd” itself, as volunteers have played a major role in co-creating the crowdsourcing for 25 years.

The original project, now the Danish Demographic Database (DDD), is the oldest “crowdsourcing project” in the country. It started in 1992 thanks to the interest of the genealogical communities in coordinating the transcription of historical censuses and church books (Clausen & Jørgensen 2000). From its beginning, the volunteers were actively involved in the decision-making process of what was to be done and how, while the Danish National Archives (Rigsarkivet) were in charge of coordination, management and dissemination. Thus, there has been a dual governance of the project and a continuous negotiation of priorities in the form of a coordination committee, which combines members of the public and genealogical societies as well as Rigsarkivet personnel.

This tradition of co-creation has shaped the current state of the project and its relationship to research. The subsequent Crowdsourcing portal (CS), which started in 2014 with an online interface, broadened the sources under transcription and the engagement with volunteers (in photographing, counselling, etc.), and maintains a strong philosophy of serving the volunteers’ wishes and interests rather than imposing particular lines. Crowdsourcing is seen as more than a framework for creating content: it is also a form of engagement with the collections that benefits both audiences and the archive. However, the portal has also introduced some citizen-science projects in which the transcriptions are intended to be used for research (e.g. the Criminality History project).

Digital history from the crowdsourced material: present and future

In spite of the largely crowd-oriented nature of this crowdsourcing project, there were also broad research interests (if not a clearly defined research project) behind the birth of the DDD, so the decisions taken in its setup ensured that the data was suitable for research. Dozens of projects and publications have made use of it, applying new digital history methods, and the data has been included in international efforts such as the North Atlantic Population Project.

However, while amply known in genealogist and amateur historian circles, the Danish National Archives’ large crowdsourcing projects are still either unknown to or little used by historians and students in the country. Some of the reasons are related to field-specific developments, but one of the key constraints on wider use is, undoubtedly, the lack of adequate training. There is no systematic training in dealing with historical data or digital methods in History degrees, even as we are witnessing a clear rise of the digital humanities.

In this context, the Danish National Archives are trying to place their material in the hands of more digital historians, building bridges to the Danish universities by multiple means: collaboration with universities in seeking joint research projects and applications (the SHIP and Link Lives projects); active dissemination of the material for educational purposes across disciplines (the Supercomputer Challenge at the University of Southern Denmark); addressing the lack of training and familiarity of students and researchers through targeted workshops and courses, including training in digital history methods (Rigsarkivet’s Digital History Labs); and promotion of an open dialogue with researchers to identify more sources that could combine the aims of access democratization and citizen science.


Blaser, L., 2014 “Old Weather: approaching collections from a different angle” in Ridge (ed) Crowdsourcing our Cultural Heritage, Ashgate, 45-56.

Bonney et al. 2009. Public Participation in Scientific Research: Defining the Field and Assessing Its Potential for Informal Science Education. Center for Advancement of Informal Science Education (CAISE), Washington, DC

Clausen, N.C. and Marker, H.J., 2000, ”The Danish Data Archive” in Hall, McCall, Thorvaldsen (eds), International Historical Microdata for Population Research, Minnesota Population Center, Minneapolis, Minnesota, 79-92.

Causer, T. and Terras, M. 2014, ”‘Many hands make light work. Many hands together make merry work’: Transcribe Bentham and crowdsourcing manuscript collections”, in Ridge (ed) Crowdsourcing our Cultural Heritage, Ashgate, 57-88.

Dunn, S. and Hedges, M. 2014, “How the crowd can surprise us: Humanities crowd-sourcing and the creation of knowledge”, in Ridge (ed) Crowdsourcing our Cultural Heritage, Ashgate, 231-246.

Howe, J. 2006, “The rise of crowdsourcing”, Wired, June.

Ridge, M. 2014, “Crowdsourcing our cultural heritage: Introduction”, in Ridge (ed) Crowdsourcing our Cultural Heritage, Ashgate, 1-16.

Riesch, H., Potter, C., 2014. Citizen science as seen by scientists: methodological, epistemological and ethical dimensions. Public Understanding of Science 23 (1), 107–120

Shirk, J.L et al, 2012. Public participation in scientific research: a framework for deliberate design. Ecology and Society 17 (2),

Revuelta-Eugercios-From crowdsourcing cultural heritage to citizen science-197_a.pdf
Revuelta-Eugercios-From crowdsourcing cultural heritage to citizen science-197_c.pdf

2:30pm - 2:45pm
Short Paper (10+5min) [abstract]


Jānis Daugavietis, Rita Treija

Institute of Literature, Folklore and Art - University of Latvia

The survey method, using a questionnaire to acquire different kinds of information from a population, is an old and classic way to collect data. Examples of such surveys can be traced back to ancient civilizations, such as censuses or standardised agricultural data recordings. The main instrument of this method is the question (closed-ended or open-ended), which should be asked in exactly the same way to all representatives of the surveyed population. During the last 20-25 years the internet survey method (also called web, electronic, online, or CAWI [computer-assisted web interview]) has become well developed and is more and more frequently employed in the social sciences and in marketing research, among others. Usually CAWI is designed for acquiring quantitative data, but as with the other most used survey modes (face-to-face paper-assisted, telephone or mail interviews) it can be used to collect qualitative data, such as un- or semi-structured text/speech, pictures, sounds, etc.

In recent years DH (digital humanities) has started to use CAWI-like methodology more often. At the same time, the knowledge of humanities scholars in this field is somewhat limited (because of a lack of previous experience and, in many cases, education: humanities curricula usually do not include quantitative methods). The paper seeks to analyze the specificity of CAWI designed for the needs of DH, where the goal of interaction with respondents is to acquire primary data (e.g. questioning/interviewing them on a certain topic in order to create a new data set/collection).

Questionnaires as an approach for collecting data on traditional culture date back to an early stage of the disciplinary history of Latvian folkloristics, namely, to the end of the 19th century and the beginning of the 20th century (published by Dāvis Ozoliņš, Eduard Wolter, Pēteris Šmits, Pēteris Birkerts). The Archives of Latvian Folklore was established in 1924. Its founder and first Head, the folklorist and schoolteacher Anna Bērzkalne, regularly addressed questionnaires (jautājumu lapas) on various topics of Latvian folklore to the Archives’ collaborators. She both created original sets of questions herself and translated into Latvian and adapted those by Estonian and Finnish folklore scholars (instructions for collecting children’s songs by Walter Anderson; questionnaires on folk beliefs by O. A. F. Mustonen, alias Oskar Anders Ferdinand Lönnbohm, and Viljo Johannes Mansikka). The localised equivalents were published in the press and distributed to Latvian collectors. Printed questionnaires, such as “House and Household”, “Fishing and Fish”, “Relations between Relatives and Neighbors” and others, presented sets of questions formulated in a suggestive way, so that everyone who had some interest could easily engage in the work. The hand-written responses were sent to the Archives of Latvian Folklore from all regions of the country; the collection of folk beliefs in the late 1920s greatly supplemented the range of materials at the Archives.

However, the life of the survey as a method of collecting folklore in Latvia did not last long. Soon after World War II it was overtaken by the dominance of collective fieldwork and, at the end of the 20th century, by individual field research, implying mainly face-to-face qualitative interviews with informants.

Only in 2017 did the Archives of Latvian Folklore revitalize the approach of remote data collecting via online questionnaires. Within the project “Empowering knowledge society: interdisciplinary perspectives on public involvement in the production of digital cultural heritage” (funded by the European Regional Development Fund), a virtual inquiry module has been developed. The working group on virtual ethnography launched a series of online surveys aimed at studying the calendric practices of individuals in the 21st century. Along with working out the iterative inquiry, data accumulation and analysis tools, the researchers have tried to find solutions to the technical and ethical challenges of our day.

Mathematics, sociology and other sciences have developed a coherent theoretical methodology and have accumulated experience-based knowledge of online survey tools. That raises several questions, such as:

- How much of this knowledge is known by DH?

- How useful is this knowledge for DH? How different is DH CAWI?

- What would be the most important aspects for DH CAWI?

To answer these questions, we will make a schematic comparison of the ‘traditional’ or most common CAWI of the social sciences with that of DH, drawing on our previous experience of work in the fields and institutions of sociology, statistics and the humanities.

Daugavietis-CAWI for DH-165_a.docx
Daugavietis-CAWI for DH-165_c.pdf

2:45pm - 3:00pm
Short Paper (10+5min) [abstract]


Susanna Ånäs

Aalto University


Wikidocumentaries is a concept for a collaborative online space for gathering, researching and remediating cultural heritage items from memory institutions, open platforms and the participants themselves. The setup brings together communities of interest and communities of expertise to work on shared topics with online tools. For the memory organization, Wikidocumentaries offers a platform for crowdsourcing; for amateur and expert researchers it provides peers and audiences; and from the point of view of the open environments, it acts as a site of curation.

Current environments fall short of serving this purpose. Content aggregators focus on gathering, harmonizing and serving the content. Commercial services, in their search for profit, fail to take the open and connected environment into account. Research environments do not prioritize public access and broad participation. Many participatory projects live short lives, from enthusiastic engagement to oblivion, due to a lack of planning for the sustainability of the results. Wikidocumentaries tries to address these challenges.

This short paper is a first attempt at creating an inventory of the research topics that this environment surfaces.

The topics

Technologically, the main focus of the project is investigating the use of linked open data, and especially proposing the use of Wikidata, for establishing meaningful connections across collections and for the sustainability of the collected data.
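
As a sketch of what such Wikidata-based linking could look like (a hypothetical helper, not part of any Wikidocumentaries codebase; P170 “creator” and P195 “collection” are real Wikidata properties, but the query shape is an illustrative assumption):

```python
def collection_link_query(creator_qid, limit=10):
    """Build a SPARQL query finding works by one creator (P170)
    together with the collections (P195) that hold them, so that
    items scattered across institutions connect via shared Wikidata IDs."""
    return f"""
SELECT ?work ?workLabel ?collection ?collectionLabel WHERE {{
  ?work wdt:P170 wd:{creator_qid} ;   # creator of the work
        wdt:P195 ?collection .        # collection holding the work
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()
```

Only the query construction is shown here; the resulting string could be sent to the public Wikidata SPARQL endpoint to retrieve label-resolved rows linking items across collections.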

Co-creation is an important topic in many senses. What are the design issues of an environment meant to encourage collaborative creative work? How can the collaboration reach out from the online environment into communities of interest in everyday life? What are the characteristics of the collaborative creations, and what kind of creative entrepreneurship can such an open environment promote? How can a community of technical contributors for the open environments be fostered and expanded?

The legislative environment sets the boundaries for this work. How will privacy and openness be balanced? Which copyright licensing schemes can encourage the widest participation? Can novel technologies of personal information management be applied to allow wider participation?

The paper will draw together recent observations from a selection of disciplines for practices in creating participatory knowledge environments.


3:00pm - 3:15pm
Short Paper (10+5min) [abstract]

Heritage Here, K-Lab and intra-agency collaboration in Norway

Vemund Olstad, Anders Olsson

Directorate for Cultural Heritage


This paper aims to give an overview of an ongoing collaboration between four Norwegian government agencies by outlining its history, its goals and achievements, and its current status. In doing so, we hope to arrive at some conclusions about the usefulness of the collaboration itself – and about whether anything we have learned during the collaboration can be used as a model for, or an inspiration to, other projects within the cultural heritage sector or the broader humanities environment.

First phase – “Heritage Here” 2012 – 2015

Heritage Here (or “Kultur- og naturreise” as it is known in its native Norwegian) was a national project which ran between 2012 and 2015. The project had two main objectives:

1. To help increase access to and use of public information and local knowledge about culture and nature

2. To promote the use of better quality open data.

The aim was that anyone with a smartphone could gain instant access to relevant facts and stories about their local area, wherever they might be in the country.

The project was the result of cross-agency cooperation between five agencies from three different ministries. Project partners included:

• the Norwegian Mapping Authority (Ministry of Local Government and Modernization).

• the Arts Council Norway and the National Archives (Ministry of Culture).

• the Directorate of Cultural Heritage and (until December 2014) the Norwegian Environment Agency (the Ministry of Climate and Environment).

Together, these partners made their own data digitally accessible, to be enriched, geo-tagged and disseminated in new ways. Content included information about animal and plant life, cultural heritage and historical events, and varied from factual data to personal stories. The content was collected into Norway’s national digital infrastructure ‘Norvegiana’, from where it can be used and developed by others through open and documented APIs to create new services for business, tourism or education. Parts of this content were also exported into the European aggregation service.

In 2012 and 2013 the main focus of the project was to facilitate further development of technical infrastructures - to help extract data from partner databases and other databases for mobile dissemination. However, the project also worked with local partners in three pilot areas:

• Bø and Sauherad rural municipalities in Telemark county

• The area surrounding Akerselva in Oslo

• The mountainous area of Dovre in Oppland county.

These pilots were crucial to the project, both as an arena to test the content from the various national datasets and as a testing ground for user community participation at the local and regional level. They were also an opportunity to see Heritage Here’s work in a larger context. The Telemark pilot, for example, was used to test the cloud-based mapping tools developed in the Best Practice Network “LoCloud”, which was coordinated by the National Archives of Norway.

In addition to the previously mentioned activities, Heritage Here worked towards being a competence builder – organizing over 20 workshops on digital storytelling and geo-tagging of data, and numerous open seminars with topics ranging from open data and LOD to IPR and copyright-related issues. The project also organized Norway’s first heritage hackathon, “#hack4no”, in early 2014. This first hackathon has since become an annual event – organized by one of the participating agencies (the Mapping Authority) – and a great success story, with 50+ participants coming together to create new and innovative services using open public data.

Drawing on the experience the project had gathered, the project focused its final year on developing various web-based prototypes that use a map as the user’s starting point. These demonstrate a number of approaches for visualizing and accessing different types of cultural heritage information from various open datasets – such as content related to a particular area, route or subject. These prototypes are freely and openly accessible as web tools for anyone to use. The code for the prototypes has been made openly available so it can be used by others – either as it is, or as a starting point for something new.

Second phase – “K-Lab” 2016 –>

At the end of 2015, Heritage Here ended as a project, but the four remaining project partners decided to continue their digital cross-agency cooperation. So, in January 2016, a new joint initiative with the same core governmental partners was set up. Heritage Here went from being a project to being a formalized collaboration between four government agencies. This new partnership focuses on some key issues seen as crucial for the further development of the results of the Heritage Here project. Among these are:

• In cooperation develop, document and maintain robust, common and sustainable APIs for the partnership’s data and content.

• Address and discuss the need for, and potential use of, different aggregation services for this field.

• Develop and maintain plans and services for a free and open flow of open and reusable data between and from the four partner organizations.

• In cooperation with other governmental bodies organize another heritage hackathon in October 2016 with the explicit focus on open data, sharing, reuse and new and other services for both the public and the cultural heritage management sector.

• As a partnership develop skillsets, networks, arenas and competence for the employees in the four partner organizations (and beyond) within this field of expertise.

• Continue developing and strengthening partnerships on a local, national and international level through the use of open workshops, training, conferences and seminars.

• Continue to work towards improving data quality and promoting the use of open data.

One key challenge at the end of the Heritage Here project was making the transition from a project group to a more permanent organizational entity – without losing key competence and experience. This was resolved by having each agency employ one person from the project and assign that person to the K-Lab collaboration in a 50% position. The remaining time was to be spent on other tasks for the agency. This helped ensure the following:

• Continuity. The same project group could continue working, albeit organized in a slightly different manner.

• Transfer of knowledge. Competence built during Heritage Here was transferred into the line organizations of the agencies involved.

• Information exchange. By having one employee from each agency meet on a regular basis, information, ideas for common projects and solutions to common problems could easily be exchanged between the collaboration partners.

In addition to the allocation of human resources, each agency chipped in roughly EUR 20,000 as ‘free funds’. The main reasoning behind this approach was to allow the new entity a certain operational freedom and room for creativity – while at the same time tying it closer to the day-to-day running of the agencies.

Based on an evaluation of the results achieved in Heritage Here, the start of 2016 was spent planning the direction forward for K-Lab, and a plan was formulated – outlining activities covering several thematic areas:

Improving data quality and accessibility. Making data available to the public was one of the primary goals of the Heritage Here project, and one of the most important outcomes was the realisation that in all agencies involved there is huge room for improvement in the quality of the data we make available and how we make it accessible. One of K-Lab’s tasks will be to cooperate on making quality data available through well-documented APIs and making sure as much data as possible has open licenses that allow unlimited re-use.

Piloting services. The work done in the last year of Heritage Here with the map service mentioned above demonstrated to all parties involved the importance of actually building services that make use of our own open data. K-Lab will, as part of its scope, function as a ‘sandbox’ both for coming up with new ideas for services and – to the extent that budget and resources allow – for trying out new technologies and services. One such pilot service is the work done by K-Lab – in collaboration with the Estonian Photographic Heritage Society – in setting up a crowdsourcing platform for improving metadata on historic photos.

For 2018, K-Lab will start looking into building a service making use of linked open data from our organizations. All of our agencies are data owners responsible for authority data in some form or another – ranging from geographical names to cultural heritage data and person data. Some work has already been done to bring our technical departments closer together in this field, and we plan to do ‘something’ on a practical level next year.

Building competence. In order to facilitate the exchange of knowledge between the collaboration partners, K-Lab will arrange seminars, workshops and conferences as arenas for discussing common challenges, learning from each other and building networks. This is done primarily to strengthen the relationship between the agencies involved – but many activities will have a broader scope. One such example is the intention to arrange workshops – roughly every two months – on topics that are relevant to our agencies, but that are open to anyone interested. To give a rough overview of the range of topics, these workshops were arranged in 2017:

• A practical introduction to Cidoc-CRM (May)

• Workshop on Europeana 1914-1918 challenge – co-host: Wikimedia Norway (June)

• An introduction to KulturNAV – co-host: Vestfoldmuseene (September)

• Getting ready for #hack4no (October)

• Transkribus – Text recognition and transcription of handwritten text - co-host: The Munch museum (November)

Third phase – 2018 and beyond

K-Lab is very much a work in progress, and the direction it takes in the future depends on many factors. However, a joint workshop was held in September 2017 to evaluate the work done so far and to map out a direction for the future. Employees from all levels of the organisations were present, along with invited guests from other institutions in the cultural sector – such as the National Library and Digisam from Sweden – to evaluate, discuss and suggest ideas.

No definite conclusions were drawn, but there was overall agreement that the focus on the three areas described above is of great importance, and that the work done so far by the agencies together has been, for the most part, successful. Setting up arenas for discussing common problems, sharing success stories and interacting with colleagues across agency boundaries has been a key element in the relative success of K-Lab so far. This work will continue into 2018 with a focus on thematic groups on linked open data and photo archives, and a new series of workshops is being planned. The experimentation with technology will continue, and hopefully new ideas will be brought forward and realised over the course of the next year(s).

Olstad-Heritage Here, K-Lab and intra-agency collaboration-162_a.pdf

3:15pm - 3:30pm
Short Paper (10+5min) [abstract]

Semantic Annotation of Cultural Heritage Content

Uldis Bojārs1,2, Anita Rašmane1

1National Library of Latvia; 2Faculty of Computing, University of Latvia

This talk focuses on the semantic annotation of textual content and on annotation requirements that emerge from the needs of cultural heritage annotation projects. The information presented here is based on two text annotation case studies at the National Library of Latvia and was generalised to be applicable to a wider range of annotation projects.

The two case studies examined in this work are (1) correspondence (letters) from the late 19th century between two of the most famous Latvian poets, Aspazija and Rainis, and (2) a corpus of parliamentary transcripts that document the first four parliament terms in Latvian history (1922–1934).

The first half of the talk focuses on the annotation requirements collected and how they may be implemented in practical applications. We propose a model for representing annotation data and implementing annotation systems. The model includes support for three core types of annotations: simple annotations, which may link to named entities; structural annotations, which mark up portions of a document that have a special meaning within its context; and composite annotations, for more complex use cases. The model also introduces a separate Entity database for maintaining information about the entities referenced from annotations.
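
This three-tier model might be sketched roughly as follows. All class and field names here are illustrative assumptions for the sake of the example, not the authors' actual implementation.

```python
# A rough sketch of the annotation model described above: simple,
# structural and composite annotations plus a separate Entity database.
# All names are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """Named entity kept in the separate Entity database."""
    entity_id: str
    label: str
    same_as: List[str] = field(default_factory=list)  # e.g. Linked Open Data URIs


@dataclass
class Annotation:
    """A span of text [start, end) in a document."""
    doc_id: str
    start: int
    end: int


@dataclass
class SimpleAnnotation(Annotation):
    """May link the annotated span to a named entity."""
    entity_id: Optional[str] = None


@dataclass
class StructuralAnnotation(Annotation):
    """Marks up a portion with special meaning within the document."""
    role: str = ""  # e.g. a salutation in a letter, a speaker turn in a transcript


@dataclass
class CompositeAnnotation(Annotation):
    """Groups other annotations for more complex use cases."""
    parts: List[Annotation] = field(default_factory=list)


# The Entity database is maintained independently of the annotations
# and may be referenced from any number of them.
entities: Dict[str, Entity] = {
    "E1": Entity("E1", "Rainis", same_as=["https://example.org/lod/rainis"]),
}

ann = SimpleAnnotation(doc_id="letter-042", start=10, end=16, entity_id="E1")
print(entities[ann.entity_id].label)  # -> Rainis
```

Keeping entities in their own store, rather than inside each annotation, is what allows the same entity to be referenced from many documents and enriched with Linked Open Data links in one place.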

In the second half of the talk we will present a web-based semantic annotation tool that was developed based on this annotation model and requirements. It allows users to import textual documents (various document formats such as HTML and .docx are supported), create annotations and reference the named entities mentioned in these documents. Information about the entities referenced from annotations is maintained in a dedicated Entity database that supports links between entities and can point to additional information about them, including Linked Open Data resources. Information about these entities is published as Linked Data. Annotated documents may be exported (along with annotation and entity information) in a number of representations, including a standalone web view.

Bojārs-Semantic Annotation of Cultural Heritage Content-264_a.pdf
3:30pm - 4:00pmCoffee break / Surprise Event
Lobby, Porthania
4:00pm - 5:30pmT-PII-3: Augmented Reality
Session Chair: Sanita Reinsone
4:00pm - 4:30pm
Long Paper (20+10min) [abstract]

Extending museum exhibits by embedded media content for an embodied interaction experience

Jan Torpus

University of Applied Sciences and Arts Northwestern Switzerland

Investigation topic

Nowadays, museums not only collect, categorize, preserve and present; they must also educate and entertain, all the while following market principles to attract visitors. To satisfy this mission, they started to introduce interactive technologies in the 1990s, such as multimedia terminals and audio guides, which have since become standard for delivering contextual information. More recently there has been a shift towards the creation of personalized sensorial experiences by applying user tracking and adaptive user modeling based on location-sensitive and context-aware sensor systems with mobile information retrieval devices. However, the technological gadgets and complex graphical user interfaces (GUIs) generate a separate information layer and detach visitors from the physical exhibits. The attention is drawn to the screen, and the interactive technology becomes an element competing with the environment and the exhibited collection [Stille 2003, Goulding 2000, Wakkary 2007]. Furthermore, the vast majority of visitors come in groups, and the social setting is interrupted by the digital information extension [Petrelli 2016].

The first studies of museum visitor behavior were carried out at the end of the 19th and during the 20th century [Robinson 1928, Melton 1972]. More recently, a significant body of ethnographic research on the visitor experience of individuals and groups has contributed studies of technologically extended and interactive installations. Publications about visitor motivation, circulation and orientation, engagement, learning processes, as well as the cognitive and affective relationship to the exhibits are of interest for our research approach [Bitgood 2006, Vom Lehn 2007, Dudley 2010, Falk 2011]. Most relevant are studies by the Human Computer Interaction (HCI) research community in the fields of Ubiquitous Computing, Tangible User Interfaces and Augmented Reality, investigating hybrid exhibition spaces and the bridging of the material and physical with the technologically mediated and virtual [Hornecker 2006, Wakkary 2007, Benford 2009, Petrelli 2016].


At the Institute of Experimental Design and Media Cultures (IXDM) we have conducted several design research projects applying AR for cultural applications, but became increasingly frustrated with disturbing GUIs and physical interfaces such as mobile phones and head-mounted displays. We therefore started to experiment with Ubiquitous Computing, the Internet of Things and physical computing technologies that have become increasingly accessible to the design community over the last twelve years as sensors, actuators and controllers have shrunk in size and price. In the presented research project, we examine the extension of museum exhibits by physically embedded media technologies for an embodied interaction experience. We intend to overcome the problems of distraction, isolation and stifled learning processes caused by artificial GUIs by interweaving mediated information directly into the context of the exhibits and by triggering events according to visitor behavior.

Our research approach was interdisciplinary and praxis-based including the observation of concept, content and design development and technological implementation processes before the final evaluations. The team was composed of two research partners, three commercial/engineering partners and three museums, closely working together on three tracks: technology, design and museology. The engineering partners developed and implemented a scalable distributed hardware node system and a Linux-based content management system. It is able to detect user behavior and accordingly process and display contextual information. The content design team worked on three case studies following a scenario-driven prototyping approach. They first elaborated criteria catalogues, suitable content and scenarios to define the requirement profiles for the distributed technological environment. Subsequently, they carried out usability studies in the Critical Media Lab of the IXDM and finally set up and evaluated three case studies with test persons. The three museums involved, the Swiss Open-Air Museum Ballenberg, the Roman City of Augusta Raurica and the Museum der Kulturen Basel, all have in common that they exhibit objects or rooms that function as staged knowledge containers and can therefore be extended by means of ubiComp technologies. The three case studies were thematically distinct and offered specific exhibition situations:

• Case study 1: Roman City of Augusta Raurica: “The Roman trade center Schmidmatt“. The primary imparting concept was “oral history”, and documentary film served as a related model: An archaeologist present during the excavations acted as a virtual guide, giving visitors information about the excavation and research methods, findings, hypotheses and reconstructions.

• Case study 2: Open-Air Museum Ballenberg: “Farmhouse from Uesslingen“. The main design investigation was “narratives” about the former inhabitants and the main theme “alcohol”: Its use for cooking, medical application, religious rituals and abuse.

• Case study 3: Museum der Kulturen Basel: “Meditation box“. The main design investigation was “visitor participation” with biofeedback technologies.

Technological development

This project entailed the development of a prototype for a commercial hardware and software toolkit for exhibition designers and museums. Our technology partners elaborated a distributed system that can be composed and scaled according to the specific requirements of an exhibition. The system consists of two main parts:

• A centralized database with an online content management system (CMS) to set up and control the main software, node scripts, media content and hardware configuration. After the technical installation it also allows the museums to edit, update, monitor and maintain their exhibitions.

• Different types of hardware nodes that can be extended by specific types of sensors and actuators. Each node, sensor and actuator has its own separate ID; they are all networked together and are therefore individually accessible via the CMS. A node can run on a Raspberry Pi, for example, an FPGA based on Cyclone V or any desktop computer and can thus be adapted to the required performance.

The modular architecture allows for technological adaption or extension according to specific needs. First modules were developed for the project and then implemented according to the case study scenarios.
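
The per-ID addressing scheme described above, where every node, sensor and actuator is individually reachable from the CMS, could be organized along these lines. The names and data structures are our assumptions for illustration, not the project's actual software.

```python
# Illustrative sketch (names are assumptions, not the project's actual API)
# of the addressing scheme described above: every node, sensor and
# actuator carries its own ID and is individually reachable from the CMS.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Device:
    device_id: str
    kind: str  # "sensor" or "actuator"


@dataclass
class Node:
    node_id: str
    platform: str  # e.g. "raspberry-pi", "cyclone-v-fpga", "desktop"
    devices: Dict[str, Device] = field(default_factory=dict)


class Registry:
    """Flat ID -> object lookup, as a CMS might keep for a whole installation."""

    def __init__(self) -> None:
        self.index: Dict[str, object] = {}

    def add_node(self, node: Node) -> None:
        # Register the node and every attached device under its own ID,
        # so the CMS can address any of them directly.
        self.index[node.node_id] = node
        for device in node.devices.values():
            self.index[device.device_id] = device


reg = Registry()
reg.add_node(Node("node-1", "raspberry-pi",
                  {"pir-1": Device("pir-1", "sensor"),
                   "spk-1": Device("spk-1", "actuator")}))
print(type(reg.index["pir-1"]).__name__)  # -> Device
```

A flat registry like this is what makes the system scalable: adding a node to an exhibition only means registering more IDs, without changing how the CMS addresses the existing ones.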

Evaluation methods

Through a participatory design process, we developed a scenario for each case study, suitable for walkthrough with several test persons. Comparable and complementary case study scenarios allowed us to identify risks and opportunities for exhibition design and knowledge transfer and define the tasks and challenges for technical implementation. For the visitor evaluation, we selected end-users, experts and in-house museum personnel. The test persons were of various genders and ages (including families with children), had varying levels of technical understanding and little or no knowledge about the project. For each case study we asked about 12 persons or groups of persons to explore the setting as long as they wanted (normally 10–15 minutes). They agreed to be observed and video recorded during the walkthrough and to participate in a semi-structured interview afterwards. We also asked the supervisory staff about their observations and mingled with regular visitors to gain insight into their primary reactions, comments and general behavior. The evaluation was followed by a heuristic qualitative content analysis of the recorded audio and video files and the notes we took during the interviews. Shortly after each evaluation we presented and discussed the results in team workshops.

Findings and Conclusions

The field work led to many detailed insights about interweaving interactive mediated information directly into the context of physical exhibits. The findings are relevant for museums, design researchers and practitioners, the HCI community and technology developers. We organized the results along five main investigation topics:

1. Discovery-based information retrieval

Unexpected ambient events generate surprise and strong experiences but also contain the risk of information loss if visitors do not trigger or understand the media aids. The concept of unfolding the big picture by gathering distributed, hidden information fragments requires visitor attentiveness. Teasing, timing and the choice of location are therefore crucial to generate flowing trajectories.

2. Embodied interaction

The ambient events are surprising, but visitors are not always aware of their interactions. The unconscious mode of interaction lacks obvious interaction feedback, yet introducing marked hotspots or modes of interaction would destroy the essence of the project’s approach. The fact that visitors do not have to interact with technical devices or learn how to operate graphical user interfaces means that no user groups are excluded from the experience and information retrieval.

3. Non-linear contextual information accumulation

When deploying this project’s approach as a central exhibition concept, information needs to be structured hierarchically. Text boards or info screens are still a good solution for introducing visitors to the ways they can navigate the exhibition. The better the basic topics and situations are initially introduced, the more freedom emerges for selective and memorable knowledge staged in close context to the exhibits.

4. Contextually extended physical exhibits

A crucial investigation topic was the correlation between the exhibit and the media extension. We therefore declined concepts that would overshadow the exhibition and use it merely as a stage for storytelling with well-established characters or as an extensive media show. The museums requested that media content fade in only briefly when someone approached a hotspot, and that there be no technical interfaces or screens for projections that challenged the authenticity of the exhibits. We also discussed to what extent the physical exhibit should be staged to bridge the gap to the media extension.

5. Invisibly embedded technology

The problem of integrating sensors, actuators and controllers into cultural heritage collections was a further investigation topic. We used no visible displays to leave the exhibition space as pure as possible and investigated the applicability of different types of media technologies.

Final conclusion

Our museum partners agreed that the approach should not be implemented as a central concept and dense setting for an exhibition. As often propagated by exhibition makers, first comes the well-researched and elaborated content and carefully constructed story line, and only then the selection of the accurate design approach, medium and form of implementation. This rule also seems to apply to ubiComp concepts and technologies for knowledge transfer. The approach should be applied as a discreet additional information layer or just as a tool to be used when it makes sense to explain something contextually or involve visitors emotionally.


Steve Benford et al. 2009. From Interaction to Trajectories: Designing Coherent Journeys Through User Experiences. Proc. CHI ’09, ACM Press. 709–718.

Stephen Bitgood. 2006. An Analysis of Visitor Circulation: Movement Patterns and the General Value Principle. Curator: The Museum Journal, Volume 49, Issue 4, 463–475.

John Falk. 2011. Contextualizing Falk’s Identity-Related Visitor Motivational Model. Visitor Studies, 14, 2, 141–157.

Sandra Dudley. 2010. Museum materialities: Objects, sense and feeling. In Dudley, S. (ed.) Museum Materialities: Objects, Engagements, Interpretations. Routledge, UK, 1-18.

Christina Goulding. 2000. The museum environment and the visitor experience. European Journal of marketing 34, no. 3/4, pp. 261-278.

Eva Hornecker and Jacob Buur. 2006. Getting a Grip on Tangible Interaction: A Framework on Physical Space and Social Interaction. CHI, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 437-446.

Dirk vom Lehn, Jon Hindmarsh, Paul Luff, Christian Heath. 2007. Engaging Constable: Revealing art with new technology. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '07), 1485-1494.

Arthur W. Melton. 1972. Visitor behavior in museums: Some early research in environmental design. In Human Factors. 14(5): 393-403.

Edward S. Robinson. 1928. The behavior of the museum visitor. Publications of the American Association of Museums, New Series, Nr. 5. Washington D.C.

Daniela Petrelli, Nick Dulake, Mark T. Marshall, Anna Pisetti, Elena Not. 2016. Voices from the War: Design as a Means of Understanding the Experience of Visiting Heritage. Proceedings Human-Computer Interaction, San Jose, CA, USA.

Alexander Stille. 2003. The future of the past. Macmillan. Pan Books Limited.

Ron Wakkary and Marek Hatala. 2007. Situated play in a tangible interface and adaptive audio museum guide. Published online: 4 November 2006. Springer-Verlag London Limited.

Torpus-Extending museum exhibits by embedded media content-179_a.pdf

4:30pm - 5:00pm
Long Paper (20+10min) [abstract]

Towards an Approach to Building Mobile Digital Experiences For University Campus Heritage & Archaeology

Ethan Watrall

Michigan State University

The spaces we inhabit and interact with on a daily basis are made up of layers of cultural activity that are, quite literally, built up over time. While museum exhibits, archaeological narratives, and public programs communicate this heritage, they often don’t allow the public to experience interactive, place-based, and individually driven exploration of content and spaces. Further, designers of public heritage and archaeology programs rarely explore the dual nature of the presented content and the scholarly process by which the understanding of that content was reached. In short, the scholarly narrative of material culture, heritage, and archaeology is often hidden from public exploration, engagement, and understanding. Additionally, many traditional public heritage and archaeology programs find it challenging to negotiate the balance between the voice and goals of the institution and those of communities and groups. In recent years, the maturation of mobile and augmented reality technology has provided heritage institutions, sites of memory and memorialization, cultural landscapes, and archaeological projects with interesting new avenues to present research and engage the public. We are also beginning to see exemplar projects that suggest fruitful models for moving the domain of mobile heritage forward considerably.

University campuses provide a particularly interesting venue for leveraging mobile technology in the pursuit of engaging, place-based heritage and archaeology experiences. University campuses are usually already well-traveled public spaces, and therefore don’t elicit the same level of concern that you might find in other contexts about publicly providing the location of archaeological and heritage sites and resources. They have a built-in audience of alumni and students eager to better understand the history and heritage of their home campus. Finally, many university campuses are starting to seriously think of themselves as places of heritage and memory, and are developing strategies for researching, preserving, and presenting their own cultural heritage and archaeology.

It is within this context that this paper will explore a deeply collaborative effort at Michigan State University that leverages mobile technology to build an interactive and place-based interpretive layer for campus heritage and archaeology. Driven by the work of the Michigan State University Campus Archaeology Program, an internationally recognized initiative that is unique in its approach to campus heritage, these efforts have unfolded across a number of years and evolved to meet the ever changing need to present the rich and well studied heritage and archaeology of Michigan State University's historic campus.

Ultimately, the goal of this paper is not only to present and discuss the efforts at Michigan State University, but to provide a potential model for other university campuses interested in leveraging mobile technology to produce engaging digital heritage and archaeology experiences.

Watrall-Towards an Approach to Building Mobile Digital Experiences-211_a.pdf
Watrall-Towards an Approach to Building Mobile Digital Experiences-211_c.pdf

5:00pm - 5:30pm
Long Paper (20+10min) [publication ready]

Zelige Door on Golborne Road: Exploring the Design of a Multisensory Interface for Arts, Migration and Critical Heritage Studies

Alda Terracciano

University College London

In this paper I discuss the multisensory digital interface and art installation Zelige Door on Golborne Road as part of the wider research project ‘Mapping Memory Routes: Eliciting Culturally Diverse Collective Memories for Digital Archives’. The interface is conceived as a tool for capturing and displaying the living heritage of members of Moroccan migrant communities, shared through an artwork composed of a digital interactive sensorial map of Golborne Road (also known as Little Morocco), which includes physical objects related to various aspects of Moroccan culture, each requiring a different sense to be experienced (smell, taste, sight, hearing, touch). Augmented Reality (AR) and olfactory technologies have been used in the interface to superimpose pre-recorded video material and smells onto the objects. As a result, the neighbourhood is represented as a living museum of cultural memories expressed in the form of artefacts, sensory stimulation and narratives of citizens living, working or visiting the area. Based on a model I developed for the multisensory installation ‘Streets of...7 cities in 7 minutes’, the interface was designed with Dr Mariza Dima (HCI designer), and Prof. Monica Bordegoni and Dr Marina Carulli (olfactory technology designers) to explore new methods able to elicit cultural Collective Memories through the use of multi-sensory technologies. The tool is also aimed at stimulating collective curatorial practices and democratising decision-making processes in urban planning and cultural heritage.

Terracciano-Zelige Door on Golborne Road-275_a.pdf
Terracciano-Zelige Door on Golborne Road-275_c.pdf
4:00pm - 5:30pmT-PIII-3: Computational Literary Analysis
Session Chair: Mads Rosendahl Thomsen
4:00pm - 4:30pm
Long Paper (20+10min) [abstract]

A Computational Assessment of Norwegian Literary “National Romanticism”

Ellen Rees

University of Oslo

In this paper, I present findings derived from a computational analysis of texts designated as “National Romantic” in Norwegian literary historiography. The term “National Romantic,” which typically designates literary works from approximately 1840 to 1860 that are associated with national identity formation, first appeared decades later, in Henrik Jæger’s Illustreret norsk litteraturhistorie from 1896. Cultural historian Nina Witoszek has on a number of occasions written critically about the term, claiming that it is misleading because the works it denotes have little to do with larger international trends in Romanticism (see especially Witoszek 2011). Yet, with the exception of a 1985 study by Asbjørn Aarseth, it has never been interrogated systematically in the way that other period designations such as “Realism” or “Modernism” have. Nor does Aarseth’s investigation attempt to delimit a definitive National Romantic corpus or account for the remarkable disparity among the works that are typically associated with the term. “National Romanticism” is like pornography—we know it when we see it, but it is surprisingly difficult to delineate in a scientifically rigorous way.

Together with computational linguist Lars G. Johnsen and research assistants Hedvig Solbakken and Thomas Rasmussen, I have prepared a corpus of 217 texts that are mentioned in connection with “National Romanticism” in the major histories of Norwegian literature and in textbooks for upper secondary instruction in Norwegian literature. I will briefly discuss some of the logistical challenges associated with preparing this corpus.

This corpus forms the point of departure for a computational analysis employing various text-mining methods in order to determine to what degree the texts most commonly associated with “National Romanticism” share significant characteristics. In the popular imagination, the period is associated with folkloristic elements such as supernatural creatures (trolls, hulders), rural farming practices (shielings, herding), and folklife (music, rituals) as well as nature motifs (birch trees, mountains). We therefore employ topic modeling in order to map the frequency and distribution of such motifs across time and genre within the corpus. We anticipate that topic modeling will also reveal unexpected results beyond the motifs most often associated with National Romanticism. This process should prepare us to take the next step and, inspired by Matthew Wilkens’ recent work generating “clusters” of varieties within twentieth-century U.S. fiction, create visualizations of similarities and differences among the texts in the National Romanticism corpus (Wilkens 2016).
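
The frequency-and-distribution mapping step could be illustrated, in a much-simplified form, as follows. The project itself uses topic modeling; this stdlib-only sketch instead counts a hand-picked list of motif terms per decade, and the motif list and sample texts are invented stand-ins, not data from the 217-text corpus.

```python
# A minimal sketch of mapping motif frequency across a dated corpus.
# The project uses topic modeling; as a stdlib-only approximation we
# count occurrences of hand-picked folkloristic motif terms per decade.
# The motif list and sample texts are illustrative assumptions.
import re
from collections import Counter, defaultdict

MOTIFS = {"troll", "hulder", "seter", "birch", "mountain"}  # hypothetical seed terms


def motif_counts(text: str) -> Counter:
    """Count motif terms in a lower-cased, tokenized text."""
    tokens = re.findall(r"[a-zæøåA-ZÆØÅ]+", text.lower())
    return Counter(t for t in tokens if t in MOTIFS)


corpus = [  # (year, text) pairs; stand-ins for the actual corpus texts
    (1844, "the hulder sang by the seter under the mountain"),
    (1857, "a troll lived beyond the birch wood near the mountain"),
]

# Aggregate motif counts per decade to see distribution over time.
by_decade = defaultdict(Counter)
for year, text in corpus:
    by_decade[year // 10 * 10].update(motif_counts(text))

for decade in sorted(by_decade):
    print(decade, dict(by_decade[decade]))
```

The same aggregation could be keyed on genre instead of decade, which is the other axis of distribution the abstract mentions.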

Based on these initial computational methods, we hope to be able to answer some of the following literary historical questions:

• Are there identifiable textual elements shared by the texts in the National Romantic canon?

• What actually defines a National Romantic text as National Romantic?

• Do these texts cluster in a meaningful way chronologically?

• Is “National Romanticism” in fact meaningful as a period designation, or alternately as a stylistic designation?

• Are there other texts that share these textual elements that are not in the canon?

• If so, why? Do gender, class or ethnicity have anything to do with it?

To answer the last two questions, we need to use the “National Romanticism” corpus as a sub-corpus and “trawl-line” within the full corpus of nineteenth-century Norwegian textual culture, carrying out sub-corpus topic modeling (STM) in order to determine where similarities with texts from outside the period 1840–1860 arise (Tangherlini and Leonard 2013). For the sake of expediency, we use the National Library of Norway’s Digital Bookshelf as our full corpus, though we are aware that there are significant subsets of Norwegian textual culture that are not yet included in this corpus. Despite certain limitations, the Digital Bookshelf is one of the most complete digital collections of a national textual culture currently available.
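
The "trawl-line" idea behind sub-corpus topic modeling can be sketched in miniature: derive a term profile from the sub-corpus, then rank documents in the full corpus by similarity to that profile. Real STM infers topics first; plain term frequencies and the toy texts below are simplifying assumptions for illustration only.

```python
# A stdlib-only sketch of the "trawl-line" idea (Tangherlini & Leonard
# 2013): build a term profile from the sub-corpus, then rank documents
# in the full corpus by cosine similarity to that profile. Plain term
# frequencies stand in for inferred topics as a simplifying assumption.
import math
import re
from collections import Counter


def tf(text: str) -> Counter:
    """Term frequencies over a lower-cased, tokenized text."""
    return Counter(re.findall(r"[a-zæøå]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical stand-ins for the "National Romanticism" sub-corpus ...
subcorpus = ["the hulder by the seter", "a troll under the mountain"]
profile = sum((tf(d) for d in subcorpus), Counter())

# ... and for the full Digital Bookshelf corpus being trawled.
full_corpus = {
    "doc-a": "the troll and the hulder of the mountain",
    "doc-b": "parliamentary debates on trade tariffs",
}

ranked = sorted(full_corpus,
                key=lambda k: cosine(profile, tf(full_corpus[k])),
                reverse=True)
print(ranked[0])  # -> doc-a
```

Documents that score high against the profile but fall outside 1840–1860, or outside the canon, are exactly the candidates the last two questions above ask about.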

For the purposes of DHN 2018, this project might best be categorized as an exploration of cultural heritage, understood in two ways. On the one hand, the project is entirely based on the National Library of Norway’s Digital Bookshelf platform, which, as an attempt to archive as much as possible of Norwegian textual culture in a digital and publicly accessible archive, is in itself a vehicle for preserving cultural heritage. On the other hand, the concept of “National Romanticism” is arguably the most widespread, but least critically examined means of linking cultural heritage in Norway to a specifically nationalist agenda.


Jæger, Henrik. 1896. Illustreret norsk litteraturhistorie. Bind II. Kristiania: Hjalmar Biglers forlag.

Tangherlini, Timothy R. and Peter Leonard. 2013. “Trawling in the Sea of the Great Unread: Sub-Corpus Topic Modeling and Humanities Research.” Poetics 41.6: 725–749.

Wilkens, Matthew. 2016. “Genre, Computation, and the Varieties of Twentieth-Century U.S. Fiction.” CA: Journal of Cultural Analytics (online open-access)

Witoszek, Nina. 2011. The Origins of the “Regime of Goodness”: Remapping the Cultural History of Norway. Oslo: Universitetsforlaget.

Aarseth, Asbjørn. 1985. Romantikken som konstruksjon: tradisjonskritiske studier i nordisk litteraturhistorie. Bergen: Universitetsforlaget.

Rees-A Computational Assessment of Norwegian Literary “National Romanticism”-213_a.pdf

4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

Prose Rhythm in Narrative Fiction: the case of Karin Boye's Kallocain

Carin Östman, Sara Stymne, Johan Svedjedal

Uppsala University,


Swedish author Karin Boye’s (1900-1941) last novel Kallocain (1940) is an icily dystopian depiction of a totalitarian future. The protagonist Leo Kall first embraces this system, but for various reasons rebels against it. The peripety comes when he gives a public speech, questioning the State. It has been suggested (by the linguist Olof Gjerdman) that the novel – which is narrated in the first-person mode – from exactly this point on is characterized by a much freer rhythm (Gjerdman 1942). This paper sets out to test this hypothesis, moving on from a discussion of the concept of rhythm in literary prose to an analysis of various indicators in different parts of Kallocain and Boye’s other novels.

Work on this project started just a few weeks ago. So far we have performed preliminary experiments with simple surface indicators, such as word length, sentence length, and the proportion of punctuation marks. For all these indicators we have compared the first half of the novel (up until the speech), the second half of the novel, and, as a contrast, the "censor's addendum", a short final chapter of the novel written by an imaginary censor. For most of these indicators we find no differences between the two major parts of the novel. The only result that points to a stricter rhythm in the first half is that the proportion of long words, counted both in characters and in syllables, is considerably higher there. For instance, the percentage of words with at least five syllables is 1.85% in the first half and 1.03% in the second half.
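Such surface indicators are straightforward to compute. The sketch below is an illustration, not the authors' actual code; in particular, the syllable counter is a crude vowel-group approximation, and the sample sentence is invented.

```python
import re

VOWELS = "aeiouyåäö"  # Swedish vowels; syllables approximated as vowel groups

def syllables(word):
    """Rough syllable count: number of runs of vowels (an approximation)."""
    return len(re.findall(f"[{VOWELS}]+", word.lower()))

def indicators(text):
    """Simple surface indicators of the kind compared across the novel's halves."""
    words = re.findall(r"[a-zåäöA-ZÅÄÖ]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sentences),
        "punct_per_word": len(re.findall(r"[.,:;!?]", text)) / len(words),
        "pct_5plus_syllables": 100 * sum(syllables(w) >= 5 for w in words) / len(words),
    }

half = "Organisationen kontrollerade medborgarna. Vi lydde."
print(indicators(half))
```

Comparing the dictionaries returned for the two halves (and for the censor's addendum) reproduces the kind of contrast reported above, e.g. in `pct_5plus_syllables`.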

The other indicators that show a difference do not support the hypothesis, however. In the first half, the sections are shorter, there are proportionally more speech utterances, and there is a higher proportion of three consecutive dots (...), which are often used to mark hesitation. If we compare these two halves to the censor's addendum, however, we can clearly see that the addendum is written in a stricter way, with, for instance, a considerably higher proportion of long words (4.90% of the words have more than five syllables) and sentences more than twice as long.

In future analysis, we plan to use more fine-tuned indicators based on a dependency parse of the text, from which we can explore issues like phrase length and the proportion of sub-clauses. Separating speech from non-speech also seems important. We also plan to explore the variation in our indicators, rather than just looking at averages, since this has been suggested in the literature on rhythm in Swedish prose (Holm 2015).

Through this initial analysis we have also learned about some of the challenges of analyzing literature. For instance, it is not straightforward to separate speech from non-speech, since the ends of utterances are often not clearly marked in Kallocain, and free indirect speech is sometimes used. We think this will be important for future analysis, as will the attribution of speech (Elson & McKeown, 2010), since the speech of the different characters cannot be expected to vary between the two parts to the same degree.


Boye, Karin (1940) Kallocain: roman från 2000-talet. Stockholm: Bonniers.

Elson, David K. and McKeown, Kathleen R. (2010) Automatic Attribution of Quoted Speech in Literary Narrative. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. The AAAI Press, Menlo Park, pp 1013–1019.

Gjerdman, Olof (1942) Rytm och röst. In Karin Boye. Minnen och studier. Ed. by M. Abenius and O. Lagercrantz. Stockholm: Bonniers, pp 143–160.

Holm, Lisa (2015) Rytm i romanprosa. In Det skönlitterära språket. Ed. by C. Östman. Stockholm: Morfem, pp 215–235.


Östman-Prose Rhythm in Narrative Fiction-151_a.docx

4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

The Dostoyevskian Trope: State Incongruence in Danish Textual Cultural Heritage

Kristoffer Laigaard Nielbo, Katrine Frøkjær Baunvig

University of Southern Denmark,

In the history of prolific writers, we are often confronted with the figure of the suffering or tortured writer. Setting aside metaphysical theories, the central claim seems to be that a state-incongruent dynamic is an intricate part of the creative process. Two propositions can be derived from this claim: 1) the creative state is inversely proportional to the emotional state, and 2) the creative state is causally predicted by the emotional state. We call this creative-emotional dynamic ‘the Dostoyevskian Trope’. In this paper we present a method for studying the Dostoyevskian trope in prolific writers. The method combines Shannon entropy, as an indicator of lexical density and readability, with fractal analysis in order to measure creative dynamics over multiple documents. We generate a sentiment time series from the same documents and test for causal dependencies between the creative and sentiment time series. We illustrate the method by searching for the Dostoyevskian trope in Danish textual cultural heritage, specifically in three highly prolific writers from the 19th century: N.F.S. Grundtvig, H.C. Andersen, and S.A. Kierkegaard.
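Shannon entropy over a document's word distribution is a standard lexical-density measure and can be sketched as follows. This is a minimal illustration with invented toy strings, not the authors' method; the fractal analysis of the resulting per-document series is beyond this sketch.

```python
import math
from collections import Counter

def word_entropy(text):
    """Shannon entropy (in bits) of the word distribution: higher values
    indicate a denser, less repetitive vocabulary."""
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

repetitive = "ja ja ja ja"            # one word type -> 0 bits
varied = "tro haab kjærlighed aand"   # four equiprobable types -> 2 bits
print(word_entropy(repetitive), word_entropy(varied))
```

Computing `word_entropy` per document, ordered by publication date, yields the "creative" time series whose dynamics could then be compared against a sentiment series.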

Nielbo-The Dostoyevskian Trope-173_a.pdf

5:00pm - 5:30pm
Long Paper (20+10min) [abstract]

Interdisciplinary advancement through the unexpected: Mapping gender discourses in Norway (1840-1913) with Bokhylla

Heidi Karlsen

University of Oslo,


Ph.D. Candidate in literature, Cand.philol. in philosophy


This presentation discusses challenges related to sub-corpus topic modeling in the study of gender discourses in Norway from 1840 till 1913 and the role of interdisciplinary collaboration in this process. Through collaboration with the Norwegian National Library, data-mining techniques are used in order to retrieve data from the digital source, Bokhylla [«the Digital Bookshelf»], for the analysis of women’s «place» in society and the impact of women writers on this discourse. My project is part of the research project «Data-mining the Digital Bookshelf», based at the University of Oslo.

1913, the closing year of the period I study, is the year of women’s suffrage in Norway. I study the impact women writers had on the debate in Norway regarding women’s «place» in society, during the approximately 60 years before women were granted the right to vote. A central hypothesis for my research is that women writers in the period had an underestimated impact on gender discourses, especially in defining and loading key words with meaning (drawing on mainly Norman Fairclough’s theoretical framework for discourse analysis). In this presentation, I examine a selection of Swedish writer Fredrika Bremer’s texts, and their impact on gender discourses in Norway.

The Norwegian National Library’s Digital Bookshelf is the main source of historical documents for this project. The Digital Bookshelf includes a vast amount of text published in Norway over several centuries, in a great variety of genres, and thus offers unique access to our cultural heritage. Sub-corpus topic modeling (STM) is the main tool used to process the Digital Bookshelf texts for this analysis. A selection of Bremer’s work has been assembled into a sub-corpus; topics have then been generated from this corpus and applied to the full Digital Bookshelf corpus. Throughout the process, the collaboration with the National Library has been essential in overcoming technical challenges, and I will reflect upon this collaboration in my presentation. As the data are retrieved and then analyzed by me as a humanities scholar, and weaknesses in the data are detected, the programmer at the National Library assisting us on the project presents, modifies and develops tools to meet our challenges. These tools might in turn open possibilities beyond what they were proposed for, and new ideas for my research design may emerge as a result. Concurrently, the algorithms created at such a stage in the process might subsequently be useful for scholars in completely different research projects. I will mention a few examples of such mutually productive collaboration and briefly reflect upon how these issues relate to questions of open science.

In this STM process, several challenges have emerged along the way, mostly related to OCR errors. Some illustrative examples of passages with such errors will be presented, both to discuss the measures undertaken to address the problems they give rise to and to demonstrate the unexpected progress stemming from these «defective» data. The topics used as a «trawl line»(1) in the initial phase of this study produced few results. Our first attempt to obtain more results was to revise down the required Jaccard similarity(2); that is, we lowered the proportion of a topic that must be identified in a passage for it to qualify as a hit. Once this threshold was lowered, a great number of results were obtained. The obvious weakness of these results, however, is that the rather low required topic match does not allow us to affirm a connection between these passages and Bremer’s texts. Nevertheless, the results have still been useful, for two reasons. Some of the data have proven to be valuable sources for the mapping of gender discourses, although they indicate nothing about women writers’ impact on them. Moreover, these passages have served to illustrate many of the varieties of OCR errors that my topic words give rise to in text from the period I study (frequently set in Gothic typeface). This discovery has then been used to improve the topics, which takes us to the next step in the process.

In certain documents, one and the same word in the original text has given rise to as many as three different OCR errors in the scanning of the document(3). This discovery indicates the risk of missing potentially relevant documents in the «great unread»(4): if only the correct spellings of the words are included in the topics, potentially valuable documents containing our topic words, bizarrely spelled because of scanning errors, might go unnoticed. In an attempt to meet this challenge, I have manually added to the topic the different versions of the words that the OCR errors have given rise to (for instance, for the word «kjærlighed» [love]: «kjaerlighed», «kjcerlighed», «kjcrrlighed»). In that case, when we run the topic model, we cannot require a one hundred percent topic match, perhaps not even 2/3, as all these OCR variants of the same word are highly unlikely to occur in any one matching passage(5). Such extensions of the topics thus condition our parameterization of the algorithm: the required Jaccard similarity for a passage to be captured has to be revised down considerably. The inconvenience of this approach, however, is the possibly high number of captured passages that are exaggeratedly (for our purposes) saturated with the semantic unit in question. Furthermore, if we add the different forms of a lexeme and its semantic relatives that are in some cases included in the topic, such as «kvinde», «kvinder», «kvindelig», «kvindelighed» [woman, women, feminine, femininity], the topic in question might capture an even larger number of passages dense in this specific semantic unit and its variants, an amount that is not proportional to the overall variety of the topic in question.
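One simple reading of the Jaccard-based matching can be sketched as follows. This is illustrative only: the OCR variant forms are those quoted above, but the sample passage and the lowered threshold value are invented for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two word sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Topic manually extended with observed OCR variants of «kjærlighed»:
topic = {"kjærlighed", "kjaerlighed", "kjcerlighed", "kjcrrlighed", "kvinde"}

# A hypothetical passage from the target corpus (one OCR-garbled form):
passage = set("en kvinde talte om kjcerlighed og pligt".split())

score = jaccard(topic, passage)
print(round(score, 3))

# Because a passage can never contain ALL OCR variants of one word,
# requiring a full match is impossible: the threshold must be revised down.
hit = score >= 0.2
```

The trade-off described above is visible here: padding the topic with variants inflates the union, so the same genuine hit yields a lower score, forcing a lower (noisier) threshold.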

This takes us back to the question of what we program the «trawl line» to «require» for a passage in the target corpus to qualify as a hit, and of how the scores are ranked. How many of the topic’s words must occur, and to what extent do repeated occurrences of a single topic word (e.g., five occurrences of «woman» in one paragraph) interest us? The parameter can be set to rank scores as a function of the occurrences of the different words forming the topic, so that the score for a topic in a captured passage is proportional to the heterogeneity of the occurrences of the topic’s words, not only their quantity. However, in some cases we might, as mentioned, have a topic comprising several forms of the same lexeme and its semantic relatives and, as described, several versions of the same word due to OCR errors. How can the topic model be programmed to take such occurrences into account in the search for matching passages? To meet this challenge, a «hyperlexeme-sensitive» algorithm has been created(6). This means that the topic model is parameterized to count the lexeme frequency in a passage. It also ranks the scores as a function of occurrences of the hyperlexeme, and does not treat occurrences of different forms of one lexeme equally with those of more semantically heterogeneous word units in the topic. Furthermore, and this is the point to be stressed, the algorithm is programmed to treat misspellings of words due to OCR errors as if they were different versions of the same hyperlexeme.
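The hyperlexeme idea can be sketched as a mapping from surface forms to canonical units. The variant table below is hypothetical (built from the forms quoted in this abstract), and the scoring is a simplification of the actual algorithm.

```python
from collections import Counter

# Hypothetical variant table: each surface form (inflection or OCR error)
# maps to one hyperlexeme.
HYPERLEXEME = {
    "kvinde": "KVINDE", "kvinder": "KVINDE", "kvindelig": "KVINDE",
    "kjærlighed": "KJÆRLIGHED", "kjcerlighed": "KJÆRLIGHED",
}

def hyperlexeme_score(passage_words):
    """Count topic hits per hyperlexeme, so three forms of 'kvinde' count
    as repeated hits on ONE unit, not three distinct topic words."""
    hits = Counter(HYPERLEXEME[w] for w in passage_words if w in HYPERLEXEME)
    return {
        "distinct_units": len(hits),      # heterogeneity of the match
        "total_hits": sum(hits.values()), # raw saturation
    }

words = "kvinde kvinder kvindelig kjcerlighed".split()
print(hyperlexeme_score(words))
```

Ranking by `distinct_units` rather than `total_hits` is what keeps a passage saturated with one semantic unit from outscoring a passage that matches the topic's actual variety.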

The adjustment of the Jaccard similarity threshold and the hyperlexeme parameterization are thus measures undertaken to compensate for the inconveniences mentioned above and to improve and refine the topic model. I will show examples comparing results before and after these parameters were applied, in order to discuss how much closer we have come to establishing actual links between the sub-corpus and the passages the topics have captured in the target corpus. All technical concepts will be defined and briefly explained as I reach them in the presentation. The genesis of these measures, tools and ideas at crucial moments in the process, arising from unexpected findings and interdisciplinary collaboration, will be elaborated on in my presentation, as well as the potential this might offer for new research.


(1) My description of the STM process, with the use of tropes such as «trawl line» is inspired by Peter Leonard and Timothy R. Tangherlini (2013): “Trawling in the Sea of the Great Unread: Sub-corpus topic modeling and Humanities research” in Poetics. 41, 725-749

(2) The Jaccard index is taken into account in the ranking of the scores. The best hit for a topic, the passage with the highest score, will be the one with the highest relative similarity to the other captured passages in terms of the concentration of topic words in the passage. The parameterized value of the required Jaccard similarity defines the score a passage must receive in order to be included in the list of captured passages from the «great unread».

(3) Some related challenges were described by Kimmo Kettunen and Teemu Ruokolainen in their presentation, «Tagging Named Entities in 19th century Finnish Newspaper Material with a Variety of Tools» at DHN2017.

(4) Franco Moretti (2000) (drawing on Margaret Cohen) calls the enormous number of works that exist in the world «the great unread» (limited to Bokhylla’s content in the context of my project) in: «Conjectures on World Literature» in New Left Review. 1, 54-68.

(5) As an alternative to including in the topic all detected spelling variations of the topic words due to OCR errors, we will experiment with taking the Levenshtein distance into account when programming the «trawl line». In that case it is not identity between a topic word and a word in a passage of the great unread that matters, but the distance between the two words: the minimum number of single-character edits required to change one word into the other, for instance «kuinde» -> «kvinde».
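The Levenshtein distance mentioned in note (5) has a standard dynamic-programming implementation; the sketch below illustrates how the OCR variant «kuinde» sits at distance 1 from «kvinde».

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) required to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kuinde", "kvinde"))  # one substitution: u -> v
```

A trawl line could then accept any passage word within, say, distance 1 or 2 of a topic word instead of enumerating OCR variants by hand (the exact cutoff being a tuning decision, not something specified in the abstract).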

(6) By the term «hyperlexeme» we understand a collection of graphemic occurences of a lexeme, including spelling errors and semantically related forms.

Karlsen-Interdisciplinary advancement through the unexpected-240_a.pdf
4:00pm - 5:30pmT-PIV-3: Legal and Ethical Matters
Session Chair: Christian-Emil Smith Ore
4:00pm - 4:30pm
Long Paper (20+10min) [abstract]

Breaking Bad (Terms of Service)? The DH-scholar as Villain

Pelle Snickars

Umea University,

For a number of years I have been heading a major research project on Spotify (funded by the Swedish Research Council). It takes a software studies and digital humanities approach towards streaming media, and engages in reverse engineering Spotify’s algorithms, aggregation procedures, metadata, and valuation strategies. During the summer of 2017 I received an email from a Spotify legal counsel who was ”concerned about information it received regarding methods used by the responsible group of researchers in this project. This information suggests that the research group systematically violated Spotify’s Terms of Use by attempting to artificially increase plays, among others, and to manipulate Spotify’s services with the help of scripts or other automated processes.” I responded politely—but got no answer. A few weeks later, I received a letter from the senior legal advisor at my university. Spotify had apparently contacted the Research Council with the claim that our research was questionable in a way that would demand “resolute action”, and the possible termination of financial and institutional support. At the time of writing it is unclear if Spotify will file a lawsuit—or start a litigation process.

DH-research is embedded in ’the digital’—and so are its methods, from scraping web content to the use of bots as research informants. Within scholarly communities centered on the study of the web or social media there is a rising awareness of the ways in which digital methods might be non-compliant with commercial Terms of Service (ToS)—a discussion which has not yet filtered out and been taken seriously within the digital humanities. However, DH-researchers will in years to come increasingly have to ask themselves if their scholarly methods need to abide by ToS—or not. As social computing researcher Amy Bruckman has stated, it might have profound scholarly consequences: ”Some researchers choose not to do a particular piece of work because they believe they can’t violate ToS, and then another researcher goes and does that same study and gets it published with no objections from reviewers.”

My paper will recount my legal dealings with Spotify—including a discussion of the digital methods used in our project—but also reflect more generally on the ethical implications of collecting data in novel ways. ToS are contracts—not the law; still, there is a dire need for ethical justification and scholarly discussion of why the importance of academic research may justify breaking ToS.

Snickars-Breaking Bad (Terms of Service) The DH-scholar as Villain-100_a.pdf
Snickars-Breaking Bad (Terms of Service) The DH-scholar as Villain-100_c.pdf

4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

Legal issues regarding tradition archives: the Latvian case study

Liga Abele, Anita Vaivade

Latvian Academy of Culture,

Historically, tradition archives have developed rather apart, both in form and in substance, from the formation of other types of cultural heritage collections held by “traditional” archives, museums and libraries. However, for current trends in the development of technical and institutional capacities to exert their full positive influence on managerial approaches, there must be increased legal certainty regarding the status and functioning of tradition archives. There are several trajectories through which tradition archives can be and are influenced by the surrounding legal and administrative framework, both at national and regional level. A thorough knowledge of the impact of the existing regulatory base can contribute to informed decision-making consistent with the role these archives play in safeguarding intangible cultural heritage. The paper presents a case study of the current Latvian situation within a broader regional perspective of the three Baltic states. The legal framework of interest is defined by the institutional status of tradition archives, the legal status of the collections, as well as legal provisions and restrictions regarding work that involves gathering, processing and further use of archive material.

The paper is based on data gathered within the EUSBSR Seed Money Facility project “DigArch_ICH. Connecting Digital Archives of Intangible Heritage” (No. S86, 2016/2017), executed in partnership among the Institute of Literature, Folklore and Art of the University of Latvia, the Estonian Folklore Archives, the Estonian Literary Museum, the Norwegian Museum of Cultural History and the Institute for Language and Folklore in Sweden. One of the project's thematic lines dealt with legal and ethical issues, surveying national experiences regarding legal concerns and restrictions on the work of tradition archives, the legal status of their collections, practices of signing written agreements between researcher and informant, and existing codes of ethics applicable to the work of tradition archives. Responses were received from altogether 21 institutions in 11 countries of the Baltic Sea region and neighbouring countries.

Fields of the legislation involved.

There are several fields of national legislation that influence the work of tradition archives, such as the regulations on intangible heritage, documentary heritage, work of memory institutions (archives, museums, libraries), intellectual property and copyright, as well as protection of personal data. Depending on the national legislative choices and contexts, these fields may be regulated by separate laws, or overarching laws (for instance, in the field of cultural heritage, covering both intangible as well as documentary heritage protection), or some of these fields may remain uncovered by the national legislation.

According to the results of the survey, the legal status of tradition archives can be rather diverse. They can be part of larger institutions such as universities, museums, or libraries. In the Latvian situation, there are specific laws for each type of the above-mentioned institutions, which can entail a large variety of rule sets. The status of the collections can also differ depending on whether they are recognised as part of the protected national treasures (such as the national collections of museums, archives, etc.). The ownership status can likewise be rather diverse, encompassing collections belonging to the state as well as privately owned ones; moreover, ownership rights to the same collection can be split between various types of owners of similar or varied legal status. The paper proposes to identify and analyse, in the Latvian situation, the consequences for the collections of tradition archives depending on the institutional status of their holder, their ownership status, and the influence exercised by legislation in the fields of copyright and intellectual property law as well as data protection. The Latvian case will be put into perspective against the Estonian and Lithuanian situations.

International influence on the national legislation.

National legislation is influenced by international normative instruments at different levels, ranging from the global perspective (UNESCO) to the regional level, in this case the European scope. At the global level there are several instruments ranging from legally binding instruments to “soft law”, such as the 2003 UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage or the 2015 UNESCO Recommendation Concerning the Preservation of, and Access to, Documentary Heritage Including in Digital Form. Concerning the work of tradition archives, the 2003 Convention relates in particular to the documentation of intangible cultural heritage and the establishment of national inventories of such heritage, and in this regard tradition archives may have or establish a role in national policy implementation processes. European regional legislation and policy documents, adopted either by the Council of Europe or by the European Union, are also of relevance. They concern the field of cultural heritage (with a general direction towards an integrated approach across cultural heritage fields), as well as personal data protection and copyright and intellectual property rights. The role of legally binding legal instruments of the European Union, such as directives and regulations, will be examined through the perspective of national legislation related to tradition archives.

Aspects of deontology.

As varied deontological aspects affect the functioning of tradition archives, these issues will be examined in the paper. There are national codes of ethics that may apply to the work of tradition archives, either from the perspective of research or in relation to archival work. Within the field of intangible cultural heritage, issues of ethics have also been debated internationally over recent years, with recognised topicality for the different stakeholders involved. Thus, the UNESCO Intergovernmental Committee for the Safeguarding of the Intangible Cultural Heritage adopted the Ethical Principles for Safeguarding Intangible Cultural Heritage in 2015. This document provides recommendations to the various persons and institutions involved in safeguarding activities, which also concerns the work of tradition archives. There are also international deontology documents concerning the work of archives, museums and libraries. These documents will be referred to in a complementary manner, taking into consideration the specificity of tradition archives: namely, the 1996 International Council on Archives (ICA) Code of Ethics. Although this code of ethics does not highlight archives that deal with folklore and traditional culture materials, it nevertheless sets general principles for archival work and cooperation among archives, and also places emphasis on the preservation of documentary heritage. Another important deontological reference for tradition archives concerns the work of museums, which is particularly significant for archives functioning as units within larger institutions, namely museums. An internationally well-known and often cited reference is the 2004 (1986) International Council of Museums (ICOM) Code of Ethics for Museums. Reference may also be made to the 2012 International Federation of Library Associations (IFLA) Code of Ethics for Librarians and other Information Workers.

Abele-Legal issues regarding tradition archives-158_a.pdf

4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

Where are you going, research ethics in Digital Humanities?

Sari Östman, Elina Vaahensalo, Riikka Turtiainen

University of Turku,

1 Background

In this paper we will examine the current state and future development of research ethics among Digital Humanities. We have analysed

a) ethics-focused inquiries with researchers in a multidisciplinary consortium project (CM)

b) Digital Humanities-oriented journals, and

c) the objectives of the DigiHum Programme at the Academy of Finland, the ethical guidelines of the AoIR (Association of Internet Researchers, which published extensive sets of ethical guidelines for online research in 2002 and 2012), and academic ethics boards and committees, in particular the one at the University of Turku. We also plan to analyse the requests for comments that have not been approved by the ethics board at the University of Turku; for that, we need research permission from the administration of the University of Turku, which is in process.

Östman and Vaahensalo work in the consortium project Citizen Mindscapes (CM), which is part of the Academy of Finland’s Digital Humanities Programme. University Lecturer Turtiainen is using a percentage of her work time for the project.

In the Digital Humanities programme memorandum, ethical examination of the research field is mentioned as one of the main objectives of the programme (p. 2). The CM project has a work package for researching research ethics, led by Östman. We aim to examine the current understanding of ethics in multiple disciplines, in order to find tools for more extensive ethical consideration, especially in multidisciplinary environments. Such a toolbox would bring more transparency to multidisciplinary research.

Turtiainen and Östman have already started developing the ethical toolbox for online research in their earlier publications (see e.g. Turtiainen & Östman 2013; Östman & Turtiainen 2016; Östman, Turtiainen & Vaahensalo 2017). The current phase takes the research on research ethics to a more analytical level.

2 Current research

When we are discussing a field of research such as Digital Humanities, it is quite clear that online-specific research ethics (Östman & Turtiainen 2016; Östman, Turtiainen & Vaahensalo 2017) plays an especially significant role in it. Research projects often concentrate on one source or topic with a multidisciplinary take: understandings of research ethics may fundamentally vary even within the same research community. Different ethical focal points and varying understandings could be a multidisciplinary resource, but it is essential to recognize and pay attention to the varying disciplinary backgrounds as well as the online-specific research contexts. Only by taking these matters into consideration are we able to create functional ethical guidelines for multidisciplinary online-oriented research.

The Inquiries in CM24

On the basis of the two rounds of ethical inquiry within the CM24 project, the researchers seemed to consider most focal such ethical matters as anonymization, dependence on corporations, co-operation with other researchers, and preservation of the data. Judging by the answers, ethical views seemed to be:

a) individually constructed: the topic of research, methods, data plus the personal view to what might be significant

b) based on one’s education and discipline tradition

c) raised from the topics and themes the researcher had come in touch with during the CM24 project (and in similar multidisciplinary situations earlier)

One thing that seems to be happening with the current trend of big data usage is that even individually produced online material is seen as a mass: faceless, impersonal data, available to anyone and everyone. This ethical discussion was already under way in the early 2000s (see e.g. Östman 2007, 2008; Turtiainen & Östman 2009), when researchers first turned their interest to online material. It was not then, and is not now, ethically sustainable research to treat the private, life- and everyday-based content of individual people as ’take and run’ data. However, this seems to be happening again, especially in disciplines where ethics has mostly focused on copyrights and perhaps corporate and co-operative relationships. (In CM24, for example, information science seems to be one of the disciplines where intimate data is used as a faceless mass.) Then again, a historian in the project argues in their answer that already choosing an online discussion as an object of research is an ethical choice, ”shaping what we can and should count in into the actual research”.

Neither of the above-mentioned ethical views is faulty. However, it might be difficult for these two researchers to find a common understanding about ethics when, for example, writing a paper together. A multifaceted, generalized collection of guidelines for multidisciplinary research would probably be of help.

Digital Humanities Journals and Publications

To explore ethics in digital humanities, we needed a diverse selection of publications representing research in Digital Humanities. Nine digital humanities journals were chosen for analysis, based on the listing compiled by Digital Humanities at Berkeley. The focus of these journals varies from pedagogy to literary studies, but they are all digital humanities oriented. The longest-running journal on the list has been published since 1986, and the most recent journals first appeared in 2016. The journals therefore cover the relatively long history of digital humanities and a wide range of multi- and interdisciplinary topics.

In the journals and the articles published in them, research ethics is clearly sidelined, even though it is not entirely ignored. In the publications, research ethics is largely taken into account in the form of source criticism. Big data, digital technology, copyright issues related to research materials, and multidisciplinary cooperation are the most common examples of research-ethical considerations. Databases, text digitization and web archives are also discussed in the publications. These examples show that research ethics does concern digital humanities, but in practice, discussion of research ethics is relatively scarce in the publications.

Publications of the CM project were also examined, including some of our own articles. Apart from one research-ethics-oriented article (Östman & Turtiainen 2016), most of the publications have a historical point of view (Suominen 2016; Suominen & Sivula 2016; Saarikoski 2017; Vaahensalo 2017). For this reason, research ethics is reflected mainly in the form of source criticism and transparency. Ethics is not discussed at greater length in these articles than in most of the examined digital humanities publications.

In this area, too, a multifaceted, generalized collection of guidelines for multidisciplinary research would probably be of benefit: it is essential to increase transparency in research reporting, especially in Digital Humanities, which is complicated and multifaceted in its disciplinary nature. More thorough reporting of ethical matters would therefore increase the transparency of the nature of Digital Humanities itself.

The Ethics Committee

The Ethics committee of the University of Turku follows the development in the field of research ethics both internationally and nationally. The mission of the committee is to maintain a discussion on research ethics, enhance the realisation of ethical research education and give advice on issues related to research ethics. At the moment its main purpose is to assess and give comments on the research ethics of non-medical research that involves human beings as research subjects and can cause either direct or indirect harm to the participants.

Legislation on protecting the personal data of private citizens appears to be a significant aspect of research ethics. Turtiainen (a member of the committee) states that, at the current point, one of the main concerns seems to be poor data protection. The registers constructed from the informant base are often neglected in the humanities, whereas disciplines such as psychology and welfare research consider them on a regular basis. Then again, the other disciplines do not necessarily consider other aspects of vulnerability as deeply as the (especially culture- and tradition-oriented) humanists seem to do.

Our aim is to analyse requests for comments that have not been approved and have therefore been sent back for modification before recommendation or re-evaluation. Our interest focuses on the arguments that caused the rejection. Before that phase of our study we need a research permit of our own from the administration of the University of Turku, which is in progress. It would be an interesting viewpoint to compare the rejected requests for comments from the ethics committee with the results of the ethical inquiries within the CM24 project and the outline of research ethics in digital humanities journals and publications.

3 Where do you go now…

According to our current study, the position of research ethics in Digital Humanities and, more widely, in multidisciplinary research seems to be somewhat twofold:

a) for example, in the Digital Humanities Programme of the Academy of Finland, the significance of ethics is strongly emphasized, and the research projects within the programme are encouraged to increase their ethical discussions and the transparency of those discussions. The discourse about, and interest in, developing online-oriented research ethics seems to be growing, suggesting that ’something should be done’: ethical matters should be present in research projects in a more extensive way.

b) however, it seems that in practice the position of research ethics has not changed much within the last 10 years or so, despite the fact that the digital research environments of the humanities have become increasingly multidisciplinary, which leads to multiple understandings of ethics even within individual research projects. Yet ethics is not discussed in research reports at any greater length or depth than before. Even in Digital Humanities oriented journals, ethics is mostly present in a paragraph or two, repeating a few similar concerns in a way that at times seems almost ’automatic’; that is, as if the ethical discussion had been hastily added ’on the surface’ because it is required from the outside.

This is an interesting situation. There is a possibility that researchers are not taking the significance of ethical focal points in their research seriously. This is, however, an argument that we would not wish to make. We consider it more likely that in the ever-changing digital research environment, researchers lack multidisciplinary tools for analyzing and discussing ethical matters in the depth that is needed. By examining the current situation extensively, our study aims to identify the focal ethical matters in multidisciplinary research environments and to construct at least a basic toolbox for ethical discussions in Digital Humanities research.

Sources and Literature

Inquiries made by Östman, Turtiainen and Vaahensalo with the researchers of the Citizen Mindscapes 24 project. Two rounds in 2016–2017.

Digital Humanities (DigiHum). Academy Programme 2016–2019. Programme memorandum. Helsinki: Academy of Finland.

Digital Humanities journals listed by Digital Humanities at Berkeley.

Markham, Annette & Buchanan, Elizabeth 2012: Ethical Decision-Making and Internet Research: Recommendations from the AoIR Ethics Working Committee (Version 2.0).

Saarikoski, Petri: “Ojennat kätesi verkkoon ja joku tarttuu siihen”. Kokemuksia ja muistoja kotimaisen BBS-harrastuksen valtakaudelta. Tekniikan Waiheita 2/2017.

Suominen, Jaakko (2016): ”Helposti ja halvalla? Nettikyselyt kyselyaineiston kokoamisessa.” In: Korkiakangas, Pirjo, Olsson, Pia, Ruotsala, Helena, Åström, Anna-Maria (eds.): Kirjoittamalla kerrotut – kansatieteelliset kyselyt tiedon lähteinä. Ethnos-toimite 19. Ethnos ry., Helsinki, 103–152. [Easy and Cheap? Online surveys in cultural studies.]

Suominen, Jaakko & Sivula, Anna (2016): “Digisyntyisten ilmiöiden historiantutkimus.” In Elo, Kimmo (ed.): Digitaalinen humanismi ja historiatieteet. Historia Mirabilis 12. Turun Historiallinen Yhdistys, Turku, 96–130. [Historical Research of Born Digital Phenomena.]

Turtiainen, Riikka & Östman, Sari 2013: Verkkotutkimuksen eettiset haasteet: Armi ja anoreksia. In: Laaksonen, Salla-Maaria et. al. (eds.): Otteita verkosta. Verkon ja sosiaalisen median tutkimusmenetelmät. Tampere: Vastapaino. pp. 49–67.

Turtiainen, Riikka & Östman, Sari 2009: ”Tavistaidetta ja verkkoviihdettä – omaehtoisten verkkosisältöjen tutkimusetiikkaa.” In: Grahn, Maarit & Häyrynen, Maunu (eds.): Kulttuurituotanto – Kehykset, käytäntö ja prosessit. Tietolipas 230. SKS, Helsinki, 336–358.

Vaahensalo, Elina: Kaikenkattavista portaaleista anarkistiseen sananvapauteen – Suomalaisten verkkokeskustelufoorumien vuosikymmenet. Tekniikan Waiheita 2/2017.

Östman, Sari 2007: ”Nettiksistä blogeihin: Päiväkirjat verkossa.” Tekniikan Waiheita 2/2007. Tekniikan historian seura ry. Helsinki. 37–57.

Östman, Sari 2008: ”Elämäjulkaiseminen – omaelämäkerrallisten traditioiden kuopus.” Elore, vol. 15-2/2008. Suomen Kansantietouden Tutkijain Seura.

Östman, Sari & Turtiainen, Riikka 2016: From Research Ethics to Researching Ethics in an Online Specific Context. In Media and Communication, vol. 4, iss. 4, pp. 66–74.

Östman, Sari, Riikka Turtiainen & Elina Vaahensalo 2017: From Online Research Ethics to Researching Online Ethics. Poster. Digital Humanities in the Nordic Countries 2017 Conference.

Östman-Where are you going, research ethics in Digital Humanities-180_a.pdf

5:00pm - 5:15pm
Short Paper (10+5min) [abstract]

Copyright exceptions or licensing: how can a library acquire a digital game?

Olivier Charbonneau

Concordia University,

Copyright, caught in a digital maelstrom of perpetual reforms and shifting commercial practices, exacerbates tensions between cultural stakeholders. On the one hand, copyright seems to be drowned in Canada and the USA by the role reserved to copyright exceptions by parliaments and the courts. On the other, institutions, such as libraries, are keen to navigate digital environments by allocating their acquisitions budgets to digital works. How can markets, social systems and institutions emerge or interact if we are not able to resolve this tension?

Beyond the paradigm shifts brought by digital technologies or globalization, one must recognize the conceptual paradox surrounding digital copyrighted works. In economic terms, they behave naturally as public goods, while copyright attempts to restore their rivalrousness and excludability. Within this paradox lies tension, between the aggregate social wealth spread by a work and its commoditized value, between network effects and reserved rights.

In this paper, I will summarize the findings of my doctoral research project and apply them to the case of digital games in libraries.

The goal of my doctoral work was to ascertain the role of libraries in the markets and social systems of digital copyrightable works. Ancillary goals included exploring the “border” between licensing and exceptions in the context of heritage institutions, as well as building a new method for capturing the complexity of the markets and social systems that stem from digital protected works. To accomplish these goals, I analysed a dataset comprising the terms and conditions of licences held by academic libraries in Québec. I show that the terms of these licences overlap with copyright exceptions, highlighting how libraries express their social mission in two normative contexts: positive law (copyright exceptions) and private ordering (licensing). This overlap is necessary yet poorly understood: the two are not competing institutional arrangements but the same image reflected in two distinct normative settings. It also provides right-holders with a road map for making digital content available through libraries.

The study also points to the rising importance of automation and computerization in the provisioning of licences in the digital world. Metadata describing the terms of a copyright licence are increasingly represented in computer models and leveraged to mobilize digital corpora for the benefit of a community. Whereas the print world was driven by assumptions and physical limits on using copyrighted works, the digital environment introduces new data points for interactions which were previously hidden from scrutiny. The future lies not in optimizing transaction costs but in crafting elegant institutional arrangements through licensing.
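As a rough sketch of what such a computational representation of licence terms might look like, consider the following (the class, field and use names are hypothetical illustrations, not the actual data model used in the study):

```python
from dataclasses import dataclass, field

@dataclass
class LicenceTerms:
    """A minimal, hypothetical model of machine-readable licence terms."""
    licensor: str
    permitted_uses: set = field(default_factory=set)  # e.g. {"text-mining"}

    def permits(self, use: str) -> bool:
        # A use is allowed only if the licence explicitly grants it.
        return use in self.permitted_uses

# A library system could then query many licences at once:
licences = [
    LicenceTerms("Publisher A", {"course-pack", "interlibrary-loan"}),
    LicenceTerms("Publisher B", {"text-mining"}),
]
miners = [l.licensor for l in licences if l.permits("text-mining")]
print(miners)  # ['Publisher B']
```

Once terms are modelled this way, overlaps with statutory exceptions can be checked programmatically rather than clause by clause.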

If libraries exist to capture some left-over value in the utility curve of our cultural, informational or knowledge markets, the current role they play in copyright need not change in the digital environment. What does change, however, is hermeneutics: how we attribute value to digital copyrighted works and how we study society’s use of them.

We conclude by transposing the results of this study to the case of digital games. Québec is currently a hotbed for both independent and AAA video game studios. Despite this, a market failure currently exists due to the absence of flexible licensing mechanisms to make indie games available through libraries. This part of the study was funded with the generous support from the Knight Foundation in the USA and conducted at the Technoculture Art & Games (TAG) research cluster of the Milieux Institute for arts, culture and technology at Concordia University in Montréal, Canada.

Charbonneau-Copyright exceptions or licensing-181_a.pdf
4:00pm - 5:30pmT-P674-3: Database Design
Session Chair: Jouni Tuominen
4:00pm - 4:30pm
Long Paper (20+10min) [publication ready]

Open Science for English Historical Corpus Linguistics: Introducing the Language Change Database

Joonas Kesäniemi1, Turo Vartiainen2, Tanja Säily2, Terttu Nevalainen2

1University of Helsinki, Helsinki University Library; 2University of Helsinki, Department of Modern Languages

This paper discusses the development of an open-access resource that can be used as a baseline for new corpus-linguistic research into the history of English: the Language Change Database (LCD). The LCD draws together information extracted from hundreds of corpus-based articles that investigate the ways in which English has changed in the course of history. The database includes annotated summaries of the articles, as well as numerical data extracted from the articles and transformed into machine-readable form, thus providing scholars of English with the opportunity to study fundamental questions about the nature, rate and direction of language change. It will also make the work done in the field more cumulative by ensuring that the research community will have continuous access to existing results and research data.

We will also introduce a tool that takes advantage of this new source of structured research data. The LCD Aggregated Data Analysis workbench (LADA) makes use of annotated versions of the numerical data available from the LCD and provides a workflow for performing meta-analytical experimentations with an aggregated set of data tables from multiple publications. Combined with the LCD as the source of collaborative, trusted and curated linked research data, the LADA meta-analysis tool demonstrates how open data can be used in innovative ways to support new research through data-driven aggregation of empirical findings in the context of historical linguistics.
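The kind of meta-analytical aggregation LADA supports can be illustrated with a minimal sketch (the periods, variant spellings and counts below are invented for illustration; the real workbench operates on annotated data tables drawn from the LCD):

```python
from collections import Counter

# Hypothetical per-publication frequency tables: variant counts per period.
study_a = {"1500-1599": Counter(done=12, doon=30),
           "1600-1699": Counter(done=45, doon=8)}
study_b = {"1500-1599": Counter(done=5, doon=11)}

def aggregate(*studies):
    """Merge frequency tables from several studies into one table."""
    merged = {}
    for study in studies:
        for period, counts in study.items():
            merged.setdefault(period, Counter()).update(counts)
    return merged

combined = aggregate(study_a, study_b)
print(combined["1500-1599"])  # Counter({'doon': 41, 'done': 17})
```

Aggregating counts across publications in this way is what makes cross-study questions about the rate and direction of change tractable.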

Kesäniemi-Open Science for English Historical Corpus Linguistics-183_a.pdf
Kesäniemi-Open Science for English Historical Corpus Linguistics-183_c.pdf

4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

“Database Thinking and Deep Description: Designing a Digital Archive of the National Synchrotron Light Source (NSLS)”

Elyse Graham

Stony Brook University,

Our project involves developing a new kind of digital resource to capture the history of research at scientific facilities in the era of the “New Big Science.” The phrase refers to the post-Cold War era at US national laboratories, when large-scale materials science accelerators rather than high-energy physics accelerators became marquee projects at most major basic research laboratories. The extent, scope, and diversity of research at such facilities make its history difficult to compile using traditional historical methods and linear narratives; there are too many overlapping and bifurcating threads. The sheer number of experiments that took place at the NSLS, and the vast amount of data it produced across many disciplines, make it nearly impossible to gain a comprehensive global view of the knowledge production that took place at this facility.

We are therefore collaborating to develop a new kind of digital resource to capture the full history of this research. This project will construct a digital archive, along with an associated website, to obtain a comprehensive history of the National Synchrotron Light Source at Brookhaven National Laboratory. The project specifically will address the history of “the New Big Science” from the perspectives of data visualization and the digital humanities, in order to demonstrate that new kinds of digital tools can archive and present complex patterns of research and configurations of scientific infrastructure. In this talk, we briefly discuss methods of data collection, curation, and visualization for a specific case project, the NSLS Digital Archive.

Graham-“Database Thinking and Deep Description-113_a.docx

4:45pm - 5:00pm
Distinguished Short Paper (10+5min) [publication ready]

Integrating Prisoners of War Dataset into the WarSampo Linked Data Infrastructure

Mikko Koho1, Erkki Heino1, Esko Ikkala1, Eero Hyvönen1,2, Reijo Nikkilä3, Tiia Moilanen3, Katri Miettinen3, Pertti Suominen3

1Semantic Computing Research Group (SeCo), Aalto University, Finland; 2HELDIG - Helsinki Centre for Digital Humanities, University of Helsinki, Finland; 3The National Prisoners of War Project

One of the great promises of Linked Data and the Semantic Web standards is to provide a shared data infrastructure into which more and more data can be imported and aligned, forming a sustainable, ever-growing knowledge graph or linked data cloud, the Web of Data. This paper studies and evaluates this idea in the context of the WarSampo Linked Data cloud, which provides an infrastructure for data related to the Second World War in Finland. As a case study, a new database of prisoners of war with related contents is considered, and the lessons learned are discussed in relation to traditional data publishing approaches.
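As a rough illustration of the linked data idea, a single prisoner-of-war record might be expressed as subject-predicate-object triples along the following lines (the namespace, properties and values are invented for illustration and are not WarSampo's actual data model):

```python
# Build a tiny RDF-style graph as subject-predicate-object triples.
EX = "http://example.org/"  # hypothetical namespace, not WarSampo's real one

triples = [
    (EX + "prisoner/p42", EX + "schema/name", '"Matti Virtanen"'),
    (EX + "prisoner/p42", EX + "schema/capturedAt", EX + "place/viipuri"),
    (EX + "place/viipuri", EX + "schema/label", '"Viipuri"'),
]

def to_ntriples(triples):
    """Serialize triples in a simple N-Triples-like form."""
    def term(t):
        # Literals are already quoted; everything else is an IRI.
        return t if t.startswith('"') else f"<{t}>"
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)

print(to_ntriples(triples).splitlines()[0])
```

Because the place IRI is shared, a new dataset that mentions the same place aligns with the existing cloud automatically, which is precisely the promise evaluated in the paper.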

Koho-Integrating Prisoners of War Dataset into the WarSampo Linked Data Infrastructure-253_a.pdf

5:00pm - 5:15pm
Short Paper (10+5min) [abstract]

"Everlasting Runes": A Research Platform and Linked Data Service for Runic Research

Magnus Källström1, Marco Bianchi2, Marcus Smith1

1Swedish National Heritage Board; 2Uppsala University

"Everlasting Runes" (Swedish: "Evighetsrunor") is a three-year collaboration between the Swedish National Heritage Board and Uppsala University, with funding provided by the Bank of Sweden Tercentenary Foundation (Riksbankens jubileumsfond) and the Royal Swedish Academy of Letters (Kungliga Vitterhetsakademien). The project combines philology, archaeology, linguistics, and information systems, and comprises several research, digitisation, and digital development components. Chief among these is the development of a web-based research platform for runic researchers, built on linked open data services, with the aim of drawing together disparate structured digital runic resources into a single convenient interface. As part of the platform's development, the corpus of Scandinavian runic inscriptions in Uppsala University's Runic Text Database will be restructured and marked up for use on the web, and linked against the inscriptions' entries in the previously digitised standard corpus work (Sveriges runinskrifter). In addition, photographic archives of runic inscriptions from the 19th and 20th centuries from both the Swedish National Heritage Board archives and Uppsala University library will be digitised, alongside other hitherto inaccessible archive material.

As a collaboration between a university and a state heritage agency with a small research community as its primary target audience, the project must bridge the gap between the different needs and abilities of these stakeholders, as well as resolve issues of long-term maintenance and stability which have previously proved problematic for some of the source datasets in question. It is hoped that the resulting research and data platforms will combine the strengths of both the National Heritage Board and Uppsala University to produce a rich, actively-maintained scholarly resource.

This paper will present the background and aims of the project within the context of runic research, as well as the various datasets that will be linked together in the research platform (via its corresponding linked data service) with particular focus on the data structures in question, the philological markup of the corpus of inscriptions, and requirements gathering.

Källström-Everlasting Runes-188_a.pdf
Källström-Everlasting Runes-188_c.pdf

5:15pm - 5:30pm
Distinguished Short Paper (10+5min) [abstract]

Designing a Generic Platform for Digital Edition Publishing

Niklas Liljestrand

Svenska litteratursällskapet i Finland r.f.,

This presentation describes the technical design for streamlining work with publishing Digital Editions on the web. The goal of the project is to provide a platform for scholars working with Digital Editions to independently create, edit, and publish their work. The platform is to be generic, but with set rules of conduct and processes, providing rich documentation of use.

Work on the platform started during 2016 with a rebuild of the website for Zacharias Topelius Skrifter for the mobile web (presented during DHN 2017). The work continues with building the responsive site to be easily customizable and to suit the needs of the different editions.

The platform will consist of several independent tools, such as tools for publishing, version comparison, editing, and tagging XML TEI formatted documents. Many of the tools are already available today, but they are heavily dependent on customization for each new edition and run on MS Windows only. For the existing tools, the project aims to combine and simplify them and make them platform-independent.
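As a toy example of the kind of processing such tools perform, a TEI-encoded fragment can be parsed with standard XML libraries to extract editorial markup (the fragment below is invented; real editions use much richer TEI encoding):

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# A minimal, invented TEI fragment with one editorial correction:
# <sic> records the source reading, <corr> the editor's correction.
fragment = """<p xmlns="http://www.tei-c.org/ns/1.0">
  En <choice><sic>vakker</sic><corr>vacker</corr></choice> dag.
</p>"""

root = ET.fromstring(fragment)
corrections = [(c.findtext(TEI_NS + "sic"), c.findtext(TEI_NS + "corr"))
               for c in root.iter(TEI_NS + "choice")]
print(corrections)  # [('vakker', 'vacker')]
```

A version-comparison tool of the kind described would walk such markup to show, for instance, the source reading and the edited reading side by side.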

The project will be completed within 2018 and the aim is to publish all tools and documentation as open-source.

Liljestrand-Designing a Generic Platform for Digital Edition Publishing-236_a.pdf
Liljestrand-Designing a Generic Platform for Digital Edition Publishing-236_c.pdf
5:30pm - 7:00pmDHN Annual meeting
PII, Porthania
7:30pm - 10:00pmConference dinner
Restaurant Sipuli, Kanavaranta 7
Date: Friday, 09/Mar/2018
8:00am - 9:00amBreakfast
Lobby, Porthania
9:00am - 9:15amIntroduction to the Digital & Critical Friday
Think Corner 
9:15am - 10:30amPlenary 3: Caroline Bassett
Session Chair: Johanna Sumiala
‘In that we travel there’ – but is that enough?: DH and Technological Utopianism. Watchable also remotely from PII, PIV and P674.
Think Corner 
10:30am - 11:00amCoffee break
Lobby, Porthania
11:00am - 12:00pmF-PII-1: Creating and Evaluating Data
Session Chair: Koenraad De Smedt
11:00am - 11:15am
Short Paper (10+5min) [publication ready]

Digitisation and Digital Library Presentation System – A Resource-Conscientious Approach

Tuula Pääkkönen, Jukka Kervinen, Kimmo Kettunen

National Library of Finland, Finland

The National Library of Finland (NLF) has done long-term work to digitise our unique collections and make them available. The digitisation policy defines what is to be digitised; it aims both to target rare and unique materials and to create a large corpus of certain material types. However, as digitisation resources are scarce, digitisation is planned annually and prioritised. This involves the library juggling individual researchers' needs with its own legal preservation and availability goals. The digital presentation system plays a key role, since it enables fast operation by being next to the digitisation process, and it enables a streamlined flow of material via a digital chain from production to the end users.

In this paper, we will describe our digitisation process and its cost-effective improvements, which have been recently applied at the NLF. In addition, we evaluate how we could improve and enrich our digital presentation system and its existing material by utilising results and experience from existing research efforts. We will also briefly examine the positive examples of other national libraries and identify universal features and local differences.

Pääkkönen-Digitisation and Digital Library Presentation System – A Resource-Conscientious Approach-117_a.pdf
Pääkkönen-Digitisation and Digital Library Presentation System – A Resource-Conscientious Approach-117_c.pdf

11:15am - 11:30am
Short Paper (10+5min) [publication ready]

Digitization of the collections at Ømålsordbogen – the Dictionary of Danish Insular Dialects: challenges and opportunities

Henrik Hovmark, Asgerd Gudiksen

University of Copenhagen,

Ømålsordbogen (the Dictionary of Danish Insular Dialects, henceforth DID) is a historical dictionary giving thorough descriptions of the dialects, i.e. the spoken vernacular of peasants and fishermen, on the Danish isles of Zealand, Funen and the surrounding islands. It covers the period from 1750 to 1950, the core period being 1850 to 1920. Publishing began in 1992 and the latest volume (11, kurv-lindorm) appeared in 2013, but the project was initiated in 1909 and data collection dates back to the 1920s and 1930s. The project is currently undergoing an extensive process of digitization: old, outdated editing tools have been replaced with modern ones (database, XML, Unicode), and the old, printed volumes have been extracted to XML as well and are now searchable as a single XML file. Furthermore, the underlying physical data collections are being digitized.

In the following we give a brief account of the digitization process, and we discuss a number of questions and dilemmas to which this process gives rise. The collections underlying the DID project comprise a variety of subcollections characterized by a large heterogeneity in form as well as content. The information on the paper slips is usually densified, often idiosyncratic, and normally complicated to decode, even for other specialists. The digitization process naturally points towards web publication of the collections, either alone or in combination with the edited data, but it also gives rise to a number of questions. The current digitization process being very basic, adding only a few metadata fields (1–2 or 3), we point to the obvious fact that web publication of the collections presupposes the addition of further, carefully selected metadata, taking different user needs and qualifications into account. We also discuss the relationship between edited and non-edited data from a publication perspective. Some of the paper slips are very difficult to decipher due to handwriting or idiosyncratic densification, and we point out that web publication in a raw, i.e. non-edited or non-annotated, form might be more misleading than helpful for a number of users.

Hovmark-Digitization of the collections at Ømålsordbogen – the Dictionary of Danish Insular Dialects-186_a.pdf
Hovmark-Digitization of the collections at Ømålsordbogen – the Dictionary of Danish Insular Dialects-186_c.pdf

11:30am - 11:45am
Short Paper (10+5min) [abstract]

Cultural heritage collections as research data

Toby Burrows1,2

1University of Oxford; 2University of Western Australia

This presentation will focus on the re-use of data relating to collections in libraries, museums and archives to address research questions in the humanities. Cultural heritage materials held in institutional collections are crucial sources of evidence for many disciplines, ranging from history and literature to anthropology and art. They are also the subjects of research in their own right – encompassing their form, their history, and their content, as well as their places in broader assemblages like collections and ownership networks. They can be studied for their unique and individual qualities, as Neil MacGregor demonstrated in his A History of the World in 100 Objects, but also as components within a much larger quantitative framework.

Large-scale research into the history and characteristics of cultural heritage materials is heavily dependent on the availability of collections data in appropriate formats and sufficient quantities. Unfortunately, this kind of research has been seriously limited, for the most part, by lack of access to suitable curatorial data. In some instances this is simply because collection databases have not been made fully available on the Web – as is particularly the case with art galleries and some museums. Even where databases are available, however, they often cannot be downloaded in their entirety or through bulk selections of relevant content. Data downloads are frequently limited to small selections of specific records.

Collections data are often available only in formats which are difficult to re-use for research purposes. In the case of libraries, the only export formats tend to be proprietary bibliographic schemas such as EndNote or RefCite. Even where APIs are made available, they may be difficult to use or limited in their functionality. CSV or XML downloads are relatively rare. Data licensing regimes may also discourage re-use, either by explicit limitations or by lack of clarity about terms and conditions.

Even where researchers are able to download usable data, it is very rare for them to be able to feed back any cleaning or enhancing they may have done. The cultural heritage institutions supplying the data may be unable or unwilling to accept corrections or improvements to their records. They may also be suspicious of researchers developing new digital services which appear to compete with the original database.

As a result, there has been a significant disconnect between curatorial databases and researchers, who have struggled to make effective use of what is potentially a very rich source of computationally usable evidence. One important consequence is that re-use of curatorial data by researchers often focuses on the data which are the easiest to obtain. The results are neither particularly representative nor exhaustive, and may weaken the validity of the conclusions drawn from the research.

Some recent “collections as data” initiatives have started to explore approaches to best practice for “computationally amenable collections”, with the aim of “encouraging cultural heritage organizations to develop collections and systems that are more amenable to emerging computational methods and tools”. In this presentation, I will suggest some elements of best practice for curatorial institutions in this area.

My observations will be based on three projects which are addressing these issues. The first project is “Collecting the West”, in which Western Australian researchers are working with the British Museum to deploy and evaluate the ResearchSpace software, which is designed to integrate heterogeneous collection data into a cultural heritage knowledge graph. The second project is HuNI – the Humanities Networked Infrastructure – which has been building a “virtual laboratory” for the humanities by reshaping collections data into semantic information networks. The third project – “Reconstructing the Phillipps Collection”, funded by the European Union under its Marie Curie Fellowships scheme – involved combining collections data from a range of digital and physical sources to reconstruct the histories of manuscripts in the largest private collection ever assembled.

Curatorial institutions should recognize that there is a growing group of researchers who do not simply want to search or browse a collections database. There is an increasing demand for access to collections data for downloading and re-use, in suitable formats and on non-restrictive licensing terms. In return, researchers will be able to offer enhanced and improved ways of analyzing and visualizing data, as well as correcting and amplifying collection database records on the basis of research results. There are significant potential benefits for both sides of this partnership.

Burrows-Cultural heritage collections as research data-124_a.pdf
Burrows-Cultural heritage collections as research data-124_c.pdf
11:00am - 12:00pmF-PIV-1: Manuscripts, Collections and Geography
Session Chair: Asko Nivala
11:00am - 11:15am
Distinguished Short Paper (10+5min) [abstract]

Big Data and the Afterlives of Medieval and Renaissance Manuscripts

Toby Burrows1,2, Lynn Ransom3, Hanno Wijsman4, Eero Hyvönen5,6

1University of Oxford; 2University of Western Australia; 3University of Pennsylvania; 4Institut de recherche et d'histoire des textes; 5Aalto University; 6University of Helsinki

Tens of thousands of European medieval and Renaissance manuscripts have survived until the present day. As the result of changes of ownership over the centuries, they are now spread all over the world, in collections across Europe, North America, Asia and Australasia. They often feature among the treasures of libraries, museums, galleries, and archives, and they are frequently the focus of exhibitions and events in these institutions. They provide crucial evidence for research in many disciplines, including textual and literary studies, history, cultural heritage, and the fine arts. They are also objects of research in their own right, with disciplines such as paleography and codicology examining the production, distribution, and history of manuscripts, together with the people and institutions who created, used, owned, and collected them.

Over the last twenty years there has been a proliferation of digital data relating to these manuscripts, not just in the form of catalogues, databases, and vocabularies, but also in digital editions and transcriptions and – especially – in digital images of manuscripts. Overall, however, there is a lack of coherent, interoperable infrastructure for the digital data relating to these manuscripts, and the evidence base remains fragmented and scattered across hundreds, if not thousands, of data sources.

The complexity of navigating multiple printed sources to carry out manuscript research has, if anything, been increased by this proliferation of digital sources of data. Large-scale analysis, for both quantitative and qualitative research questions, still requires very time-consuming exploration of numerous disparate sources and resources, including manuscript catalogues and databases of digitized manuscripts, as well as many forms of secondary literature. As a result, most large-scale research questions about medieval and Renaissance manuscripts remain very difficult, if not impossible, to answer.

The “Mapping Manuscript Migrations” project, funded by the Trans-Atlantic Platform under its Digging into Data Challenge for 2017-2019, aims to address these needs. It is led by the University of Oxford, in partnership with the University of Pennsylvania, Aalto University in Helsinki, and the Institut de recherche et d’histoire des textes in Paris. The project is building a coherent framework to link manuscript data from various disparate sources, with the aim of enabling searchable and browsable semantic access to aggregated evidence about the history of medieval and Renaissance manuscripts.

This framework is being used as the basis for a large-scale analysis of the history and movement of these manuscripts over the centuries. The broad research questions being addressed include: how many manuscripts have survived; where they are now; and which people and institutions have been involved in their history. More specific research focuses on particular collectors and countries.

The paper will report on the first six months of this project. The topics covered will include the new digital platform being developed, the sources of data which are being combined, the data modeling being carried out to link disparate data sources, the research questions which this assemblage of big data is being used to address, and the ways in which this evidence can be presented and visualized.

Burrows-Big Data and the Afterlives of Medieval and Renaissance Manuscripts-125_a.pdf
Burrows-Big Data and the Afterlives of Medieval and Renaissance Manuscripts-125_c.pdf

11:15am - 11:30am
Short Paper (10+5min) [abstract]

The World According to the Popes: A Geographical Study of the Papal Documents, 2005–2017

Roger Mähler, Fredrik Norén

Umeå University, Sweden,

This paper seeks to explore what an atlas of the popes would be like. Can one study places in texts to map out latent meanings of the Vatican’s political and religious ambitions, and to anticipate evolving trends? Could spatial analysis be a key to better understand a closed institution such as the papacy?

The Holy See is often associated with conservative stability. The papacy has, after all, managed to prevail while states and supranational organizations have come and gone. At the same time, the Vatican has shown remarkable capacity to adapt to scientific findings as well as a changing worldview. This complexity also reflects the geopolitical strategies of the Catholic Church. For centuries the Vatican has been conscious of geography and politics as key aspects in strengthening the Holy See and securing its position on the international scene. During the twentieth century, for example, the church state expanded its global presence. When John Paul II was elected pope in 1978, the Vatican City had full diplomatic ties with 85 states. In 2005, when Benedict XVI was elected, that number had increased to 176. Moreover, the papacy now has formal diplomatic relations with the European Union, is represented as a permanent observer to various global organizations including the United Nations, the African Union, and the World Trade Organization, and has even obtained a special membership in the Arab League (Agnew, 2010; Barbato, 2012). In fact, the emergence of an international public sphere and a global stage has been utilized by the Holy See, and has significantly increased its soft power (Barbato, 2012).

As the geopolitical conditions and ambitions of the Vatican City change, what happens to its perception of the world, of certain regions, and of places? Does the relationship between cities, countries, and regions constitute fixed historical patterns, or are these geographical structures evolving and changing as a new pope is elected? Inspired by Franco Moretti, this study departs from the notion that making connections between places and texts “will allow us to see some significant relationships that have so far escaped us” (Moretti, 1998: 3). The basis of the analysis is all English-translated papal documents from Benedictus XVI (2005–2013) and Francis (2013–), retrieved from the Vatican webpage.

Methodological Preparations: Scraping Data and Extracting Entities

From a technical point of view, the empirical material used in this study has been prepared in three steps. First, all web page documents in English have been downloaded, and the (proper) text in each document has been extracted and stored. Secondly, the places mentioned in each text document have been identified and extracted using the Stanford Named Entity Recognizer (NER) software. Thirdly, the resulting list of places has been manually reduced by merging name variations of the same place (e.g. “Sweden” and “Kingdom of Sweden”).
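The third step, merging name variants of the same place, could be sketched as follows. This is an illustrative sketch, not the authors' actual code; the alias table here is invented (the paper says the real merging was done manually).

```python
from collections import Counter

# Hypothetical alias table: maps name variants found by the NER step
# to a single canonical place name (the real table was built by hand).
ALIASES = {
    "Kingdom of Sweden": "Sweden",
    "the Holy See": "Holy See",
}

def merge_place_counts(raw_counts: Counter) -> Counter:
    """Fold counts for name variants into their canonical place name."""
    merged = Counter()
    for place, n in raw_counts.items():
        merged[ALIASES.get(place, place)] += n
    return merged

raw = Counter({"Sweden": 3, "Kingdom of Sweden": 2, "Rome": 5})
print(merge_place_counts(raw))  # Counter({'Rome': 5, 'Sweden': 5})
```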

The Vatican's communication strategies differ from, let’s say, those of the daily press or the parliamentary parties, in the sense that they take a thousand-year perspective, or work from the point of view of eternity (Hägg, 2007). This is reflected on the Vatican’s webpage, which is immensely informative. Text material from all popes since the late nineteenth century is publicly accessible online, ranging from letters, speeches, and bulls to encyclicals, all with a high optical character recognition (OCR) quality. Since the Holy See, according to Göran Hägg, has always been a “mediated one-man show”, it makes sense to focus on a corpus of texts written or spoken by the popes in order to study the Vatican’s notion of, basically, everything (Hägg, 2007: 239). The period 2005 to 2016 was chosen pragmatically because of its comprehensive volume of English-translated papal documents. Before this period, as Illustration 1 shows, you basically need to master Latin or Italian. While, for example, the English texts from John Paul II (1978–2005) amount to two million words, the corpus of Benedictus XVI (2005–2013) together with that of the current pope Francis sums up to nearly 59 million words, spread over some 5,000 documents.

Illustration 1. The table shows the change in English translated text material available at the Vatican webpage.

The text documents were extracted, or “scraped”, from the Vatican web site using scripts written in the Python programming language. The Scrapy library was used to “crawl” the web site, that is, to follow links of interest, starting from each Pope’s home page, and download each web page that contains a document in English. The site traversal (crawling) was governed by a set of rules specifying which links to follow and which target web pages (documents) to download. The links to follow included all links in the left-side navigation menu on the Pope’s home page, and the “paging” links in each referenced page. These links were easily identified using commonalities in the link URLs, and the web pages with the target text documents (in HTML) were likewise identified by links matching the pattern “.../content/name-of-pope/en/.../documents/”. The BeautifulSoup Python library was finally used to extract and cleanse the actual text from the downloaded web pages. (The text was easily identified by a “documento” CSS class.)
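The two selection steps described above can be sketched with the standard library alone, standing in for Scrapy and BeautifulSoup. The URL pattern and the “documento” class come from the text; the sample HTML and all names are illustrative.

```python
import re
from html.parser import HTMLParser

# Links to English papal documents share the URL pattern quoted above.
DOC_URL = re.compile(r"/content/[^/]+/en/.*/documents/")

def is_document_link(url: str) -> bool:
    """True if a crawled link points at an English papal document."""
    return DOC_URL.search(url) is not None

class DocumentoExtractor(HTMLParser):
    """Collect text inside any element carrying the 'documento' class.
    A simplified sketch: void elements (e.g. <br>) are not handled."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # >0 while inside a .documento element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "documento" in classes or self.depth:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(c for c in self.chunks if c)

page = '<div class="documento"><p>Urbi et orbi.</p></div>'
parser = DocumentoExtractor()
parser.feed(page)
print(parser.text())  # Urbi et orbi.
```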

In the next step we ran the Stanford Named Entity Recognizer on the collected text material. This software is developed by the Stanford Natural Language Processing Group, and is regarded as one of the most robust implementations of named entity recognition, that is, the task of finding, classifying and extracting (or labeling) “entities” within a text. Stanford NER uses a statistical modeling method (Conditional Random Fields, CRFs), has multiple language support, and includes several pre-trained classifier models (new models can also be trained). This study used one of the pre-trained models, the 3-class model (location, person and organization) trained on data from CoNLL 2003 (Reuters Corpus), MUC 6 and MUC 7 (newswire), ACE (newswire, broadcast news), OntoNotes (various sources including newswire and broadcast news) and Wikipedia. (This is the reason why “Hell” was not identified as a place, or why “God” was rarely identified as a person or a place. However, since the first two parts of the analysis focus on what could be labeled “earthly geography”, this was not considered a problem.) Stanford NER tags each identified entity in the input text with the corresponding class. These tagged entities were then extracted from the entire text corpus and stored in a single spreadsheet file, aggregated as the number of occurrences per entity and document. (The stored columns were document name, document year, type of document, name of pope, entity, entity class, and number of occurrences.)
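The aggregation step could look roughly like the sketch below: tagged tokens (here in the inline "word/CLASS" format that Stanford NER can emit) are counted per entity and written out as spreadsheet rows. The document metadata, sample text, and column layout details are illustrative, not the project's actual pipeline.

```python
import csv, io
from collections import Counter

def count_entities(tagged_text: str) -> Counter:
    """Count occurrences of each (entity, class) pair in tagged text."""
    counts = Counter()
    for token in tagged_text.split():
        word, _, label = token.rpartition("/")
        if label in ("LOCATION", "PERSON", "ORGANIZATION"):
            counts[(word, label)] += 1
    return counts

# Invented sample document and tagged output.
doc = {"name": "speech_2013.html", "year": 2013, "pope": "Francis"}
tagged = "We/O greet/O Rome/LOCATION and/O Jerusalem/LOCATION and/O Rome/LOCATION ./O"

rows = []
for (entity, label), n in sorted(count_entities(tagged).items()):
    rows.append([doc["name"], doc["year"], doc["pope"], entity, label, n])

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["document", "year", "pope", "entity", "class", "count"])
writer.writerows(rows)
print(buf.getvalue())
```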

Even though it was sometimes difficult to assess whether the places identified by Stanford NER were in fact persons or organizations, they were still kept for the analysis. Furthermore, abstract geographical entities such as “East”, very specific but geographically hard-to-locate ones like “Beautiful Gate of the Temple”, and entities like “Rome-Byzantium-Moscow”, which could be interpreted as a historic political alliance, were all kept for the analysis. After all, the interest of this study lies in the general connections between places, not the rare ones, which easily disappear in the larger patterns.

Papa Analytics

Based on these methodological preparations, the analysis consists of three parts, using different methods, of which the first two utilize the identified place entities. First, the study introduces the spatial world of the recent papacy, using simpler methods to trace, for example, which places occur in the texts, their frequencies, their divisions (whether geopolitical or sacred), which places dominate, and so on. Furthermore, it examines how the geographical density has changed over time, that is, how many places (in total, or unique ones) are mentioned per document or per 1,000 words.
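The density measure mentioned above (total and unique place mentions per 1,000 words) is straightforward; a minimal sketch with an invented sample document:

```python
def density(words_in_doc: int, place_mentions: list[str]):
    """Total and unique place mentions per 1,000 words of a document."""
    per_1000 = 1000 / words_in_doc
    total = len(place_mentions) * per_1000
    unique = len(set(place_mentions)) * per_1000
    return round(total, 2), round(unique, 2)

# A hypothetical 2,000-word document mentioning four places:
print(density(2000, ["Rome", "Rome", "Jerusalem", "Assisi"]))  # (2.0, 1.5)
```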

Secondly, the analysis studies the clusters of “co-occurring” places, based on places mentioned in the same document. Since most individual papal texts are dedicated to a certain topic, one can assume that places in a document have something in common. The term frequency-inverse document frequency (tf-idf) weighting is used as a measure of how important a place is in a specific document, and this weight is used in the co-occurrence computation. This unfolds the latent geographical network, as it is articulated by the papacy, with centers and peripheries, and both sacred and geopolitical aspects.
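The tf-idf weighted co-occurrence could be sketched as below: each document is reduced to its list of place mentions, and a pair's weight is the sum, over documents, of the product of the two places' tf-idf weights. The toy corpus is invented, and the pair-weighting scheme is one plausible reading of the description, not necessarily the authors' exact formula.

```python
import math
from collections import Counter
from itertools import combinations

# Each document reduced to its place mentions (invented examples).
docs = [
    ["Rome", "Jerusalem", "Rome"],
    ["Rome", "Assisi"],
    ["Jerusalem", "Bethlehem", "Jerusalem"],
]

def tfidf(doc: list[str]) -> dict[str, float]:
    """tf-idf weight of each place within one document."""
    tf = Counter(doc)
    n_docs = len(docs)
    return {
        place: (count / len(doc))
        * math.log(n_docs / sum(place in d for d in docs))
        for place, count in tf.items()
    }

# Co-occurrence weight for a place pair: summed product of the two
# places' tf-idf weights in each document where both appear.
cooc = Counter()
for doc in docs:
    w = tfidf(doc)
    for a, b in combinations(sorted(w), 2):
        cooc[(a, b)] += w[a] * w[b]

print(cooc.most_common(2))
```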

Last but not least, this study tries to map the space of the divine, as expressed through Benedictus XVI and pope Francis, using word2vec, a method developed by a team at Google in 2013 to produce word embeddings (Mikolov et al., 2013). Simply put, the algorithm positions the vocabulary of a corpus in a high-dimensional vector space based on the assumption that “words which are similar in meaning occur in similar contexts” (Rubenstein & Goodenough, 1965: 627). This enables the use of basic numerical methods to compute word (dis-)similarities, to find clusters of similar words, or to create scales of how (subsets of) words relate to certain dichotomies. This study investigates dichotomies such as “Heaven” and “Hell”, “Earth” and “Paradise”, or “God” and “Satan”. Hence, the third part of the study also seeks to relate the earthly geography to the religious space as articulated by the papacy.
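One common way to build such a dichotomy scale, sketched below, is to project a word's vector onto the axis between the two pole vectors (here via cosine similarity with the difference vector). The three-dimensional toy vectors are invented; real embeddings would come from a trained word2vec model.

```python
import math

# Invented toy vectors standing in for real word2vec embeddings.
vectors = {
    "Heaven": [1.0, 0.2, 0.0],
    "Hell": [-1.0, 0.1, 0.0],
    "Jerusalem": [0.6, 0.3, 0.1],
}

def scale_position(word: str, pole_a: str, pole_b: str) -> float:
    """Cosine similarity between `word` and the axis pole_a - pole_b.
    Positive values lean toward pole_a, negative toward pole_b."""
    axis = [a - b for a, b in zip(vectors[pole_a], vectors[pole_b])]
    v = vectors[word]
    dot = sum(x * y for x, y in zip(v, axis))
    norm = math.sqrt(sum(x * x for x in v)) * math.sqrt(sum(x * x for x in axis))
    return dot / norm

print(round(scale_position("Jerusalem", "Heaven", "Hell"), 3))
```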


Agnew, J. (2010). Deus Vult: The Geopolitics of the Catholic Church. Geopolitics, 15(1), 39–61.

Barbato, M. (2012). Papal Diplomacy: The Holy See in World Politics. IPSA XXII World Conference of Political Science, 1–29.

Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363–370.

Florian, R., Ittycheriah, A., Jing, H. and Zhang, T. (2003) Named Entity Recognition through Classifier Combination. Proceedings of CoNLL-2003. Edmonton, Canada.

Hägg, G. (2007). Påvarna: två tusen år av makt och helighet. Stockholm: Wahlström & Widstrand.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space, 1–12.

Moretti, F. (1998). Atlas of the European Novel: 1800–1900. New York: Verso.

Rodriquez, K. J., Bryant, M., Blanke, T., & Luszczynska, M. (2012). Comparison of Named Entity Recognition tools for raw OCR text. Proceedings of KONVENS 2012 (LThist 2012 Workshop), 2012, 410–414.

Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.

Mähler-The World According to the Popes-145_a.pdf
Mähler-The World According to the Popes-145_c.pdf

11:30am - 11:45am
Short Paper (10+5min) [abstract]

Ownership and geography of books in mid-nineteenth century Iceland

Örn Hrafnkelsson

National and University Library of Iceland,

In October 1865, the national librarian, then the only employee of the National Library of Iceland (est. 1818), received permission from the bishop of Iceland to send a written request to all provosts around the country to carry out a detailed survey in their parishes of the ownership of old Icelandic books printed before 1816. The title page of each book on every farm was to be copied in full detail, with line-breaks and ornaments, number of printed pages, place of publication, etc.

The aim of this five-year project was to compile data for a detailed national bibliography and a list of Icelandic authors, and to build up a good collection of books in the library.

Many of the written reports have survived and are now in the library archive. In my paper, I will talk about these unused sources on book ownership on every farm in Iceland, how Icelandic book history can now be interpreted in a new and different way, and, most importantly, how we are using these sources together with other data to show how ownership of books in the nineteenth century varied, for example, across different parts of the country: which books, authors or titles were more popular than others, how many copies have survived, whether books related to the Icelandic Enlightenment had any success, whether books of some particular genres had a better chance of survival than others, etc.

This is done by using several authority files that have been made in the library for other projects and are in TEI P5 XML. Firstly, a detailed historical bibliography of Icelandic books from 1534 to 1844 and secondly a list of all farms in Iceland with GPS coordinates.
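Linking the TEI P5 bibliography with the geocoded farm list could be sketched as below. The sample record, element choice, and coordinates are all invented for illustration; the actual authority files follow their own richer TEI schemas.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# An invented, minimal TEI-style bibliography entry naming a farm.
record = ET.fromstring(
    '<bibl xmlns="http://www.tei-c.org/ns/1.0">'
    '<title>Sample Postil</title>'
    '<placeName>Skálholt</placeName>'
    '</bibl>'
)

# Invented excerpt of the farm list with GPS coordinates.
farms = {"Skálholt": (64.127, -20.528)}

place = record.find(f"{TEI}placeName").text
title = record.find(f"{TEI}title").text
lat, lon = farms[place]
print(f"{title} @ {place}: {lat}, {lon}")
```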

I will also elaborate on how this project on the ownership and geography of books can be developed further, and how the data can be of use to others. One aspect of my talk is the cooperation between librarians, academics and IT professionals, and how unrelated sources can be linked together to bring out new knowledge and interpret history.

Project website:

Hrafnkelsson-Ownership and geography of books in mid-nineteenth century Iceland-109_a.docx

11:45am - 12:00pm
Distinguished Short Paper (10+5min) [publication ready]

Icelandic Scribes: Results of a 2-Year Project

Sheryl McDonald Werronen

University of Copenhagen,

This paper contributes to the conference theme of History and introduces an online catalogue that recreates an early modern library: the main digital output of the author’s individual research project “Icelandic Scribes” (2016–2018 at the University of Copenhagen). The project has investigated the patronage of manuscripts by Icelander Magnús Jónsson í Vigur (1637–1702), his network of scribes and their working practices, and the significance of the library of hand-written books that he accumulated during his lifetime, in the region of Iceland called the Westfjords. The online catalogue is meant to be a digital resource that reunites this library virtually, gives detailed descriptions of the manuscripts, and highlights the collection’s rich store of texts and the individuals behind their creation. The paper also explores some of the challenges of integrating new data produced by this and other small projects like it with existing online resources in the field of Old Norse-Icelandic studies.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 654825.

McDonald Werronen-Icelandic Scribes-159_a.pdf
McDonald Werronen-Icelandic Scribes-159_c.pdf
11:00am - 12:00pmF-P674-1: Teaching and Learning the Digital
Session Chair: Maija Paavolainen
11:00am - 11:15am
Short Paper (10+5min) [publication ready]

Creative Coding at the arts and crafts school Robotti (Käsityökoulu Robotti)

Tomi Dufva

Aalto-University, the school of Arts, Design and Architecture,

The increasing use of digital technologies presents a new set of challenges that, beyond key economic and societal questions, also concern education and culture. On the other hand, instead of a challenge, the digitalization of our environment can also be seen as a new material and a new medium for art and art education. This article suggests that both a better understanding of digital structures and the ability for greater self-expression through digital technology are possible using creative coding as a teaching method.

This article focuses on Käsityökoulu Robotti, a type of hacker space that offers children teaching about art and technology. Käsityökoulu Robotti is situated within the contexts of art education, the maker movement, critical technology education, and media art. Art education is essential to Käsityökoulu Robotti in a bilateral sense, i.e., to discover in what ways art can be used to create a clearer understanding of technology and, at the same time, teach children how to use new technological tools as a way to greater self-expression. These questions are indeed intertwined, as digital technology, like code, can be a substantial way to express oneself in ways that otherwise could not be expressed. Further, using artistic approaches, such as creative coding, can generate more tangible knowledge of digital technology. A deeper understanding of digital technology is also critical when dealing with the ever-increasing digitalization of our society, as it helps society to understand the digital structures that underlie our continually expanding digital world.

This article examines how creative coding works as a teaching method in Käsityökoulu Robotti to promote both artistic expression and a critical understanding of technology. Further still, creative coding is a tool for bridging the gap between the maker movement, critical thinking and art practices, and for bringing each into sharper focus. This discussion is the outcome of an ethnographic research project at Käsityökoulu Robotti.

Dufva-Creative Coding at the arts and crafts school Robotti-101_a.pdf

11:15am - 11:30am
Distinguished Short Paper (10+5min) [abstract]

A long way? Introducing digitized historic newspapers in school, a case study from Finland

Inés Matres

University of Helsinki

During 2016/17 two Finnish newspapers, from their first issue to their last, were made available to schools in eastern Finland through the digital collections of the National Library of Finland. This paper presents the case study of one upper-secondary class making use of these materials. Before having access to these newspapers, the teachers in the school in question had little awareness of what this digital library contained. The initial research questions of this paper are whether digitized historic newspapers can be used by school communities, and what practices they enable. Subsequently, the paper explores how these practices relate to teachers’ habits and to the wider concept of literacy, that is, the knowledge and skills students can acquire using these materials. To examine the significance of historic newspapers in the context of their use today, I rely on the concept of ‘practice’ defined by cultural theorist Andreas Reckwitz as the “use of things that ‘mould’ activities, understandings and knowledge”.

To correctly assess practice, I approached this research through ethnographic methods, constructing the inquiry with participants in the research: teachers, students and the people involved in facilitating the materials. During 2016, I conducted eight in-depth interviews with teachers about their habits, organized a focus group with a further 15 teachers to brainstorm activities using historic newspapers, and collaborated closely with one language and literature teacher, who implemented the materials in her class right away. Observing her students work and hearing their presentations, motivations, and opinions about the materials showed how students explored the historical background of their existing personal, school-related and even professional interests. In addition to the students’ projects, I also collected their newspaper clippings and logs of their searches in the digital library. These digital research assets revealed how the digital library that contains the historic newspapers influenced the students’ freedom to choose a topic to investigate and their capacity to ‘go deep’ in their research.

The findings of this case study build upon, and extend, previous research about how digitized historical sources contribute to upper-secondary education. The way students used historical newspapers revealed similarities with activities involving contemporary newspapers, as described by the teachers who participated in this study. Additionally, both the historicity and the form of presentation of newspapers in a digital library confer unique attributes upon these materials: they allow students to explore the historical background of their research interests, discover change across time, verbalize their research ideas in a concrete manner, and train their skills in distant and close reading to manage large amounts of digital content. In addition to these positive attributes, which connect with learning goals set by teachers, students also tested the limits of these materials. The lack of metadata in articles or images, the absence of colour in materials that originally have it, and the need for students to be mindful of how language has changed since the publication of the newspapers are constraints that distinguish digital libraries from resources, such as web browsers and news sites, that are more familiar to students. Being aware of these positive and negative affordances, common to digital libraries containing historic newspapers and other historical sources, can support teachers in providing their students with effective guidelines when using such materials.

This use case demonstrates that digitized historical sources in education can do more than simply enable students to “follow the steps of contemporary historians”, as research has previously established. These materials could also occupy a place between history and media education. The objective of media education in school – regardless of the technological underpinnings of a single medium, which change rapidly in this digital age – is to enable students to reflect on the processes of media consumption and production. The contribution of digitized historical newspapers to this subject is acquainting students with processes of media preservation and heritage. However, it may still be a long way before teachers adopt these aspects in their plans. It is necessary to acknowledge the trajectory and the agents involved, since the 1960s, in the work of introducing newspapers into education. This task consisted not only of facilitating access to newspapers, but also of developing teaching plans and advocating for a common understanding and presence of media education in schools.

In addition to uncovering an aspect of digital cultural heritage that is relevant for the school community today, another aim of this paper is to raise awareness among the cultural heritage community, especially national libraries, about the diversity in the uses and users of their collections, especially in a time when the large-scale digitization of special collections is generalizing access to materials traditionally considered for academic research.

Selected bibliography:

Buckingham, D. (2003). Media education: literacy, learning, and contemporary culture. Polity Press.

Gooding, P. (2016). Historic Newspapers in the Digital Age: ‘Search All About It!’ Routledge.

Lévesque, S. (2006). Discovering the Past: Engaging Canadian Students in Digital History. Canadian Social Studies, 40(1).

Martens, H. (2010). Evaluating Media Literacy Education: Concepts, Theories and Future Directions. Journal of Media Literacy Education, 2(1).

Nygren, T. (2015). Students Writing History Using Traditional and Digital Archives. Human IT, 12(3), 78–116.

Reckwitz, A. (2002). Toward a Theory of Social Practices: A Development in Culturalist Theorizing. European Journal of Social Theory, 5(2), 243–263.

Matres-A long way Introducing digitized historic newspapers-205_a.pdf

11:30am - 11:45am
Short Paper (10+5min) [abstract]

“See me! Not my gender, race, or social class”: Combating stereotyping and prejudice by mixing digitally manipulated experience with classroom debriefing.

Anders Steinvall1, Mats Deutschmann2, Mattias Lindvall-Östling2, Jon Svensson3, Roger Mähler3

1Department of Language Studies, Umeå University, Sweden; 2School of Humanities, Education and Social Sciences, Örebro University, Sweden; 3Humlab, Umeå University, Sweden


Not only does stereotyping, based on various social categories such as age, social class, ethnicity, sexuality, regional affiliation, and gender, serve to simplify how we perceive and process information about individuals (Talbot et al. 2003: 468), it also builds up expectations about how we act. If we recognise social identity as an ongoing construct, and something that is renegotiated during every meeting between humans (Crawford 1995), it is reasonable to speculate that stereotypic expectations will affect the choices we make when interacting with another individual. Thus, stereotyping may form the basis for the negotiation of social identity on the micro level. For example, research has shown that white American respondents react with hostile facial expressions or tone of voice when confronted with African American faces, which is likely to elicit the same behaviour in response, but, as Bargh et al. point out (1996: 242), “because one is not aware of one's own role in provoking it, one may attribute it to the stereotyped group member (and, hence, the group)”. Language is a key element in this process. An awareness of such phenomena, and of how we may unknowingly be affected by them, is, we would argue, essential for all professions where human interaction is in focus (psychologists, teachers, social workers, health workers etc.).

RAVE (Raising Awareness through Virtual Experiencing), funded by the Swedish Research Council, aims to explore and develop innovative pedagogical methods for raising subjects’ awareness of their own linguistic stereotyping, biases and prejudices, and to systematically explore ways of testing the efficiency of these methods. The main approach is the use of digital matched-guise testing techniques, with the ultimate goal of creating an online, packaged and battle-tested method available for public use.

We are confident that there is a place for this, in our view, timely product. There can be little doubt that the zeitgeist of the first two decades of the 21st century has swung the pendulum in a direction where it has become apparent that the role of the Humanities should be central. In times when unscrupulous politicians take every chance to draw on any prejudice and stereotypical assumptions about Others, be they related to gender, ethnicity or sexuality, it is the role of the Humanities to hold up a mirror and let us see ourselves for what we are. This is precisely the aim of the RAVE project.

In line with this thinking, open access to our materials and methods is of primary importance. Here our ambition is not only to provide tested sample cases for open access use, but also to provide clear directives on how these have been produced so that new cases, based on our methods, can be created. This includes clear guidelines as to what important criteria need to be taken into account when so doing, so that our methodology is disseminated openly and in such a fashion that it becomes adaptable to new contexts.


The RAVE method at its core relies on a treatment session where two groups of test subjects (i.e. students) are each exposed to one of two different versions of the same scripted dialogue. The two versions differ only with respect to the perceived gender of the characters, whereas the scripted properties remain constant. In one version, for example, one participant, “Terry”, may sound like a man, while in the other recording this character has been manipulated for pitch and timbre to sound like a woman. After the exposure, the subjects are presented with a survey where they are asked to respond to questions related to the linguistic behaviour and character traits of one of the interlocutors. The responses of the two sub-groups are then compared and followed up in a debriefing session, where issues such as stereotypical effects are discussed.

The two property-bent versions are based on a single recording, and the switch of the property (for instance, gender) is done using the digital methods described below. The reason for this procedure is to minimize the number of uncontrolled variables that could affect the outcome of the experiment. It is a very difficult, if not impossible, task to transform the identity-related aspects of a voice recording, such as gender or accent, while maintaining a natural voice: a voice that is opposite in the specific aspect but equivalent in all other aspects, without changing other properties in the process or introducing artifacts.

Accordingly, the RAVE method does not strive for perfection, but focuses on achieving a perceived credibility of the scripted dialogue. However, the base recording is produced at high quality to provide the best possible conditions for the digital manipulation. For instance, the dialogue between the two speakers is recorded on separate tracks so as to keep the voices isolated.

The digital manipulation is done with the Praat software (Boersma & Weenink, 2013). Formants, pitch range and pitch median are manipulated for gender switching using standard offsets and are then adapted to the individual characteristics of the voices. Several versions of the manipulated dialogue are produced and evaluated by a test group via an online survey. Based on the survey results, the version with the highest quality is selected. This manipulated dialogue needs further framing to reach a sufficient level of credibility.
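The parameter bookkeeping behind such a switch can be sketched like this. Praat’s built-in “Change gender” command takes, among other things, a formant shift ratio, a new pitch median and a pitch range factor; the median and shift values below are generic ballpark figures for male and female voices, not the RAVE project’s actual settings:

```python
def gender_switch_params(source_median_hz, target_median_hz,
                         formant_shift=1.15, range_factor=1.0):
    """Collect the offsets a Praat-style gender manipulation needs.

    The defaults are illustrative standard offsets; in practice they
    are adapted to the individual characteristics of each voice.
    """
    return {
        "formant_shift_ratio": formant_shift,   # scales vocal-tract resonances
        "new_pitch_median_hz": target_median_hz,
        "pitch_range_factor": range_factor,     # 1.0 keeps range unchanged
        "median_ratio": target_median_hz / source_median_hz,
    }

# Example: shifting a voice with a 120 Hz pitch median toward a
# typical female median of around 210 Hz.
params = gender_switch_params(120.0, 210.0)
print(params["median_ratio"])  # 1.75
```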

How the dialogue is framed for the specific target context, and how it is packaged and introduced, is of critical importance. Various techniques, for instance audiovisual cues, are used to distract the test subjects from the “artificial feeling”, as well as to reinforce the desired target property. We add various auditory and visual distractions that lessen the listeners’ focus on the current speaker, such as background voices simulating a dialogue taking place in a café, traffic noise, or scrambling techniques simulating, for instance, a low-quality phone or Skype call.

On this account, the RAVE method includes a procedure to evaluate the overall (perceived) quality and credibility of a specific case setup. This evaluation is implemented by exposing a number of pre-test subjects to the packaged dialogue (in a set-up comparable to the target context). After the exposure, the pre-test subjects respond to a survey designed to measure the combined impression of aspects such as the scripted dialogue, the selected narrators, the voices, the overall set-up and the contextual framing.

The produced dialogues and accompanying response surveys are turned into a single online package using the Storyline program. The single entry point to the package makes the process of collecting anonymous participant responses more fail-safe and easier to carry out.

The whole package is produced for a “bring your own device” set-up, where the participants use their own smartphones, tablets or laptops to take part in the experiment. The choice of a single online point of entry adapted to various kinds of devices was made to facilitate experiment participation and the recording of results. The results from the experiment are then collected by the teacher and discussed with the students at an ensuing debriefing seminar.


At this stage, we have conducted experiments using the RAVE method with different groups of respondents, ranging from teacher trainees, psychology students, sociology students and active teachers to the public at large, in Sweden and elsewhere. Because some of the experiments have been carried out in other cultural contexts (in the Seychelles, in particular), we have obtained results that enable cross-cultural comparisons.

All trials conducted addressing gender stereotyping have supported our hypothesis that linguistic stereotyping acts as a filter. In trials conducted with teacher trainees in Sweden (n = 61), we could show that respondents who listened to the male guise overestimated stereotypical masculine conversational features such as how often the speaker interrupted, how much floor space ‘he’ occupied, and how often ‘he’ contradicted his counterpart. On the other hand, features such as signalling interest and being sympathetic were overestimated by the respondents when listening to the female guise.

Results from the Seychelles have strengthened our hypothesis. Surveys investigating linguistic features associated with gender showed that respondents’ (n=46) linguistic gender stereotyping was quite different from that of Swedish respondents. For example, the results from the Seychelles trials showed that floor space and the number of interruptions made were overestimated by the respondents listening to the female guise, quite unlike the Swedish respondents, but still in line with our hypothesis.

Trials with psychology students (n=101) yielded similar results. In experiments where students were asked to rate a case character’s (‘Kim’) personality traits and social behaviour, our findings show that the male version of Kim was deemed more unfriendly and somewhat careless compared to the female version of Kim, who was regarded as more friendly and careful. Again, this shows that respondents overestimate aspects that confirm their stereotypical preconceptions.


The underlying pedagogical idea for the set-up is to confront students and other participants with their own stereotypical assumptions. In our experience, discussing stereotypes with psychology and teacher training students does not give rise to the degree of self-reflection we would like. This is what we wanted to remedy. With the method described here, where the dialogues are identical except for the manipulation in terms of pitch and timbre, perceived differences in personality and social behaviour can only be explained as residing in the beholder.

A debriefing seminar after the exposure gave the students an opportunity to reflect on the results of the experiment. They were divided into mixed groups in which half the students had listened and responded to the male guise, and the other half to the female guise. Since any difference between the groups was the result of the participants’ own ratings and reactions to the conversation, there was something very concrete and urgent to discuss, which affected engagement positively. Clearly, the concrete and experiential nature of this method made the students analyze the topic, their own answers, the reasons for these and, ultimately, themselves in greater detail and depth in order to understand the results of the experiment and relate them to earlier research findings. Judging from these impressions, the method is clearly very effective.

Answers from a survey with psychology students (n=101) after the debriefing corroborate this impression. In response to the question “What was your general experience of the experiment that you have just partaken in? Did you learn anything new?”, a clear majority of the students, 76%, responded positively. Moreover, close to half of these answers explicitly expressed self-reflective learning. Of the remaining comments, 15% were neutral and 9% expressed critical feedback.

Examples of responses expressing self-reflection include: “… It gave me food for thought. Even though I believed myself to be relatively free of prejudice I can't help but wonder if I make assumptions about personalities merely from the tone of someone's voice.” And: “I learned some of my own preconceptions and prejudices that I didn't know I had.” An example of a positive comment with no self-reflective element is: “Female and male stereotypes were stronger than I expected, even if only influenced by the voice”.

The number of negative comments was small. The negative comments generally took the position that the results were expected so there was nothing to discuss, or that the student had figured out the set-up from the beginning. A few negative comments revealed that the political dimension of the subject of gender could influence responses. These students would probably react in the same way to a traditional seminar. We haven’t been able to reach everyone … yet ...

Steinvall-“See me! Not my gender, race, or social class”-194_a.pdf

11:45am - 12:00pm
Short Paper (10+5min) [abstract]

Digital archives and the learning processes of performance art

Tero Nauha

University of Helsinki

In this presentation, the process of learning performance art is articulated in the context of the changes that digital archives have brought about since the early 1990s. It is part of my postdoctoral work, artistic research on the conjunctions between divergent gestures of thought and performance, carried out in the research project How to Do Things with Performance?, funded by the Academy of Finland.

Since performance art is a form of ‘live art’, it would be easy to assume that its learning processes are mostly based on physical practice and repetition. However, in my view, performance art is a significant line of flight from the conceptual art of the 1960s and ’70s, alongside video art. The pedagogy of performance art has therefore been tightly connected with the development of media, from the collective use of Portapak video cameras to the recent development of VR-attributed performances, choreographic archive methods by figures such as William Forsythe, and digital journals of artistic research such as Ruukku and the Journal for Artistic Research (JAR).

This presentation speculates on the transformation of performance art practices now that a vast amount of historical archive material has become accessible to artists, regardless of the physical location of the student or artist. At the same time, social media affects the peer groups of artists. My point of view is not based on statistics, but on observations I have gathered from teaching performance art, as well as from supervising MA- and PhD-level research projects.

My argument is that the emphasis on learning in performative practices is not based on talent; rather, it is general and generic, with access to networks and digital archives serving as a tool for social forms of organization, and for speculation on what performance art is. Finally, I argue that digital virtuality does not conflate with the concept of the virtual. Here my argument leans on the philosophical thought on actualization and the virtual of Henri Bergson, Gilles Deleuze and Alexander R. Galloway. Access to digital archives in the learning process rests on the premise that artistic practices are already, explicitly, actualizations of the virtual. Digitalization is a modality of this process.

The learning process of performance art proceeds not through resemblance, but through doing with someone or something else, developed in heterogeneity with digital virtualities.

Nauha-Digital archives and the learning processes of performance art-152_a.pdf
11:00am - 12:00pmF-TC-1: Data, Activism and Transgression
Session Chair: Marianne Ping Huang
Think Corner 
11:00am - 11:30am
Long Paper (20+10min) [abstract]

Shaping data futures: Towards non-data-centric data activism

Minna Ruckenstein1, Tuukka Lehtiniemi2

1Consumer Society Research Centre, University of Helsinki, Finland; 2HIIT, Aalto University

The social science debate that attends to the exploitative forces of the quantification of aspects of life previously experienced in qualitative form, recognising the ubiquitous forms of datafied power and domination, is by now an established perspective from which to question datafication and algorithmic control (Ruckenstein and Schüll, 2017). Drawing on critical political economy and neo-Foucauldian analyses, researchers have explored the effects of datafication (Mayer-Schönberger and Cukier, 2013; Van Dijck, 2014) on the economy, public life and self-understanding. Studies alert us to threats to privacy posed by “dataveillance” (Raley, 2013; Van Dijck, 2014): forms of surveillance distributed across multiple interested parties, including government agencies, insurance payers, operators, data aggregators, analytics companies, and individuals who provide information either knowingly or unintentionally when going online, using self-tracking devices, loyalty programs and credit cards. These “data traces” add to the data accumulated in databases, and personal data, meaning any data related to a person or resulting from a person’s actions, is utilized for business and societal purposes in an increasingly systematic manner (Van Dijck and Poell, 2016; Zuboff, 2015).

In this paper, we take an “activist stance”, aiming to contribute to the current criticism of datafication with a more participatory and collaborative approach offered by “data activism” (Baack 2015; Milan and van der Velden, 2016), and civic and political engagement spurred by datafication. The various data-driven initiatives currently under development suggest that the problematic aspects of datafication, including the tension between data openness and data ownership (Neff, 2013), the asymmetries in terms of data usage and distribution (Wilbanks and Topol, 2016; Kish and Topol, 2015) and the inadequacy of existing informed consent and privacy protections (Sharon, 2016) are by now not only well recognized, but they are generating new forms of civic and political engagement and activism. This calls for more debate on what these new forms of data activism are and how scholars in the humanities and social science communities can assess them.

By relying on approaches developed within the field of Techno-Anthropology (Børsen and Botin, 2013; Ruckenstein and Pantzar, 2015), which seeks to translate and mediate knowledge concerning complex technoscientific projects and aims, we positioned ourselves as “outside insiders” with regard to a data-centric initiative called MyData. In 2014, we became observers of and participants in MyData, which promotes the understanding that people benefit when they can control data gathering and analysis by public organizations and businesses and become more active data citizens and consumers. The high-level MyData vision, described in the MyData white paper written primarily by researchers at the Helsinki Institute for Information Technology and the Tampere University of Technology (Poikola et al., 2015), outlines an alternative future that transforms the ‘organisation-centric system’ into ‘a human-centric system’ that treats personal data as a resource that the individual can access, control, benefit from and learn from.

The paper discusses “our” data activism and the activism of technology developers, promoting and relying on two different kinds of “social imaginaries” (Taylor, 2004). By doing so, we open a perspective to data activism that highlights ideological and political underpinnings of contested social imaginaries and aims. Current data-driven initiatives tend to proceed with a social imaginary that treats data arrangements as solutions, or corrective measures addressing unsatisfactory developments. They advance a logic of an innovation culture, relying on the development of new technology structures and computationally intensive tools. This means that the data-driven initiatives rely on an engineering attitude that does not question the power of technological innovation for creating better societal solutions or, more broadly, the role of datafication in societal development. The main focus is on the correct positioning of technology: undesirable, or harmful developments need to be reversed, or redirected towards ethically more fair and responsible practices.

Since we do not possess impressive technology skills, or proficiency in legal and regulatory matters, which would have aligned us with innovation-driven data activism, our position in the technology-driven data activism scene is structurally fairly weak. Our data activism is informed by a sensitivity to questions of cultural change and the critical stance representative of social scientific inquiry, questioning the optimistic and future-oriented social imaginary of the technology developers. As will be discussed in our presentation, this means that our data activism is incompatible with that of the technology developers in a profound sense, which explains why our activist role was repeatedly reduced to viewing a stream of PowerPoint slides depicting databases and data flows. In terms of designing future data transfers and data flows, our social imaginary remained oddly irrelevant, intensifying the feeling that we were observing a moving target and that our task was simply to keep up, while the engineers were busy doing the real work of activists: developing approaches that give users more control over their personal data, such as the Kantara Initiative’s User-Managed Access (UMA) protocol, experimenting with Blockchain technologies for digital identities such as Sovrin, and learning about “Vendor Relationship Management” systems (see Belli et al., 2017).

From the outsider position, we started to craft a narrative about the MyData initiative that aligns with our social imaginary. We wanted to push the conversation further, beyond the usual technological, legal and policy frameworks, and suggest that with its techno-optimism the current MyData work might actually weaken data activism and public support for it. We turned to literary and scholarly sources with the aim of opening a critical, but hopefully also productive, conversation about MyData in order to offer ideas on how to promote socially more robust data activism. A seminal text that shares the aims of the MyData initiative is Autonomous Technology – Technics-out-of-Control as a Theme in Political Thought (1978) by Langdon Winner. Winner perceives the relationship between humans and technology in terms of Kantian autonomy: via an analysis of the interrelations of independence and dependence. The core ideas of the MyData vision have particular resonance with the way Winner (1978) considers “reverse adaptation”, wherein the human adapts to the power of the system and not the other way around.

In this paper, we first describe the MyData vision, as it has been presented by the activists, and situate it in the framework of technology critique and the current critique of digital culture and economy. Here, we demonstrate that the outsider position can, in fact, serve as a resource for a re-articulation of data activism. After this, we detail some further developments in the MyData scene and the possibilities that have opened up for dialogue and collaboration during our data activism journey. We end the discussion by noting that truly promoting societally beneficial data arrangements requires work to circumvent the individualistic and data-centric biases of initiatives such as MyData. We promote non-data-centric data activism that meshes critical thinking into the mundane realities of everyday practices and calls for historically informed and collectively oriented alternatives and action.

Overall, our goal is to demonstrate that with a focus on ordinary people, professionals and communities of practice, ethnographic methods and practice-based analysis can deepen understandings of datafication by revealing how data and its technologies are taken up, valued, enacted, and sometimes repurposed in ways that either do not comply with imposed data regimes or mobilize data in inventive ways (Nafus & Sherman, 2014). By learning about everyday data worlds and actual material data practices, we can strengthen the understanding of how data technologies could become part of promoting and enacting more responsible data futures. Paradoxically, in order to arrive at an understanding of how data initiatives support societally beneficial developments, non-data-centric data activism is called for. By aiming at non-data-centric data activism, we can continue to argue against triumphant data stories and technological solutionism in ways that are critical, but do not deny the possible value of digital data in future making. We will not try to protect ourselves against data forces, but act imaginatively with and within them to develop new concepts, frameworks and collaborations in order to better steer them.


Baack, S. 2015. Datafication and empowerment: How the open data movement re-articulates notions of democracy, participation, and journalism. Big Data & Society, Oct.

Belli, L., Schwartz, M., & Louzada, L. (2017). Selling your soul while negotiating the conditions: from notice and consent to data control by design. Health and Technology, 1-15.

Børsen, T. & Botin, L. (eds) (2013). What Is Techno-Anthropology? Aalborg, Denmark: Aalborg University Press.

Kish, L. J., & Topol, E. J. (2015). Unpatients: why patients should own their medical data. Nature biotechnology, 33(9), 921-924.

Mayer-Schönberger, V., and K. Cukier. (2013). Big data: a revolution that will transform how we live, work, and think. Boston: Houghton Mifflin Harcourt.

McQuillan, D. (2016). Algorithmic Paranoia and the Convivial Alternative. Big Data and Society 3(2).

McStay, Andrew (2013). Privacy and Philosophy: New Media and Affective Protocol. New York: Peter Lang.

Milan, S., & van der Velden, L. (2016). The alternative epistemologies of data activism. Digital Culture & Society, 2(2), 57-74.

Nafus, D. and Sherman, J. (2014). This One Does Not Go Up to 11: The Quantified Self Movement as an Alternative Big Data Practice. International Journal of Communication 8: 1784-1794.

Poikola, A.; Kuikkaniemi, K.; & Kuittinen, O. (2014). My Data – Johdatus ihmiskeskeiseen henkilötiedon hyödyntämiseen [‘My Data – Introduction to Human-centred Utilisation of Personal Data’]. Helsinki: Finnish Ministry of Transport and Communications.

Poikola, A.; Kuikkaniemi, K.; & Honko, H. (2015). MyData – a Nordic Model for Human-centered Personal Data Management and Processing. Helsinki: Finnish Ministry of Transport and Communications.

Raley, R. (2013). Dataveillance and countervailance. In L. Gitelman (Ed.), "Raw Data" Is an Oxymoron. Cambridge: MIT Press.

Ruckenstein, M. & Pantzar, M. (2015). Datafied life: Techno-anthropology as a site for exploration and experimentation. Techné: Research in Philosophy & Technology. 19(2), 191–210.

Ruckenstein, M., & Schüll, N. D. (2017). The Datafication of Health. Annual Review of Anthropology, (0).

Sharon, T. (2016) Self-Tracking for Health and the Quantified Self: Re-Articulating Autonomy, Solidarity, and Authenticity in an Age of Personalized Healthcare. Philosophy & Technology, 1-29.

Taylor, C. (2004). Modern social imaginaries. Duke University Press.

Van Dijck, J. (2014). Datafication, dataism and dataveillance: Big data between scientific paradigm and ideology. Surveillance and Society 12(2): 197–208

Van Dijck, J., & Poell, T. (2016) Understanding the promises and premises of online health platforms. Big Data & Society, 3(1), 1-11.

Wilbanks, J. T., & Topol, E. J. (2016). Stop the privatization of health data. Nature, 535, 345-348.

Winner, L. (1978). Autonomous Technology – Technics-out-of-Control As a Theme in Political Thought. Cambridge, Massachusetts, & London: The MIT Press.

Zuboff, S. (2015). Big Other: Surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30, 75–89.

Ruckenstein-Shaping data futures-241_a.pdf

11:30am - 11:45am
Short Paper (10+5min) [publication ready]

Digitalisation of Consumption and Digital Humanities - Development Trajectories and Challenges for the Future

Toni Ryynänen, Torsti Hyyryläinen

University of Helsinki, Ruralia Institute

Digitalisation transforms practically all areas of modern life: everything that can be digitalised will be. The everyday routines and consumption practices in particular are under continual change, and new digital products and services are introduced at an accelerating pace. The purpose of this article is twofold: first, to explore the influence of digitalisation on consumption, and second, to canvass the reasons for these digitalisation-driven transformations and their possible future progressions. The transformations are explored through recent consumer studies, and the future development is based on interpretations of digitalisation. Our article recounts that the digitalisation of consumption has resulted in new forms of e-commerce, changing consumer roles and digital virtual consumption. The reasons for these changes and the expected near-future progressions are based on assumptions drawn from data-driven, platform-based and disruption-generated visions. The challenges of combining consumption research and the digital humanities approach are discussed in the concluding section of the article.

Ryynänen-Digitalisation of Consumption and Digital Humanities-202_a.pdf
Ryynänen-Digitalisation of Consumption and Digital Humanities-202_c.pdf

11:45am - 12:00pm
Short Paper (10+5min) [abstract]

Its your data, but my algorithms

Tomi Dufva

Aalto-University, the school of Arts, Design and Architecture,

The world is increasingly digital, but the understanding of how the digital affects everyday life is often still confused. Digitalisation is sometimes optimistically thought of as a rescue from hardships, be they economic or even educational. On the other hand, digitalisation is seen negatively as something one simply cannot avoid. Digital technologies have replaced many of the tools previously used in work as well as in leisure. Furthermore, as David Berry has noted, digital technologies introduce an agency of their own into human processes. By manipulating data through algorithms and communicating not only with humans but with other devices as well, digital technology presents new kinds of challenges for society and the individual. These digital systems and data flows get their instructions from the code that runs on them. The underlying code is neither objective nor value-free; it carries its own biases, as well as those of programmers, software companies and larger cultural viewpoints. As such, digital technology affects the ways we structure and comprehend, or are even able to comprehend, the world around us.

This article looks at the surrounding digitality through an artistic research project. By using code not as a functional tool but, in a postmodern way, as a material for expression, the research focuses on how code as art can express the digital condition that might otherwise be difficult to put into words or comprehend in everyday life. The art project consists of a drawing robot controlled by an EEG headband that the visitor can wear. The headband allows the visitor to control the robot through the EEG readings it records. The visitor may thus get a feeling of being able to control the robot, but at the same time the robot interprets the readings through its own algorithms and thus controls the visitor's data.

The aim of this research project is to give perspectives on the everydayness of digitality. It questions how we comprehend the digital in everyday life and asks how we should embody digitality in the future. The benefit of artistic research is that it can broaden our conceptions of how we know, and as such deepen one's understanding of the complexities of the world. Furthermore, artistic research can open the research subjects up to alternative interpretations. This research project thus aims both to deepen the discussion of digitalisation and to broaden it to alternative understandings. Such alternative ways of seeing a phenomenon like digitality are essential to the ways the future is developed. The proposed research consists of both the theoretical text and the interactive artwork, which would be present at the conference.

Dufva-Its your data, but my algorithms-150_a.pdf
12:00pm - 12:45pmLunch + poster setup
Think Corner
12:45pm - 2:30pmPoster Slam (lunch continues), Poster Exhibition & Coffee
Session Chair: Annika Rockenberger
Think Corner 
Poster [abstract]

Sharing letters and art as digital cultural heritage, co-operation and basic research

Maria Elisabeth Stubb

Svenska litteratursällskapet i Finland,

Albert Edelfelts brev ( is a web publication developed at the Society of Swedish Literature in Finland. In co-operation with the Finnish National Gallery, we publish the letters of the Finnish artist Albert Edelfelt (1854–1905) combined with pictures of his artworks. In 2016, Albert Edelfelts brev received the State Award for dissemination of information. The co-operation between institutions and basic research on the material has enabled a unique reconstruction of Edelfelt’s artistry and his time, for the service of researchers and other users. I will present how we have done this and how we plan to develop the website further.

The website Albert Edelfelts brev was launched in September 2014, with a sample of Edelfelt’s letters and paintings. Our intention is to publish all the letters Albert Edelfelt wrote to his mother Alexandra (1833–1901). The collection consists of 1,310 letters that range over 30 years and cover most of Edelfelt’s adult life. The letters are in the care of the Society of Swedish Literature in Finland. We also have at our disposal close to 7,000 pictures of Edelfelt’s paintings and sketches in the care of the Finnish National Gallery.

In the context of digital humanities, the volume of the material at hand is manageable. However, for researchers who think they might have use for the material but are unsure of exactly where or what to look for, going through all the letters and pictures can be labour-intensive. We have combined professional expertise and basic research on the material with digital solutions to make it as easy as possible to take advantage of what the content can offer.

As editor of the web publication, I spend a considerable part of my work on basic research: identifying people and pinpointing the paintings and places that Edelfelt mentions in his letters. By linking the content of a letter to artworks, persons, places and subjects/reference words, users can easily navigate the material. Each letter, artwork and person has a page of its own. Places and subjects are also searchable and listed.

The letters are available as facsimile pictures of the handwritten pages. Each letter has a permanent web resource identifier (URN:NBN). In order to make it easier for users to decide if a letter is of interest, we have tagged subjects using reference words from ALLÄRS (common thesaurus in Swedish). We have also written abstracts of the content, divided them into separate “events” and tagged mentioned artworks, people and places to these events.
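The letter/event/tag model described above might be sketched roughly like this; the field names, identifier and sample values are invented for illustration and are not the publication’s actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One tagged 'event' from a letter abstract."""
    summary: str
    artworks: list = field(default_factory=list)
    persons: list = field(default_factory=list)
    places: list = field(default_factory=list)

@dataclass
class Letter:
    urn: str        # permanent web resource identifier (URN:NBN)
    date: str
    subjects: list = field(default_factory=list)  # ALLÄRS reference words
    events: list = field(default_factory=list)

# A hypothetical letter with one tagged event.
letter = Letter(
    urn="URN:NBN:example",
    date="1877-05-01",
    subjects=["konst", "resor"],
    events=[Event("Edelfelt describes work on a portrait",
                  artworks=["Portrait X"],
                  persons=["Alexandra Edelfelt"],
                  places=["Paris"])],
)
print(len(letter.events))  # 1
```

Linking every event to artworks, persons and places is what makes the one-click navigation between letter, artwork and person pages possible.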

Each artwork of Edelfelt has a page of its own. Here, users find a picture of the artwork (if available) and earlier sketches of the artwork (if available). By looking at the pictures, they can see how the working process of the painting has developed. Users can also follow the process through what Edelfelt writes in his letters. All the events from the letter abstracts that are tagged to the specific artwork are listed in chronological order on the artwork-page.

Persons tagged in the letter abstracts also have pages of their own. On a person-page, users find basic facts and links to other webpages with information about the person. Any events from the letter abstracts mentioning the person are listed as well. In other words, through a one-click-solution users can find an overview on everything Edelfelt’s letters have to say about a specific person. Tagging persons to events has also made it possible to build graphs of a person’s social network; based on how many times other persons are tagged to the same events as the specific person. There is a link to these graphs on every person-page.
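The co-tagging counts behind such network graphs can be computed as a simple pair count over events; this is a sketch of the idea, not the site’s actual code, and apart from Edelfelt’s mother the names are placeholders:

```python
from collections import Counter
from itertools import combinations

def co_occurrence_network(events):
    """Count how often pairs of persons are tagged to the same event.

    `events` is a list of person-name lists, one list per tagged event.
    The resulting edge weights can back a social-network graph.
    """
    edges = Counter()
    for persons in events:
        # sort so each unordered pair is counted under one key
        for a, b in combinations(sorted(set(persons)), 2):
            edges[(a, b)] += 1
    return edges

events = [
    ["Alexandra Edelfelt", "Person B"],
    ["Alexandra Edelfelt", "Person B", "Person C"],
    ["Person C"],
]
net = co_occurrence_network(events)
print(net[("Alexandra Edelfelt", "Person B")])  # 2
```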

Apart from researchers with a direct interest in the material, we have also wanted to open up this cultural heritage to a broader public and group of users. Each month the editorial staff writes a blog post on SLS-bloggen. Albert Edelfelts brev also has a profile on Facebook, where we post excerpts of letters on the same date as Edelfelt wrote the original letter. In this way we hope to give the public an insight into the life of Edelfelt and into the material, and to involve them in the progress of the project.

The web publication is open access. The mix of different sources and the co-operation with other heritage institutions have led to a mix of licenses governing how users may copy and redistribute the published material. The Finnish National Gallery (FNG) owns the copyright on its pictures in the publication, and users must obtain permission from FNG to copy and redistribute that material. The artwork pages contain descriptions of the paintings written by the art historian Bertel Hintze, who published a catalogue of Edelfelt’s art in 1942. These texts are licensed under a Creative Commons Attribution-NoDerivs 4.0 International license (CC BY-ND 4.0). Edelfelt’s letters, as well as the texts and metadata produced by the editorial staff at the Society of Swedish Literature in Finland, carry a Creative Commons CC0 1.0 Universal license. Data under a Creative Commons license is also freely available as open data through a REST API.

In the future, we would like to establish a common practice for user rights, ideally so that all the material would carry the same license. We intend to invite other institutions holding artworks by Edelfelt to co-operate, offering the same kind of partnership as the web publication has with the Finnish National Gallery. In this way, we are striving towards as complete a site as possible for the artworks of Edelfelt.

Albert Edelfelt is of national interest, and his letters, which he mostly wrote during his stays abroad, contain information of international interest. Therefore, we plan to offer the metadata, and at least some of the source material, in Finnish and English translations. So far, the letters are only available as facsimiles. The development of transcription programs for handwritten text makes it likely that we could in the future include transcriptions of the letters in the web publication. Linguists in particular have an interest in searchable letter transcriptions for their research, and the transcriptions would also help users who might have problems reading the handwritten text.

Stubb-Shearing letters and art as digital cultural heritage, co-operation and basic research-182_a.docx
Stubb-Shearing letters and art as digital cultural heritage, co-operation and basic research-182_c.pdf

Poster [abstract]

Metadata Analysis and Text Reuse Detection: Reassessing public discourse in Finland through newspapers and journals 1771–1917

Filip Ginter1, Antti Kanner2, Leo Lahti1, Jani Marjanen2, Eetu Mäkelä2, Asko Nivala1, Heli Rantala1, Hannu Salmi1, Reetta Sippola1, Mikko Tolonen2, Ville Vaara2, Aleksi Vesanto2

1University of Turku; 2University of Helsinki

During the period 1771–1917, newspapers developed into a mass medium in the Grand Duchy of Finland. This happened within two different imperial configurations (Sweden until 1809 and Russia 1809–1917) and in two main languages, Swedish and Finnish. The Computational History and the Transformation of Public Discourse in Finland, 1640–1910 (COMHIS) project studies the transformation of public discourse in Europe and in Finland via an innovative combination of original data, state-of-the-art quantitative methods that have not previously been applied in this context, and an open source collaboration model.

In this study the project combines statistical analysis of newspaper metadata with analysis of text reuse within the papers to trace the expansion of, and exchange between, Finnish newspapers published during the long nineteenth century. The analysis is based on the metadata and content of digitized Finnish newspapers published by the National Library of Finland. The dataset includes the full text of all newspapers and most periodicals published in Finland between 1771 and 1920. The analysis of metadata builds on data harmonization and enrichment, extracting information on columns, type sets, publication frequencies and circulation records from the full-text files or outside sources. Our analysis of text reuse is based on a modified version of the Basic Local Alignment Search Tool (BLAST), an algorithm that detects similar sequences and was initially developed for fast alignment of biomolecular sequences such as DNA chains. We have further modified the algorithm to identify text reuse patterns. BLAST is robust to deviations in the text content, and as such is able to effectively circumvent errors or differences arising from optical character recognition (OCR).

By relating metadata on publication places, language, number of issues, number of words, size of papers, and publishers, and comparing it with the existing scholarship on newspaper history and censorship, the study provides a more accurate bird’s-eye view of newspaper publishing in Finland after 1771. By pinpointing key moments in the development of journalism, the study suggests that while the discussions in the public sphere were inherently bilingual, technological and journalistic developments advanced at different speeds in Swedish- and Finnish-language forums. It further assesses the development of the press in comparison with book production and periodicals, pointing towards a specialization of newspapers as a medium in the period after 1860. Of special interest is that the growth and specialization of the newspaper medium owed much to newspapers being established all over the country and thus becoming forums for local debates.

The existence of a medium encompassing the whole country was crucial to the birth of a national imaginary. Yet the national public sphere was not without regional intellectual asymmetries. This study traces these asymmetries by analysing text reuse in the whole newspaper corpus. It shows which papers and which cities functioned as “senders” and “receivers” in the public discourse of this period. It is furthermore essential that newspapers and periodicals had several functions throughout the period, and the role of the public sphere cannot be taken for granted. The analysis of text reuse further paints a picture of virality in newspaper publishing that was indicative of modern journalistic practices, but it also reveals the rapidly expanding capacity of the press. These findings can be further contrasted with other features commonly associated with the birth of modern journalism, such as publication frequency, page sizes and the typesetting of the papers.

All algorithms, software, and the text reuse database will be made openly available online and can be located through the project’s repositories. The results of the text reuse detection carried out with BLAST are stored in a database and will also be made available for exploration by other researchers.

Ginter-Metadata Analysis and Text Reuse Detection-156_a.pdf
Ginter-Metadata Analysis and Text Reuse Detection-156_c.pdf

Poster [abstract]

Oceanic Exchanges: Tracing Global Information Networks In Historical Newspaper Repositories, 1840-1914

Hannu Salmi, Mila Oiva, Asko Nivala, Otto Latva

University of Turku

Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840–1914 (OcEx) is an international and interdisciplinary project, funded through the Digging into Data – Transatlantic Platform, that studies how news spread globally in nineteenth-century newspapers. The project combines digitized newspapers from Europe, the US, Mexico, Australia, New Zealand, and the contemporary British and Dutch colonies around the world.

The project examines patterns of information flow, the spread of text reuse, and global conceptual changes across national, cultural and linguistic boundaries in nineteenth-century newspapers. It links into one whole the different newspaper corpora, which are scattered across national libraries and collections, described with various kinds of metadata, and printed in several languages.

The project proposes to present a poster at the Nordic Digital Humanities Conference 2018. The project started in June 2017, and the aim of the poster is to present its current status.

The research group members come from Finland, the US, the Netherlands, Germany, Mexico, and UK. OcEx’s participating institutions are Loughborough University, Northeastern University, North Carolina State University, Universität Stuttgart, Universidad Nacional Autónoma de México, University College London, University of Nebraska-Lincoln, University of Turku, and Utrecht University. The project’s 90 million newspaper pages come from Australia's Trove Newspapers, the British Newspapers Archive, Chronicling America (US), Europeana Newspapers, Hemeroteca Nacional Digital de México, National Library of Finland, National Library of the Netherlands (KB), the National Library of Wales, New Zealand’s PapersPast, and a strategic collaboration with Cengage Publishing, one of the leading commercial custodians of digitized newspapers.


Our team will hone computational tools, some developed in prior research by project partners and some novel, into a suite of openly available tools, data, and analyses that trace a broad range of language-related phenomena (including text reuse, translational shifts, and discursive changes). Analysing such parameters enables us to characterize “reception cultures,” “dissemination cultures,” and “reference cultures” in terms of asymmetrical flow patterns, or to analyse the relationships between reporting targeted at immigrant communities and that of their surrounding host countries.

OcEx will leverage existing relationships and agreements between its teams and data providers to connect disparate digital newspaper collections, opening new questions about historical globalism and modeling consortial approaches to transnational newspaper research. OcEx will take up challenging questions of historical information flow, including:

1. Which stories spread between nations and how quickly?

2. Which texts were translated and resonated across languages?

3. How did textual copying (reprinting) operate internationally compared to conceptual copying (idea spread)?

4. How did the migration of texts facilitate the circulation of knowledge, ideas, and concepts, and how were these ideas transformed as they moved from one Atlantic context to another?

5. How did geopolitical realities (e.g. economic integration, technology, migration, geopolitical power) influence the directionality of these transnational exchanges?

6. How does reporting in immigrant and ethnic communities differ from reporting in surrounding host countries?

7. Does the national organization of digitized newspaper archives artificially foreclose globally-oriented research questions and outcomes?


OcEx will develop a semantically interoperable knowledge structure, or ontology, for expressing thematic and textual connections among historical newspaper archives. Even with standards in place, digitization projects pursue differing approaches that pose challenges to integration or to particular levels of analysis. In most, for instance, generic identification of items within newspapers has not been pursued. In order to build an ontology, this project will draw on knowledge acquired by participating academic partners, such as the TimeCapsule project at Utrecht University, as well as on analytical software that has been tested and used by team members, such as viral text analysis. OcEx does not aim to create a totalizing research infrastructure but rather to expose the conditions under which researchers can work across collections, helping to guide future projects seeking to bridge national collections. This ontology will be established through comparative investigations of phenomena illustrating textual links: reprinting and topic dissemination. We have divided the tasks into six work packages:

WP1: Management

➢ create an international network of researchers to discuss issues of using and accessing newspaper repository data and combine expertise toward better development and management of such data;

➢ assemble a project advisory board, consisting of representatives of public and private data custodians and other critical stakeholders.

WP2: Assessment of Data and Metadata

➢ investigate and develop classifier models of the visual features of newspaper content and genres;

➢ create a corpus of annotations on clusters/passages that records relationships among textual versions.

WP3: Creating a Networked Ontology for Research

➢ create an ontology of genres, forms, and elements of texts to support that annotation;

➢ select and develop best practices based on available technology (semantic web standard RDF, linked data, SKOS, XML markup standards such as TEI).

WP4: Textual Migration and Viral Texts

➢ analyze text reuse across archives using statistical language models to detect clusters of reprinted passages;

➢ perform analyses of aggregate information flows within and across countries, regions, and publications;

➢ develop adaptive visualization methods for results.

WP5: Conceptual Migration and Translation Shifts

➢ perform scalable multilingual topic model inference across corpora to discern translations, shared topics, topic shifts, and concept drift within and across languages, using distributional analysis and (hierarchical) polylingual topic models;

➢ analyze migration and translation of ideas over regional and linguistic borders;

➢ develop adaptive visualization methods for the results.

WP6: Tools of Delivery/Dissemination

➢ validation of test results in scholarly contexts/test sessions at academic institutions;

➢ conduct analysis of the sensitivity of results to the availability of corpora in different languages and levels of access;

➢ share findings (data structures/availability/compatibility, user experiences) with institutional partners;

➢ package code, annotated data (where possible), and ontology for public release.

Salmi-Oceanic Exchanges-217_a.docx

Poster [abstract]

ArchiMob: A multidialectal corpus of Swiss German oral history interviews

Yves Scherrer1, Tanja Samardžić2

1University of Helsinki, Department of Digital Humanities; 2University of Zurich, CorpusLab, URPP Language and Space

Although dialect usage is prevalent in the German-speaking part of Switzerland, digital resources for dialectological and computational linguistic research are difficult to obtain. In this paper, we present a freely available corpus of spontaneous speech in various Swiss German dialects. It consists of transcriptions of video interviews with contemporary witnesses of the Second World War period in Switzerland. These recordings were produced by an association of Swiss historians called Archimob about 20 years ago. More than 500 informants, stemming from all linguistic regions of Switzerland (German, French and Italian) and representing both genders, different social backgrounds, and different political views, were interviewed. Each interview is 1 to 2 hours long. In collaboration with the University of Zurich, we have selected, processed and analyzed a subset of 43 interviews in different Swiss German dialects.

The goal of this contribution is twofold. First, we describe how the documents were transcribed, segmented and aligned with the audio source, and how we make the data available through specifically adapted corpus query engines. We also provide an additional normalization layer in order to reduce the different types of variation (dialectal, speaker-specific and transcriber-specific) present in the transcriptions. We formalize normalization as a machine translation task, obtaining up to 90% accuracy (Scherrer & Ljubešić 2016).

Second, we show through some examples how the ArchiMob resource can shed new light on research questions from digital humanities in general and from dialectology and history in particular:

• Thanks to the normalization layer, dialect differences can be identified and compared with existing dialectological knowledge.

• Using language modelling, another technique borrowed from language technology, we can compute distances between texts. These distance measures allow us to identify the dialect of unknown utterances (Zampieri et al. 2017), localize transcriber effects and obtain a generic picture of the Swiss German dialect landscape.

• Departing from the purely formal analysis of the transcriptions for dialectological purposes, we can apply methods such as collocation analysis to investigate the content of the interviews. By identifying the key concepts and events referred to in the interviews, we can assess how the different informants perceive and describe the same time period.


Poster [abstract]

Serious gaming to support stakeholder participation and analysis in Nordic climate adaptation research

Tina-Simone Neset1, Sirkku Juhola2, Therese Asplund1, Janina Käyhkö2, Carlo Navarra1

1Linköping University; 2Helsinki University


While climate change adaptation research in the Nordic context has advanced significantly in recent years, we still lack a thorough discussion of maladaptation, i.e. the unintended negative outcomes of implemented adaptation measures. In order to identify and assess examples of maladaptation in the agricultural sector, we developed a novel methodology integrating visualization, participatory methods and serious gaming. This enables research and policy analysis of trade-offs between mitigation and adaptation options, as well as between alternative adaptation options, together with stakeholders in the agricultural sector. Stakeholders from the agricultural sector in Sweden and Finland have been engaged in exploring potential maladaptive outcomes of climate adaptation measures by means of a serious game on maladaptation in Nordic agriculture, and have discussed their relevance and related trade-offs.

The Game

The Maladaptation Game is designed as a single player game. It is web-based and allows a moderator to collect the settings and results for each player involved in a session, store these for analysis, and display these results on a ‘moderator screen’. The game is designed for agricultural stakeholders in the Nordic countries, and requires some prior understanding of the challenges that climate change can impose on Nordic agriculture as well as the scope and function of adaptation measures to address these challenges.

The gameplay consists of four challenges, each involving multiple steps. At the start of the game, the player is equipped with a limited number of coins, which decrease with each measure selected. The player thus has to weigh the risk and potential negative effects of a selected measure against its cost. The player faces four different climate-related challenges – increased precipitation, drought, increased occurrence of pests and weeds, and a prolonged growing season – all of which are relevant to Nordic agriculture. The player selects one challenge at a time. Each challenge has to be addressed, and once a challenge has been concluded, the player cannot return and revise the selection. When entering a challenge (e.g. precipitation), possible adaptation measures that could address it in an agricultural context are displayed as illustrated cards on the game interface. Each card can be turned over for more information, i.e. a descriptive text and the related costs. The player can explore all cards before selecting one. The selected adaptation measure then leads to a potential maladaptive outcome, which is again displayed as an illustrated card with an explanatory text on the back. The player has to decide whether to reject or accept this potential negative outcome. If the maladaptive outcome is rejected, the player returns to the previous view, where all adaptation measures for the current challenge are displayed, and can select another measure and again decide whether to accept or reject the potential negative outcome presented for it. In order to complete a challenge, one adaptation measure, with its related negative outcome, has to be accepted.
After completing a challenge, the player returns to the entry page, where, in addition to the overview of all challenges, a small scoreboard summarizes the selections made and displays the updated number of coins as well as a score of maladaptation points. These points represent the negative maladaptation score of the selected measures and are not known to the player prior to making the decision.

The game continues until selections have been made for all four challenges. At the end of the game, the player has an updated scoreboard with three main elements: the summary of the selections made for each challenge, the remaining number of coins, and the total negative maladaptation score. The scoreboards of all players involved in a session then appear on the moderator screen. This setup allows the individual player to compare his or her pathways and results with those of other players. The key feature of the game is hence the stimulation of discussion and reflection concerning adaptation measures and their potential negative outcomes, both with regard to adding knowledge about adaptation measures and their impact, and with regard to the threshold at which an outcome is considered maladaptive, i.e. what trade-offs are made within agricultural climate adaptation.
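The decision loop described above can be sketched in a few lines of code (a hypothetical illustration; the class and method names, coin amounts and point values are invented and do not come from the actual game):

```python
class MaladaptationGame:
    """Minimal model of the described gameplay state."""

    def __init__(self, coins=20):
        self.coins = coins      # budget that shrinks with each accepted measure
        self.malscore = 0       # maladaptation points, hidden until acceptance
        self.completed = []     # concluded challenges cannot be revisited

    def accept_measure(self, challenge, cost, malpoints):
        """Accept an adaptation measure and its potential negative outcome."""
        if challenge in self.completed:
            raise ValueError("challenge already concluded")
        self.coins -= cost
        self.malscore += malpoints
        self.completed.append(challenge)
        return self.coins, self.malscore

game = MaladaptationGame(coins=20)
game.accept_measure("drought", cost=5, malpoints=2)
game.accept_measure("precipitation", cost=3, malpoints=1)
```

The essential design point is that `malpoints` is only added to the visible score once a measure is accepted, mirroring the rule that the player does not know a measure's maladaptation score before committing to it.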

Preliminary conclusions from the visualization supported gaming workshops

During autumn 2016, eight gaming workshops were held in Sweden and Finland. These workshops were designed as visualization supported focus groups, allowing for some general reflections, but also individual interaction with the web-based game. Stakeholders included farmers, agricultural extension officers, and representatives of branch organizations as well as agricultural authorities on the national and regional level. Focus group discussions were recorded and transcribed in order to analyze the empirical results with focus on agricultural adaptation and potential maladaptive outcomes.

Preliminary conclusions from these workshops point towards several issues relating both to the content and to the functionality of the game. While, as a general conclusion, the stakeholders were able to get acquainted with the game quickly and interact without major difficulties, a few individual participants were sceptical of the general idea of engaging with a game to discuss these issues. The level of interactivity the game allows, where players can test and explore before making a decision, enabled reflection and discussion during the gameplay as well. Stakeholders frequently tested and returned to some of the possible choices before deciding on their final setting. Since the game demands the acceptance of a potential negative outcome, several stakeholders described their impression of the game as a ‘pest or cholera’ situation, i.e. a choice between two evils. In terms of empirical results, the workshops generated a large number of issues regarding the definition of maladaptive outcomes and their thresholds in relation to contextual aspects, such as temporal and spatial scales, as well as reflections on the relevance and applicability of the proposed adaptation measures and negative outcomes.

Neset-Serious gaming to support stakeholder participation and analysis-171_a.pdf

Poster [abstract]

Challenges in textual criticism and editorial transparency

Elisa Johanna Veit, Pieter Claes, Per Stam

Svenska litteratursällskapet i Finland

Henry Parlands Skrifter (HPS) is a digital critical edition of the works and correspondence of the modernist author Henry Parland (1908–1930). The poster presents chosen strategies for communicating the results of the process of textual criticism in a digital environment. How can we make the foundations for editorial decisions transparent and easily accessible to a reader?

Textual criticism is, by one of several definitions, “the scientific study of a text with the intention of producing a reliable edition” (Nationalencyklopedin, “textkritik”; our translation). When possible, the texts of the HPS edition are based on original prints whose publication was initiated by the author during his lifetime. However, producing a reliable text largely requires a return to the original manuscripts, as only a fraction of Parland’s works were published before the author’s death at the age of 22 in 1930. Posthumous publications often lack reliability due to the editorial practices, and sometimes primarily aesthetic solutions to textual problems, of later editors.

The main structure of the Parland digital edition is related to that of Zacharias Topelius Skrifter and similar editions. However, the Parland edition has forgone the system of a, theoretically, unlimited number of columns in favour of only two fields for text: a field for the reading text, which holds a central position on the webpage, and a smaller, optional field containing, in different tabs, editorial commentary, facsimiles and transcriptions of manuscripts and original prints. The benefit of this approach is easier navigation. If readers wish to view several fields at once, they may do so by using several browser windows, as explained in the user’s guide.

The texts of the edition are transcribed in XML and encoded following the TEI (Text Encoding Initiative) Guidelines P5. Manuscripts, or original prints, and edited reading texts are rendered in separate files (see further below). All manuscripts and original prints used in the edition are presented as high-resolution facsimiles. The reader thus has access to the different versions of the text in full, as a complement to the editorial commentary.

Parland’s manuscripts often contain several layers of changes (additions, deletions, substitutions): those made by the author himself during the initial process of writing or during a later revision, and those made by posthumous editors selecting and preparing manuscripts for publication. The editor is thus required to analyse the manuscripts in order to include only changes made by the author in the text of the edition. The posthumous changes are included in the transcriptions of the manuscripts and encoded using the same TEI elements as the author’s changes, with the addition of attributes indicating the other hand and pen (@hand and @medium). In the digital edition these changes, as well as other posthumous markings and notes, are displayed in a separate colour. A tooltip displays the identity of the other hand.
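A minimal TEI sketch of this distinction might look as follows (an invented fragment for illustration, not taken from the edition’s files; the hand identifiers and wording are hypothetical):

```xml
<!-- In the header: the hands occurring in the manuscript -->
<handNotes>
  <handNote xml:id="HP">Henry Parland</handNote>
  <handNote xml:id="ed1">posthumous editor</handNote>
</handNotes>

<!-- In the transcription: an authorial substitution followed by a
     posthumous deletion, distinguished by @hand and @medium -->
<l>one
  <subst>
    <del hand="#HP" medium="ink">word</del>
    <add hand="#HP" medium="ink">line</add>
  </subst>
  <del hand="#ed1" medium="pencil">phrase removed by a later editor</del>
</l>
```

Because both layers use the same elements, a stylesheet can colour and caption the posthumous changes simply by testing the value of @hand.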

One of the benefits of this solution is transparency towards the reader through visualization of the editor’s interpretation of all sections of the manuscript. Using standard TEI elements and attributes facilitates possible reuse of the XML documents for purposes outside the edition. For the Parland project, there were also practical benefits concerning technical solutions and workflow in using mark-up that had already, though to a somewhat smaller extent, been used by the Zacharias Topelius edition.

The downside of using the same elements for both authorial and posthumous changes is that the XML file does not easily lend itself to a visualization of the author’s version. Although this would surely not be impossible with an appropriately designed stylesheet, we have deemed it more practical to keep manuscripts and edited reading texts in separate files. All posthumous interventions and the associated mark-up are removed from the edited text, which has the added practical benefit of making the XML document more easily readable to a human editor. However, the information value of the separate files is more limited than that of a single file would be.

The file with the edited text still contains the complete authorial version, according to the critical analysis of the editor. Editorial changes to the author’s text are grouped together with the original wording in the TEI element choice, and the changes are visualized in the digital edition. The changed section is highlighted and the original wording displayed in a tooltip. Thus, the combination of facsimile, transcription and edited text in the digital edition visualizes the editor’s source(s), interpretation and changes to the text.
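In TEI terms, such an emendation might be encoded as follows (a hypothetical fragment, not from the edition’s files; the wording and the responsibility pointer are invented):

```xml
<!-- The editor's correction grouped with the author's original wording;
     the edition displays the corr reading and shows sic in a tooltip -->
<choice>
  <sic>recieved</sic>
  <corr resp="#editor">received</corr>
</choice>
```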


Nationalencyklopedin, “textkritik” (accessed 2017-10-19).

Veit-Challenges in textual criticism and editorial transparency-154_a.pdf
Veit-Challenges in textual criticism and editorial transparency-154_c.pdf

Poster [publication ready]

Digitizing the Icelandic-Danish Blöndal Dictionary

Steinþór Steingrímsson

The Árni Magnússon Institute for Icelandic Studies, Iceland

The Icelandic-Danish dictionary compiled by Sigfús Blöndal in the early 20th century is being digitized. It is the largest dictionary ever published in Icelandic, containing in total more than 150,000 entries. The digitization work started with a pilot project in 2016, resulting in a comprehensive plan for how to carry out the task. The paper describes the ongoing work and the methods and tools applied, as well as the aim of the project and its rationale. We opted for OCR rather than double-keying, which has become common for similar projects. First results suggest that the outcome is satisfactory, as the final version will be proofread. The entries are annotated with XML entities, using a workbench built for the project. We apply automatic annotation for the most consistent entities, while other annotation is carried out manually. The data is then exported into a relational database, proofread and finally published. The publication date is set for spring 2020.

Steingrímsson-Digitizing the Icelandic-Danish Blöndal Dictionary-227_a.pdf

Poster [abstract]

Network visualization for historical corpus linguistics: externally-defined variables as node attributes

Timo Korkiakangas

University of Oslo

In my poster presentation, I will explore whether and how network visualization can benefit philological and historical-linguistic research. This will be implemented by examining the usability of network visualization for the study of early medieval Latin scribes’ language competences. The scope is thus mainly methodological, but the proposed methodological choices will be illustrated by applying them to a real data set. Four linguistic variables, extracted corpus-linguistically from a treebank, will be examined: spelling correctness, classical Latin prepositions, the genitive plural form, and the <ae> diphthong. All four are continuous, which is typical of linguistic variables. The variables represent different domains of language competence of scribes who, by that time, learnt written Latin practically as a second language. More linguistic features will be included in the analysis if my ongoing project proceeds as planned.

Thus, the primary objective of the study is to find out whether the network visualization approach has demonstrable advantages over ordinary cross-tabulations as far as support for philological and historical-linguistic argumentation is concerned. The main means of visualization will be the gradient colour palette in Gephi, a widely used open-source network analysis and visualization package. As an inevitable part of the described enterprise, it is necessary to clarify the scientific premises for using a network environment to display externally defined values of linguistic variables. It is obvious that, in order to be utilized for research purposes, network visualization must be as objective and replicable as possible.

By way of definition, I emphasize that the proposed study will not deal with linguistic networks proper, i.e. networks which are directly induced or synthesized from a linguistic data set and represent abstract relations between linguistic units. Consequently, no network metrics will be calculated, even though that might be interesting as such. What will be visualized are the distributions of linguistic variables that do not arise from the network itself but are derived externally from a medium-sized treebank by exploiting its lemmatic, morphological and, hopefully, also syntactic annotation layers. These linguistic variables will be visualized as attributes of the nodes in the trimodal “social” network consisting of the documents, persons, and places that underlie the treebank. These documents, persons, and places are encoded as metadata in the treebank. The nodes are connected to each other by unweighted edges. The number of document nodes is 1,040, scribe nodes 220, and writing-place nodes 84. In most cases, the definition of the 220 scribe nodes is straightforward, given that the scribes scrupulously signed what they wrote, with the exception of eight documents. The place nodes are more challenging: although 78% of the documents were written in the city of Lucca, the disambiguation and re-grouping of small localities of which little is known was time-consuming, and the results were not always fully satisfying. The nodes will be placed on a map background by utilizing Gephi’s Geo Layout and Force Atlas 2 algorithms.

The linguistic features that will be visualized reflect the language change that took place in late Latin and early medieval Latin, roughly the 3rd to 9th centuries AD. The features are operationalized as variables which quantify the variation of those features in the treebank. This quantification is based on the numerical output of a plethora of corpus-linguistic queries which extract from the treebank all constructions or forms that meet the relevant criteria. The variables indicate the relative frequency of the examined features in each document, scribe, and writing place. For the scribes and writing places, the percentages are calculated by counting the occurrences within all the documents written by that scribe or in that place, respectively.
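The pooling of occurrences by scribe and writing place described above can be sketched as follows; the scribe and place names and all counts are invented for illustration:

```python
from collections import defaultdict

# Hypothetical per-document counts: (scribe, place, feature_hits, word_count);
# the names and numbers are invented for illustration.
docs = [
    ("Gumpertus", "Lucca", 12, 400),
    ("Gumpertus", "Lucca", 7, 250),
    ("Teudilascius", "Pisa", 3, 300),
]

def relative_frequencies(docs, key):
    """Pool feature occurrences over all documents written by the same
    scribe (or in the same place) and divide by the pooled word count."""
    hits, totals = defaultdict(int), defaultdict(int)
    for scribe, place, h, n in docs:
        k = scribe if key == "scribe" else place
        hits[k] += h
        totals[k] += n
    return {k: hits[k] / totals[k] for k in hits}

print(relative_frequencies(docs, "scribe"))  # Gumpertus pooled over two documents
```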

The resulting linguistic variables are continuous, hence the practicality of gradient colouring. In order to ground the colouring in the statistical dispersion of the variable values and to preserve maximal visual effect, I customize Gephi's default red-yellow-blue palette so that maximal yellow, which stands for the middle of the colour scale, marks the mean of the distribution of each variable. Likewise, the thresholds of maximal red and maximal blue are set equally far from the mean; I chose that distance to be two standard deviations. In this way, only around 2.5% of the nodes at each end of the distribution are maximally saturated with red or blue, while the remaining ca. 95% of the nodes receive a gradient colour, including maximal yellow in between. Following this rule, I will illustrate the variables both separately and as a sum variable, calculated by aggregating the standardized simple variables. The images will be available in the poster.
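As a sketch of this colouring rule (my illustration of the scheme, not Gephi's internal implementation), the mapping from variable values to red-yellow-blue colours could look like this:

```python
import statistics

def gradient_colour(value, mean, sd):
    """Map a variable value to a red-yellow-blue gradient.

    The scale is centred on the mean (maximal yellow); values at or
    beyond two standard deviations from the mean saturate to red
    (low end) or blue (high end)."""
    # Normalise to [-1, 1] over the mean +/- 2*sd window, clipping the tails.
    t = max(-1.0, min(1.0, (value - mean) / (2 * sd)))
    if t < 0:   # red -> yellow: ramp green up
        return (255, round(255 * (1 + t)), 0)
    else:       # yellow -> blue: ramp red/green down, blue up
        return (round(255 * (1 - t)), round(255 * (1 - t)), round(255 * t))

values = [0.12, 0.30, 0.45, 0.52, 0.70, 0.95]
m, s = statistics.mean(values), statistics.stdev(values)
print(gradient_colour(m, m, s))          # maximal yellow at the mean
print(gradient_colour(m + 2 * s, m, s))  # maximal blue two SDs above
```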

The preliminary conclusions include the observation that network visualization as such is not a sufficient basis for philological or historical-linguistic argumentation, but if used along with a statistical approach, it can support argumentation by drawing attention to unexpected patterns and, on the other hand, to irregularities. However, it is the geographical layout of the graphs that adds the most value compared to traditional approaches: it helps in perceiving patterns that would otherwise go unnoticed.

The treebank on which the analyses are based is the Late Latin Charter Treebank (version 2, LLCT2), which consists of 1,040 early medieval Latin documentary texts (c. 480,000 words). The documents were written in historical Tuscia (Tuscany), Italy, between AD 714 and 897, and are mainly sale or purchase contracts or donations, accompanied by a few judgements as well as lists and memoranda. LLCT2 is still under construction, and only its first half is so far provided with a syntactically annotated layer, making that half a treebank proper (i.e. LLCT, version 1). The lemmatization and morphological annotation follow the style of the Ancient Greek and Latin Dependency Treebank (AGLDT), as laid out in the Guidelines for the Syntactic Annotation of Latin Treebanks. Korkiakangas & Passarotti (2011) define a number of additions and modifications to these general guidelines, which were designed for Classical Latin. For a more detailed description of LLCT2 and the underlying text editions, see Korkiakangas (in press). Documents are privileged material for examining the spoken/written interface of early medieval Latin, in which the distance between the spoken and written codes had grown considerable by Late Antiquity. The LLCT2 documents have precise dating and location metadata, and they survive as originals.


Adams J.N. Social variation and the Latin language. Cambridge University Press (Cambridge), 2013.

Araújo T. and Banisch S. Multidimensional Analysis of Linguistic Networks. Mehler A., Lücking A., Banisch S., Blanchard P. and Job, B. (eds) Towards a Theoretical Framework for Analyzing Complex Linguistic Networks. Springer (Berlin, Heidelberg), 2016, 107-131.

Bamman D., Passarotti M., Crane G. and Raynaud S. Guidelines for the Syntactic Annotation of Latin Treebanks (v. 1.3), 2007

Barzel B. and Barabási A.-L. Universality in network dynamics. Nature Physics. 2013;9:673-681.

Bergs A. Social Networks and Historical Sociolinguistics: Studies in Morphosyntactic Variation in the Paston Letters. Walter de Gruyter (Berlin), 2005.

Ferrer i Cancho R. Network theory. Hogan P.C. (ed.) The Cambridge Encyclopedia of the Language Sciences. Cambridge University Press (Cambridge), 2010, 555–557.

Korkiakangas T. (in press) Spelling Variation in Historical Text Corpora: The Case of Early Medieval Documentary Latin. Digital Scholarship in the Humanities.

Korkiakangas T. and Lassila M. Abbreviations, fragmentary words, formulaic language: treebanking medieval charter material. Mambrini F., Sporleder C. and Passarotti M. (eds) Proceedings of the Third Workshop on Annotation of Corpora for Research in the Humanities (ACRH-3), Sofia, December 13, 2013. Bulgarian Academy of Sciences (Sofia), 2013, 61-72.

Korkiakangas T. and Passarotti M. Challenges in Annotating Medieval Latin Charters. Journal of Language Technology and Computational Linguistics. 2011;26,2:103-114.

Korkiakangas-Network visualization for historical corpus linguistics-237_a.pdf
Korkiakangas-Network visualization for historical corpus linguistics-237_c.pdf

Poster [abstract]

Approaching a digital scholarly edition through metadata

Katarina Pihlflyckt

Svenska litteratursällskapet i Finland r.f.

This poster presents a flowchart giving an overview of the database structure in the digital critical edition of Zacharias Topelius Skrifter (ZTS). It shows how the entity relations make it possible for the user to approach the edition from other angles than the texts, using informative metadata through indexing systems. Through this data, a historian can easily find, for example, events, meetings between people or editions of books, as they are presented in Zacharias Topelius’ (1818–1898) texts. Presented here are both already available features and features in progress.

ZTS comprises eight digital volumes hitherto, the first published in 2010. This corresponds to about 8 500 pages of text by Topelius, 600 pages of introductions by the editors and 13 000 annotations. The published volumes cover poetry, short stories, correspondence, children’s textbooks, historical-geographical works and university lectures on history and geography. The edition is freely accessible online. Genres still to be published include children’s books, novels, journalism, academica, diaries and religious texts.


The ZTS database structure consists of six connected databases: people, places, bibliography, manuscripts, letters and a chronology. So far, the people database contains about 10 000 unique persons, with the possibility of linking them to a family or group level (250 records). It has separate sections for mythological persons (500 records) and fictive characters (250 records). The geographic database has 6 000 registered places. The bibliographic database has 6 000 editions distributed over 3 500 different works, and the manuscript database has 1 400 texts on 350 physical manuscripts. The letter database has 4 000 registered letters to and from Topelius, divided into 2 000 correspondences. The chronology of Topelius’ life has 7 000 marked events. The indexing of objects started in 2005, using the FileMaker system. New records are continuously added, and work on finding new ways to use, link and present the data is in constant progress. Users can freely access the information in database records that link to the published volumes.

The bibliographic database is the most complex one. Its structure follows the Functional Requirements for Bibliographic Records (FRBR) model, which means we distinguish between the abstract work and the published manifestations (editions) of that work. FRBR focuses on the content relationship and continuum between the levels; anything regarded as a separate work starts as a new abstract record, from which its own editions are created. Within ZTS, the abstract level has a practical significance in cases where it is impossible to determine which exact edition Topelius is referring to. Also taken into consideration is that, for example, articles and short stories can have their own independent editions as well as being included in other editions (e.g. a magazine, an anthology). This requires two different manifestation levels subordinate to the abstract level: regular editions, and texts included in other editions, where records of the latter type must always link to records of the former.
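The two manifestation levels under the abstract work can be sketched as a simple data model; the class names, titles and years below are invented placeholders, not the actual ZTS schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Work:
    """Abstract FRBR work; its concrete editions are listed below it."""
    title: str
    manifestations: List["Manifestation"] = field(default_factory=list)

@dataclass
class Manifestation:
    """A published edition. A text included in another edition links to
    its host edition via `host`; independent editions leave it None."""
    work: Work
    year: int
    host: Optional["Manifestation"] = None

    def __post_init__(self):
        # Included texts must link to an independent (host-less) edition.
        assert self.host is None or self.host.host is None
        self.work.manifestations.append(self)

story = Work("A short story")                       # abstract level
magazine = Manifestation(Work("A magazine"), 1850)  # independent edition
included = Manifestation(story, 1850, host=magazine)  # text inside the magazine
```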

The manuscript database has a content relationship to the bibliographic database through the abstract entity of a work. A manuscript text can be regarded as an independent edition of a work in this context (a manuscript that was never published can easily have a future edition added in the bibliographic database). The manuscript text itself might share physical paper with another manuscript text. Therefore, the description of the physical manuscript is created on a separate level in the manuscript database, to which the manuscript text is connected.

The letter database follows the FRBR model; an upper level presents the whole correspondence between Topelius and another person, and a subordinate level describes each physical letter within the correspondence. It is possible to attach additional correspondents to individual letters.

The people database connects to the letter database and the bibliographic database, creating one-to-many relationships. Any writer or author has to be in the people database in order to have their information inserted into these two databases. Within the people database there is also a family or group level where family members can be grouped, but in contrast to the letter database, this is not a superordinate level.

The geographic database follows a one-level structure. Places in letters and manuscripts can be linked from the geographic database.

The chronology database contains manually added key events from Topelius’ life, as well as short diary entries made by him in various calendars during his life. It also has automatically gathered records from other databases, based on marked dates when Topelius works were published or when he wrote a letter or a manuscript. The dates of birth and/or death of family members and close friends can be linked from the people database.


Approaching a digital scholarly edition of over 8 500 pages can be a daunting task, and many will likely use the edition more as an object of study than as texts to read. For a user not familiar with the content of the different volumes but still looking for specific information, advanced searches and indexing systems offer a faster path into the relevant text passages. The information in the ZTS database records provides a picture of Finland in the 19th century as it appears in Topelius’ works and life. A future feature for users is access to this data through an API (Application Programming Interface). This will create opportunities for users to exploit the data in any way they want: to create a 19th-century bookshelf, an app for the most popular 19th-century names, or a map of popular student hangouts in 1830s Helsinki.

Through the indexes formed by the linked data from the texts, the user can find all the occurrences of a person, a place or a book in the whole edition. One record can build a set of ontological relations, and the user can follow a theme, while moving between texts. A search for a person will provide the user with information about where Topelius mentions this person, whether it is in a letter, in his diaries or in a textbook for schoolchildren, or if he possibly meets or interacts with the person. Furthermore, the user can see if this person was the author, publisher or perhaps translator of a book mentioned by Topelius in his texts, or if the editors of ZTS have used the book as a source for editorial comments. The user will also be able to get a list of letters the person wrote to or received from Topelius. The geographic index can help the user create a geographic ontology with an overview of Topelius’ whereabouts through the annotated mentions of places in Topelius’ diaries, letters and manuscripts.

The chronology creates a base for a timeline that will not only give the user key events from Topelius’ life, but also links to the other database records. Encoded dates in the XML files (letters, diaries, lectures, manuscripts etc.) can lead the user directly to the relevant text passages.

The relation between the bibliographic database and the manuscript database creates a complete bibliography of everything Topelius wrote, including all known manuscripts and editions that relate to a specific work. So far, there are 900 registered independent works by Topelius in the bibliographic database; these works are realized in 300 published editions (manifestations) and 2 900 text versions included in those or other independent manifestations. The manuscript database consists of 1 400 manuscript texts. The FRBR model offers different ways of structuring the layout of a bibliography according to the user’s needs, either through the titles of the abstract works with subordinate manifestations, or directly through the separate manifestations. The bibliography can be limited to show only editions published during Topelius’ lifetime, or to include later editions as well. Furthermore, the bibliography points the user to the published texts and manuscripts of a specific work in the ZTS edition and to text passages where the author himself discusses the work in question.

The level of detail is high in the records. For example, we register different name forms and spellings (Warschau vs Warszawa). Such information is included in the index search function and thereby eliminates problems for the end user trying to find information. Topelius often uses many different forms and abbreviations, and performing an advanced search in the texts would seldom give a comprehensive result in these cases. The letter database includes reference words describing the contents of the correspondences. Thus, the possibilities for searching in the material are expanded beyond the wordings of the original texts.

Pihlflyckt-Approaching a digital scholarly edition through metadata-153_a.pdf
Pihlflyckt-Approaching a digital scholarly edition through metadata-153_c.pdf

Poster [publication ready]

A Tool for Exploring Large Amounts of Found Audio Data

Per Fallgren, Zofia Malisz, Jens Edlund

KTH Royal Institute of Technology

We demonstrate a method and a set of open-source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration will contain first versions of a set of varied functionalities and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently.

Fallgren-A Tool for Exploring Large Amounts of Found Audio Data-229_a.pdf

Poster [publication ready]

The PARTHENOS Infrastructure

Sheena Dawn Bassett


PARTHENOS is built around two ERICs from the Humanities and Arts sector, DARIAH and CLARIN, along with ARIADNE, EHRI, CENDARI, CHARISMA and IPERION-CH, and will deliver guidelines, standards, methods, pooled services and tools to be used by its partners and the wider research community. Four broad research communities are addressed: History; Linguistic Studies; Archaeology, Heritage and Applied Disciplines; and the Social Sciences. By identifying common needs, PARTHENOS will support cross-disciplinary research and provide innovative solutions.

By applying the FAIR data principles to structure the work on common policies and standards, the project has produced tools to assist researchers to find and apply the appropriate ones for their areas of interest. A virtual research environment will enable the discovery and use of data and tools and further support is provided with a set of online training modules.

Bassett-The PARTHENOS Infrastructure-107_a.pdf

Poster [abstract]

Using rolling.classify on the Sagas of Icelanders: Collaborative Authorship in Bjarnar saga Hítdælakappa

Daria Glebova

Russian State Academy of Science, Institute of Slavonic Studies

This poster will present the results of applying the rolling.classify function in Stylo (R) to a source with unknown authorship and an extremely poor textual history: Bjarnar saga Hítdælakappa, one of the medieval Sagas of Icelanders. This case study sets aside the usual Stylo goal of authorship attribution and concentrates on the composition of the main witness of Bjarnar saga, ms. AM 551 d α, 4to (17th c.), which was the source for most of the existing copies of Bjarnar saga. It aims not only to find and visualise new arguments for the working hypothesis about the composition of AM 551 d α, 4to, but also to touch upon the main questions that arise before a student of philology daring to use Stylo on Old Icelandic saga ground, i.e. what Stylo tells us, what it does not, and how one can use it while exploring the history of a text that exists only in one source.

It has been noticed that Bjarnar saga shows signs of a stylistic change between the first 10 chapters and the rest of the saga: the characters suddenly change their behaviour (Sigurður Nordal 1938, lxxix; Andersson 1967, 137-140), and the narrative becomes less coherent and, as it seems, acquires a new logic of construction (Finlay 1990-1993, 165-171). A more detailed narrative analysis of the saga showed that there is a difference in the usage of certain narrative techniques between the first and second parts, for example in the narrator’s handling of point of view and the extent of their intervention in the saga text (Glebova 2017, 45-57). Thus, the question is: what is the relationship between the first 10 chapters and the rest of Bjarnar saga? Is the change entirely compositional and motivated by the narrative strategy of the medieval compiler, or is it actually the result of a compilation of two texts by two different authors?

As often happens with sagas, the problem is aggravated by the poor preservation of Bjarnar saga. There is not much to compare and work with; most of the saga witnesses are copies of one 17th-c. manuscript, AM 551 d α, 4to (Boer 1893, xii-xiv; Sigurður Nordal 1938, xcv-xcvii; Simon 1966 (I), 19-149). This manuscript also has its flaws, as it contains two lacunae, one at the very beginning of the saga (ch. 1-5,5 in ÍF III) and another in the middle (between ch. 14-15 in ÍF III). The second lacuna is unreconstructable, while the first one is usually supplied from a fragment of the saga’s short redaction, preserved in copies of a 15th-c. kings’ saga compilation, the Separate saga of St. Olaf in Bœjarbók (Finlay 2000, xlvi), which actually ends right at the 10th chapter of the longer version. It seems that the text of the shorter version is a variant of the longer one (Glebova 2017, 13-17), and it contains a reference indicating that there was more to the story but it was shortened; the precise relationship between the short and long redactions, however, is impossible to reconstruct due to the lacuna in AM 551 d α, 4to. The existence of the short version with this particular length and content is very important to the study of the composition of Bjarnar saga in AM 551 d α, 4to, as it raises the possibility that the first 10 chapters of AM 551 d α, 4to existed separately at some point in the textual history of Bjarnar saga, or at least that these chapters were seen by the medieval compilers as something solid and complete. This would be the last word of traditional philology on this case: the state of the sources does not allow saying more. Thus, is there anything else that could shed light on the question whether these chapters existed separately or were written by the same hand?

In this study, it was decided to try the sequential stylometric analysis available in the Stylo package for R (Eder, Kestemont, Rybicki 2013) as the function rolling.classify (Eder 2015). As we are interested in different parts of the same text, rolling stylometry seems preferable to cluster analysis, which takes the whole text as an entity and compares it to the reference corpus; in rolling stylometry, by contrast, the text is divided into smaller segments, which allows a deeper investigation of the stylistic variation within the text itself (Rybicki, Eder, Hoover 2016, 126). For the analysis, a corpus was compiled from the two parts of Bjarnar saga and several other Old Icelandic sagas, all in Modern Icelandic normalised orthography. Several tests were conducted, first with one of the parts as the test set and then with the other, with sample sizes ranging from 5,000 down to 2,000 words. The preliminary results show that there is a stylistic division in the saga, as the style of the first part is not present in the second and vice versa.
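The logic of rolling classification can be illustrated with a toy stand-in: slide a window over the test text, build a most-frequent-word (MFW) frequency profile for each window, and label the window with the nearest training profile. (Stylo's actual rolling.classify supports Delta, SVM and NSC classifiers and many further options; the vocabulary and profiles below are invented.)

```python
from collections import Counter

def mfw_profile(tokens, vocab):
    """Relative frequencies of the most-frequent-word vocabulary."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

def rolling_classify(test_tokens, window, step, profiles, vocab):
    """Label each window of the test text with the closest training
    profile (Manhattan distance on MFW frequencies)."""
    labels = []
    for i in range(0, len(test_tokens) - window + 1, step):
        p = mfw_profile(test_tokens[i:i + window], vocab)
        dist = {label: sum(abs(x - y) for x, y in zip(p, q))
                for label, q in profiles.items()}
        labels.append(min(dist, key=dist.get))
    return labels

# Toy example: two 50-word stretches dominated by different function words.
vocab = ["og", "að"]
profiles = {"part1": [0.10, 0.0], "part2": [0.0, 0.10]}
tokens = ["og"] * 5 + ["x"] * 45 + ["að"] * 5 + ["x"] * 45
print(rolling_classify(tokens, 50, 50, profiles, vocab))  # -> ['part1', 'part2']
```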

This would be an additional argument for the idea that the first 10 chapters existed separately and were added by the compiler during the construction of Bjarnar saga. One could argue that this is not an authorial but a generic division, as the first part is set in Norway and deals extensively with St. Olaf; the change of genre could result in a change of style. However, Stylo counts the most frequent words, which are not particularly genre-specific (like og, að, etc.); thus, collaborative authorship could still have taken place. This would be an important result in the context of the overall composition of the longer version of Bjarnar saga, as its structure shows traces of very careful planning and also of mirror composition (Glebova 2017, 18-33): could it be that the structure of one of the parts (perhaps the first) influenced the other? Whatever the case, while sewing together the existing material, the medieval compiler made an effort to create a solid text, and this effort is worth studying with more attention.


Andersson, Theodor M. (1967). The Icelandic Family Saga: An Analytic Reading. Cambridge, MA.

Boer, Richard C. (1893). Bjarnar saga Hítdælakappa, Halle.

Eder, M. (2015). “Rolling Stylometry.” Digital Scholarship in the Humanities 31(3): 457–469.

Eder, M., Kestemont, M., Rybicki, J. (2013). “Stylometry with R: A Suite of Tools.” Digital Humanities 2013: Conference Abstracts. University of Nebraska–Lincoln: 487–489.

Finlay, A. “Nið, Adultery and Feud in Bjarnar saga Hítdælakappa.” Saga-Book of the Viking Society 23 (1990-1993): 158-178.

Finlay, A. The Saga of Bjorn, Champion of the Men of Hitardale, Enfield Lock, 2000.

Glebova D. A Case of An Odd Saga. Structure in Bjarnar saga Hítdælakappa. MA thesis, University of Iceland. Reykjavík, 2017.

Rybicki, J., Eder, M., Hoover, David L. “Computational Stylistics and Text Analysis.” In Doing Digital Humanities: Practice, Training, Research, edited by Constance Crompton, Richard J. Lane, Ray Siemens. London, New York: 123-144.

Sigurður Nordal, and Guðni Jónsson (eds.) “Bjarnar saga Hítdælakappa.” In Borgfirðinga sögur, Íslenzk fornrit 3, 111-211. Reykjavík, 1938.

Simon, John LeC. A Critical Edition of Bjarnar saga Hítdælakappa. Vol. 1-2. Unpublished PhD thesis, University of London, 1966.

Glebova-Using rollingclassify on the Sagas of Icelanders-244_a.pdf
Glebova-Using rollingclassify on the Sagas of Icelanders-244_c.pdf

Poster [abstract]

The Bank of Finnish Terminology in Arts and Sciences – a new form of academic collaboration and publishing

Johanna Enqvist, Tiina Onikki-Rantajääskö

University of Helsinki

This presentation concerns the multidisciplinary research infrastructure project “Bank of Finnish Terminology in Arts and Sciences” (BFT) as an innovative form of academic collaboration and publishing. The BFT, launched in 2012, aims to build a permanent and continuously updated terminological database for all fields of research in Finland. Content for the BFT is created by niche-sourcing, in which participation is limited to a particular group of experts in the participating subject fields. The project maintains a wiki-based website which offers an open and collaborative platform for terminological work and a discussion forum available to all registered users.

The BFT thus opens up not only the results but the whole academic process in which knowledge is constantly produced, evaluated, discussed and updated. The BFT also provides an inclusive arena for all interested parties – students, journalists, translators and enthusiasts – to participate in discussions relating to concepts and terms in Finnish research. Based on the knowledge and experience accumulated during the BFT project, we will reflect on the benefits, challenges and future prospects of this innovative and globally unique approach. Furthermore, we will consider the possibilities and opportunities opening up especially in terms of digital humanities.

Enqvist-The Bank of Finnish Terminology in Arts and Sciences – a new form of academic collaboration and p_a.pdf
Enqvist-The Bank of Finnish Terminology in Arts and Sciences – a new form of academic collaboration and p_c.pdf

Poster [publication ready]

The Swedish Language Bank 2018: Research Resources for Text, Speech, & Society

Lars Borin1, Markus Forsberg1, Jens Edlund2, Rickard Domeij3

1University of Gothenburg; 2KTH Royal Institute of Technology; 3The Institute for Language and Folklore

We present an expanded version of the Swedish research resource the Swedish Language Bank. The Language Bank, which has supported national and international research for over four decades, will now add two branches, one focusing on speech and one on societal aspects of language, to its existing organization, which targets text.

Borin-The Swedish Language Bank 2018-269_a.pdf

Poster [abstract]

Handwritten Text Recognition and 19th Century Court Records

Maria Kallio

National Archives of Finland

This paper will demonstrate how the READ project is developing new technologies that allow computers to automatically process and search handwritten historical documents. These technologies are brought together in the Transkribus platform, which can be downloaded free of charge. Transkribus enables scholars with no in-depth technological knowledge to freely access and exploit algorithms which can automatically process handwritten text. Although a rather sound workflow is already in place, the platform needs human input in order to ensure the quality of the recognition. The technology must be trained by being shown examples of document images and their accurate transcriptions; this helps it to learn the patterns which make up characters and words. The training data is used to create a Handwritten Text Recognition model which is specific to a particular collection of documents. The more training data there is, the more accurate the Handwritten Text Recognition can become.

Once a Handwritten Text Recognition model has been created, it can be applied to other pages from the same collection of documents. The machine analyses the image of the handwriting and then produces textual information about the words and their position on the page, providing best guesses and alternative suggestions for each word, with measures of confidence. This process allows Transkribus to provide the automatic transcription and full-text search of a document collection at high levels of accuracy.

For the quality of the text recognition, the amount of training material is paramount. Current tests suggest that models for a specific style of handwriting can reach a Character Error Rate of less than 5%. Transcripts with a Character Error Rate of 10% or below can generally be understood by humans and used for adequate keyword searches. A low Character Error Rate also makes it relatively quick and easy for human transcribers to correct the output of the Handwritten Text Recognition engine; these corrections can then be fed back into the model in order to make it more accurate. These levels also compare favorably with Optical Character Recognition, where 95-98% accuracy is possible for early prints.
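Character Error Rate is conventionally computed as the character-level edit distance between the automatic transcript and a ground-truth transcript, divided by the length of the ground truth. A minimal implementation (the Finnish example word is my own illustration):

```python
def cer(reference, hypothesis):
    """Character Error Rate: Levenshtein edit distance between the two
    strings, divided by the length of the reference transcript."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # DP row: distances against empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / m

print(cer("käräjät", "karajat"))  # 3 substitutions / 7 characters ≈ 0.43
```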

Of even more interest is the fact that a well-trained model is able to sustain a certain amount of differences in handwriting. Therefore, it can be expected that, with a large amount of training material, it will be possible to recognize the writing of an entire epoch (e.g. eighteenth-century English writing), in addition to that of specific writers.

The case study of this paper is Finnish court records from the 19th century. The notification records, which contain cases concerning guardianships, titles and marriage settlements, form an enormous collection of over 600 000 pages. Although the material is in digital form, its usability is still poor due to the lack of indices or finding aids. With the help of Handwritten Text Recognition, the National Archives has the chance to provide the material in computer-readable form, which allows users to search and use the records in a whole new way.

Kallio-Handwritten Text Recognition and 19th Century Court Records-246_a.docx

Poster [publication ready]

An approach to unsupervised ontology term tagging of dependency-parsed text using a Self-Organizing Map (SOM)

Seppo Nyrkkö

University of Helsinki

Tagging ontology-based terms in existing text content is a task that often requires human effort. Each ontology may have its own structure and schema for describing terms, making automation non-trivial. I suggest a machine-learning estimation technique for term tagging which can learn semantic tagging from a set of sample ontologies with given textual examples, and extend its use to analyzing a large text corpus by comparing the syntactic features found in the text. The tagging technique is based on dependency-parsed text input and an unsupervised machine learning model, the Self-Organizing Map (SOM).
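In outline, a SOM maps each feature vector to a best-matching unit on a trained grid, so that nearby units end up representing similar inputs. A minimal 1-D SOM sketch of the model class (my illustration with toy vectors, not the author's implementation):

```python
import random

def bmu(units, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(units)),
               key=lambda u: sum((units[u][d] - x[d]) ** 2 for d in range(len(x))))

def train_som(data, grid, dim, epochs=50, lr=0.5, radius=1.0, seed=0):
    """Minimal 1-D Self-Organizing Map: each grid unit holds a weight
    vector; the best-matching unit and its grid neighbours are moved
    towards each input vector, with a decaying learning rate."""
    rng = random.Random(seed)
    units = [[rng.random() for _ in range(dim)] for _ in range(grid)]
    for _ in range(epochs):
        for x in data:
            b = bmu(units, x)
            for u in range(grid):
                h = max(0.0, 1.0 - abs(u - b) / (radius + 1))  # neighbourhood
                for d in range(dim):
                    units[u][d] += lr * h * (x[d] - units[u][d])
        lr *= 0.95
    return units

# Toy feature vectors from two clusters map to different units after training.
units = train_som([[0, 0], [0.1, 0.0], [1, 1], [0.9, 1.0]], grid=4, dim=2)
```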

Nyrkkö-An approach to unsupervised ontology term tagging-147_a.pdf
Nyrkkö-An approach to unsupervised ontology term tagging-147_c.pdf

Poster [abstract]

Comparing Topic Model Stability Between Finnish, Swedish and French

Simon Hengchen, Antti Kanner, Eetu Mäkelä, Jani Marjanen

University of Helsinki

1 Abstract

In recent years, topic modelling has gained increasing attention in the humanities.

Unfortunately, little has been done to determine whether the output produced by this range of probabilistic algorithms reveals signal or merely produces noise, or how well it performs on languages other than English.

In this paper, we set out to compare topic models of parallel corpora in Finnish, Swedish, and French, and propose a method to determine how well the topic modelling algorithms perform on those languages.

2 Context

Topic modelling (TM) is a well-known (following the work of (4; 5)) yet poorly understood range of algorithms within the humanities.

While a variety of studies within the humanities make use of topic models to answer historical questions (see (2) for a thorough survey), there is no tried-and-true method to ascertain that the probabilistic algorithm reveals signal rather than merely responding to noise.

The rule of thumb is generally that if the results are interesting and confirm a prior intuition of a domain expert, they are considered correct -- in the sense that they are a valid entry point into a humongous dataset, and that the proper work of historical research is then to be carried out manually on a subset selected by the algorithm.

As pointed out in previous work (7; 3), this, combined with the fact that many humanistic corpora are on the small side, means that "the threshold for the utility of topic modelling across DH projects is as yet highly unclear."

Similarly, topic instability "may lead to research being based on incorrect foundational assumptions regarding the presence or clustering of conceptual fields on a body of work or source material" (3).

Whilst topic modelling techniques are considered language-independent, i.e. "use[] no manually constructed dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphologies, or the like" (6), they encode key assumptions about the statistical properties of language.

These assumptions are often developed with English in mind and generalised to other languages without much consideration.

We maintain that these algorithms are not language-independent, but language-agnostic at best, and that accounting for discrepancies in how different languages are processed by the same algorithms is necessary basic research for more applied, context-oriented research -- especially for the historical development of public discourses in multilingual societies or phenomena where structures of discourse flow over language borders.

Indeed, some languages heavily rely on compounding -- the creation of a word through the combination of two or more stems -- in word formation, while others use determiners to combine simple words.

If one considers a white space as the delimitation between words (as is usually done with languages using the Latin alphabet), the first tendency results in a richer vocabulary than the second, thereby influencing TM algorithms that follow the bag-of-words approach.

Similarly, differences in grammar -- for example, French adjectives must agree in gender and number with the noun they modify, something that does not exist in other languages like English -- reinforce those discrepancies.

Nonetheless, most of this happens in the fuzzy and non-standard preprocessing stage of topic modelling, and the argument could be made that the language neutrality of TM algorithms rests more on their being underspecified with regard to how the language is pre-processed.

In this paper, we propose to compare topic models on a custom-made parallel corpus in Finnish, Swedish, and French.

By selecting those languages, we have a glimpse of how a selection of different languages are processed by TM algorithms.

While concentrating on languages spoken in Europe and languages of interest to our collaborative network of linguists, historians and computer scientists, we are still able to examine two crucial variables: one of genetic and one of cultural relatedness.

French and Swedish belong to the Indo-European family (Romance and Germanic branches, respectively), while Finnish is a Finno-Ugrian language.

Finnish and Swedish, on the other hand, share a long history of close language contact and cultural convergence.

Because of this, Finnish contains a large number of Swedish loan words, and, perceivably, similar conceptual systems.

3 Methodology

To explore our hypothesis, we use a parallel corpus of born-digital textual data in Finnish, Swedish, and French.

Once the corpus is constituted, it becomes possible to apply LDA (1) and HDP (9) -- LDA is parametrised by humans, whereas HDP attempts to determine the best configuration automatically.

The resulting models for each language are stored, the corpora are reduced in size, LDA is re-applied, the new models are stored, the corpora are reduced again, and so on.

Topic models are compared manually between languages at each stage, and programmatically between stages, using the Jaccard Index (8), for all languages.

The same workflow is then applied to the lemmatised version of the above-mentioned corpora, and results compared.
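The abstract does not specify how the between-stage Jaccard comparison is implemented. Assuming topics are represented by their top-word lists, the comparison could be sketched as follows; the function names and the averaging over best matches are our own illustrative assumptions, not details from the paper.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two topic word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def topic_stability(model_a, model_b):
    """Mean best-match Jaccard score: each topic in model_a is matched
    to its most similar topic in model_b, and the scores are averaged."""
    return sum(max(jaccard(t, u) for u in model_b) for t in model_a) / len(model_a)

# toy topics from two hypothetical model runs, given as top-word lists
run_1 = [["king", "queen", "court"], ["ship", "sea", "harbour"]]
run_2 = [["king", "queen", "castle"], ["ship", "sea", "harbour"]]
print(topic_stability(run_1, run_2))  # → 0.75
```

A stability score near 1.0 would indicate that the topics survive the corpus-reduction step largely intact, while a low score would suggest that the model is responding to noise.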


[1] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003)

[2] Brauer, R., Fridlund, M.: Historicizing topic models: a distant reading of topic modeling texts within historical studies. In: International Conference on Cultural Research in the Context of "Digital Humanities", St. Petersburg: Russian State Herzen University (2013)

[3] Hengchen, S., O'Connor, A., Munnelly, G., Edmond, J.: Comparing topic model stability across language and size. In: Proceedings of the Japanese Association for Digital Humanities Conference 2016 (2016)

[4] Jockers, M.L.: Macroanalysis: Digital Methods and Literary History. University of Illinois Press (2013)

[5] Jockers, M.L., Mimno, D.: Significant themes in 19th-century literature. Poetics 41(6), 750–769 (2013)

[6] Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2-3), 259–284 (1998)

[7] Munnelly, G., O'Connor, A., Edmond, J., Lawless, S.: Finding meaning in the chaos (2015)

[8] Real, R., Vargas, J.M.: The probabilistic basis of Jaccard's index of similarity. Systematic Biology 45(3), 380–385 (1996)

[9] Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476), 1566–1581 (2006)

Hengchen-Comparing Topic Model Stability Between Finnish, Swedish and French-231_a.pdf
Hengchen-Comparing Topic Model Stability Between Finnish, Swedish and French-231_c.pdf

Poster [abstract]

ARKWORK: Archaeological practices and knowledge in the digital environment

Suzie Thomas2, Isto Huvila1, Costis Dallas3, Rimvydas Laužikas4, Antonia Davidovic9, Arianna Traviglia6, Gísli Pálsson7, Eleftheria Paliou8, Jeremy Huggett5, Henriette Roued6

1Uppsala University; 2University of Helsinki; 3University of Toronto; 4Vilnius University; 5University of Glasgow; 6University of Venice; 7Umeå University; 8University of Copenhagen; 9Independent researcher

Archaeology and material cultural heritage have often enjoyed a particular status as a form of heritage that has captured the public imagination. As researchers from many backgrounds have discussed, it has become the locus for the expression and negotiation of European, local, regional, national and intra-national cultural identities, for public policy regarding the preservation and management of cultural resources, and for societal value in the context of education, tourism, leisure and well-being. The material presence of objects and structures in European cities and landscapes, the range of archaeological collections in museums around the world, the monumentality of the major archaeological sites, and the popular and non-professional interest in the material past are only a few of the reasons why archaeology has become a linchpin in the discussions on how emerging digital technologies and digitization can be leveraged for societal benefit. However, at the time when nations and the European community are making considerable investments in creating technologies, infrastructures and standards for digitization, preservation and dissemination of archaeological knowledge, critical understanding of the means and practices of knowledge production in and about archaeology from complementary disciplinary perspectives and across European countries remains fragmentary, and in urgent need of concertation.

In contrast to the rapid development of digital infrastructures and tools for archaeological work, relatively little is known about how digital information, tools and infrastructures are used by archaeologists and other users and producers of archaeological information such as archaeological and museum volunteers, avocational hobbyists, and others. Digital technologies (infrastructures, methods and resources) are reconfiguring aspects of archaeology across and beyond the lifecycle (i.e., also "in the wild"), from archaeological data capture in fieldwork to scholarly publication and community access/entanglement. Both archaeologists and researchers in other fields, from disciplines such as museum studies, ethnology, anthropology, information studies and science and technology studies, have conducted research on the topic, but so far their efforts have tended to be somewhat fragmented and anecdotal. This is surprising, as the need for a better understanding of archaeological practices and knowledge work has been identified for many years as a major impediment to realizing the potential of infrastructural and tools-related developments in archaeology. The shifts in archaeological practice, and in how digital technology is used for archaeological purposes, call for a radically transdisciplinary (if not interdisciplinary) approach that brings together perspectives from reflexive, theoretically and methodologically aware archaeology, information research, and sociological, anthropological and organizational studies of practice.

This poster presents the COST Action “Archaeological practices and knowledge work in the digital environment” (ARKWORK), an EU-funded network which brings together researchers, practitioners, and research projects studying archaeological practices, knowledge production and use, and the social impact and industrial potential of archaeological knowledge, to present and highlight the on-going work on the topic around Europe.

ARKWORK consists of four Working Groups (WGs), with a common objective to discuss and explore the possibilities for applying the understanding of archaeological knowledge production to tackle on-going societal challenges and to develop appropriate management/leadership structures for archaeological heritage. The individual WGs have the following specific but complementary themes and objectives:

WG1 - Archaeological fieldwork

Objectives: To bring together and develop the international, transdisciplinary state of the art of current research on archaeological fieldwork: how archaeologists conduct fieldwork and document their work and findings in different countries and contexts, and how this knowledge can be used to contribute to developing fieldwork practices and the use and usability of archaeological documentation by the different stakeholder groups in society.

WG2 - Knowledge production and archaeological collections

Objectives: To integrate and push forward the current state of the art in understanding and facilitating the use and curation of (museum) collections and repositories of archaeological data for knowledge production in society.

WG3 - Archaeological knowledge production and global communities

Objectives: To bring together and develop the current state of the art on global communities (including indigenous communities, amateurs, the neo-paganism movement, geographical and ideological identity networks, etc.) as producers and users in archaeological knowledge production, e.g. in terms of highlighting community needs, approaches to communication of archaeological heritage, crowdsourcing and volunteer participation.

WG4 - Archaeological scholarship

Objectives: To integrate and push forward the current state of the art in the study of archaeological scholarship, including academic, professional and citizen-science-based scientific and scholarly work.

In our poster we outline each of the working groups and provide a clear overview of the purposes and aspirations of the COST Action Network ARKWORK.


Poster [publication ready]

Research and development efforts on the digitized historical newspaper and journal collection of The National Library of Finland

Kimmo Kettunen, Mika Koistinen, Teemu Ruokolainen

University of Helsinki, Finland

The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12 million pages, mainly in Finnish and Swedish. Out of these, about 5.1 million pages are freely available on the web site (Digi). The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1920. The last ten years, 1911–1920, were opened in February 2017.

The digitized collection of the NLF is part of a globally expanding network of library-produced historical data that offers researchers and laypersons insight into the past. In 2012 it was estimated that there were about 129 million pages and 24 000 titles of digitized newspapers in Europe [1]. A very conservative estimate of the worldwide number of titles is 45 000 [2]. The current amount of available data is probably already much larger, as national libraries have been working steadily on digitization in Europe, North America and the rest of the world.

This paper presents work that has been carried out in the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data.

Kettunen-Research and development efforts on the digitized historical newspaper and journal collection-111_a.pdf
Kettunen-Research and development efforts on the digitized historical newspaper and journal collection-111_c.pdf

Poster [abstract]

Medieval Publishing from c. 1000 to 1500

Samu Kristian Niskanen, Lauri Iisakki Leinonen

Helsinki University

Medieval Publishing from c. 1000 to 1500 (MedPub) is a five-year project funded by the European Research Council, based at Helsinki University, and running from 2017 to 2022. The project seeks to define the medieval act of publishing, focusing on Latin authors active during the period from c. 1000 to 1500. A part of the project is to establish a database of networks of publishing. The proposed paper will discuss the main aspects of the projected database and the process of data-gathering.

MedPub’s research hypothesis is that publication strategies were not a constant but were liable to change, and that different social, literary, institutional, and technical milieux fostered different approaches to publishing. As we have already proved this proposition, the project is now advancing toward the next step, the ultimate aim of which is to complement the perception of societal and cultural changes that took place during the period from c. 1000 to 1500.

For the purposes of that undertaking, we define ‘publishing’ as a social act, involving at least two parties, an author and an audience, not necessarily always brought together. The former prepares a literary work and then makes it available to the latter. In practice, medieval publishing was often a more complex process. It could engage more parties than these two, such as commentators, dedicatees, and commissioners. The social status of these networks ranged from mediocre to grand. They could consist of otherwise unknown monks; or they could include popes and emperors.

We propose that the composition of such literary networks was broadly reactive to large-scale societal and cultural changes. If so, networks of publishing can serve as a vantage point for the observation of continuity and change in medieval societies. We shall collect and analyse an abundance of data of publishing networks in order to trace how their composition in various contexts may reflect the wider world. It is that last-mentioned aspect that is the subject of this proposal.

It is a central fact for this undertaking that medieval works very often include information on dedication, commission, and commendation; and that, more often than not, this evidence is uncomplicated to collect because the statements in question tend to be short and uniform and they normally appear in the prefaces and dedicatory letters with which medieval authors often opened their works. What is more, such accounts manifestly indicate a bond between two or more parties. By virtue of these features, the evidence in question can be collected in the quantities needed for large-scale statistical analysis and processed electronically. The function and form of medieval references to dedication and commission, furthermore, remained largely a constant. Eleventh-century dedications resemble those from, say, the fourteenth century. By virtue of such uniformity the data of dedications and commissions may well constitute a unique pool of evidence of social interaction in the Middle Ages. For the data of dedications and commissions can be employed as statistical evidence in various regional, chronological, social, and institutional contexts, something that is very rare in medieval studies.

The proposed paper will introduce the categories of information the database is to embrace and put forward for discussion the modus operandi of how the data of dedications and commissions will be harvested.

Niskanen-Medieval Publishing from c 1000 to 1500-243_a.pdf

Poster [abstract]

Making a bibliography using metadata

Lars Bagøien Johnsen, Arthur Tennøe

National Library of Norway, Norway

In this presentation we will discuss how one might create a bibliography using metadata taken from libraries in conjunction with other sources. Since metadata such as topic keywords and Dewey Decimal Classification is digitally available, our focus is on metadata, although we also look at book contents where possible.

Johnsen-Making a bibliography using metadata-235_a.pdf

Poster [abstract]

Network Analysis, Network Modeling, and Historical Big Data: The New Networks of Japanese Americans in World War II

Saara Kekki

University of Helsinki

Network analysis has become a promising methodology for studying a wide variety of systems, including historical populations. It brings new dimensions into the study of questions that social scientists and historians might traditionally ask, and allows for new questions that were previously impractical or impossible to answer using traditional methods. The increasing availability of digitized archival material and big data is making it ever more appealing. When coupled with custom algorithms and interactive visualization tools, network analysis can produce remarkable new insights.

In my ongoing doctoral research, I am employing network analysis and modeling to study the Japanese American incarceration in World War II (internment). Incarceration and the government-led dispersal of Japanese Americans disrupted the lives of some 110,000 people, including over 70,000 US citizens of Japanese ancestry, for the duration of the war and beyond. Many lost their former homes and enterprises and had to start their lives over after the war. Incarceration also had a very concrete impact on the communities: about 50% of those interned did not return to their old homes.

This paper explores the changes that took place in the Japanese American community of Heart Mountain Relocation Center in Wyoming. I will especially investigate the political networks and power relations of the incarceration community. My aim is twofold: on the one hand, to discuss the changes in networks caused by incarceration and dispersal, and on the other, to address some opportunities and challenges presented by the method for the study of history.

Kekki-Network Analysis, Network Modeling, and Historical Big Data-116_a.pdf
Kekki-Network Analysis, Network Modeling, and Historical Big Data-116_c.pdf

Poster [abstract]

SuALT: Collaborative Research Infrastructure for Archaeological Finds and Public Engagement through Linked Open Data

Suzie Thomas1, Anna Wessman1, Jouni Tuominen2,3, Mikko Koho2, Esko Ikkala2, Eero Hyvönen2,3, Ville Rohiola4, Ulla Salmela4

1University of Helsinki, Department of Philosophy, History, Culture and Art Studies; 2Aalto University, Semantic Computing Research Group (SeCo); 3University of Helsinki, HELDIG – Helsinki Centre for Digital Humanities; 4National Board of Antiquities, Library, Archives and Archaeological Collections

The Finnish Archaeological Finds Recording Linked Database (Suomen arkeologisten löytöjen linkitetty tietokanta – SuALT) is a concept for a digital web service catering for discoveries of archaeological material made by the public, especially, but not exclusively, metal detectorists. SuALT, a consortium project funded by the Academy of Finland that commenced in September 2017, has key outputs at every stage of its development. Ultimately it will provide a sustainable output in the form of Linked Data, continuing to facilitate new public engagement with cultural heritage, and research opportunities, long after the project has ended.

While prohibited in some countries, metal detecting is legal in Finland, provided certain rules are followed, such as prompt reporting of finds to the appropriate authorities and avoidance of legally protected sites. Despite misgivings by some about the value of researching metal-detected finds, others have demonstrated the potential of researching such finds, for example uncovering previously unknown artefact typologies. Engaging non-professionals with cultural heritage also contributes to the democratization of archaeology, and empowers citizens. In Finland metal detecting has grown rapidly in recent years. In 2011 the Archaeological Collections registered 31 stray finds, as single objects or assemblages. In 2014, over 2700 objects were registered; in 2015, nearly 3000; in 2016, over 2500. When finds are reported correctly, their research value is significant. The Finnish Antiquities Act §16 obligates the finder of an object for which the owner is not known, and which can be expected to be at least 100 years old, to submit or report the object and associated information to the National Board of Antiquities (Museovirasto – NBA), the agency responsible for cultural heritage management in Finland. There is also a risk, as finders get older and even pass away, that their discoveries and collections will remain unrecorded and that all associated information is lost permanently.

In the current state of the art, while archaeologists increasingly use finds information and other data, utilization is still limited. Data can be hard to find, and available open data remains fragmented. SuALT will speed up the process of recording finds data. Because much of this data will come from outside formal archaeological excavations, it may shed light on sites and features not usually picked up through ‘traditional’ fieldwork approaches, such as previously unknown conflict sites. The interdisciplinary approach and inclusion of user research promote collaboration among the infrastructure’s producers, processors and consumers. By linking in with European projects, SuALT enables not only national and regional studies, but also contributes to international and transnational studies. This is significant for studies of different archaeological periods, for which the material culture usually transcends contemporary national boundaries. Ethical aspects are challenging due to the debates around engagement with metal detectorists and other artefact hunters by cultural heritage professionals and researchers, and we address head-on the wider questions around data sharing and knowledge ownership, and of working with human subjects. This includes the issues, as identified by colleagues working on similar projects elsewhere, around the concerns of metal detectorists and other finders about sharing findspot information. Finally, the usability of datasets has to be addressed, considering for example controlled vocabulary to ease object type categorization, interoperability with other datasets, and the mechanics of verification and publication processes.

The project is unique in responding to the archaeological conditions in Finland, and in providing solutions to its users’ needs within the context of Finnish society and cultural heritage legislation. While it focuses primarily on the metal detecting community, its results and the software tools developed are applicable more generally to other fields of citizen science in cultural heritage, and even beyond. For example, in many areas of collecting (e.g. coins, stamps, guns, or art), much cultural heritage knowledge as well as many collections are accumulated and maintained by skilful amateurs and private collectors. Fostering collaboration, and integrating and linking these resources with those in national memory organizations, would be beneficial to all parties involved, and points to future applications of the model developed by SuALT. Furthermore, there is scope to integrate SuALT into wider digital humanities networks such as DARIAH.

Framing SuALT’s development as a consortium enables us to ask important questions even at the development stages, with the benefit of expertise from diverse disciplines and research environments. The benefits of SuALT, aside from the huge potential for regional, national, and transnational research projects and international collaboration, are that it offers long-term savings on costs, shares expertise and provides greater sustainability than previously possible. We will explore the feasibility of publishing the finds data through international aggregation portals, such as Europeana for cultural heritage content, as well as working closely with colleagues in countries that already have established national finds databases. The technical implementation also respects the enterprise architecture of Finnish public government. Existing Open Source solutions are further developed and integrated, for example the GIS platform for geodata developed by the National Land Survey with the Linked Data based Finnish Ontology Service of Historical Places and Maps. SuALT’s data is also disseminated through Finna, a leading service for searching cultural information in Finland.

SuALT consists of three subprojects: subproject I “User Needs and Public Cultural Heritage Interactions” hosted by University of Helsinki; subproject II “National Linked Open Data Service of Archaeological Finds in Finland” hosted by Aalto University, and subproject III “Ensuring Sustainability of SuALT” hosted by the NBA.

The primary aim of SuALT is to produce an open Linked Data service which is used by data producers (namely the metal detectorists and other finders of archaeological material), by data researchers (such as archaeologists, museum curators and the wider public), and by cultural heritage managers (NBA). More specifically, the aims are:

a. To discover and analyse the needs of potential users of the resource, and to factor these findings into its development;

b. To develop metadata models and related ontologies for the data that take into account the specific needs of this particular infrastructure, informed by existing models;

c. To develop the Linked Data model in a way that makes it semantically interoperable with existing cultural heritage databases within Finland;

d. To develop the Linked Data model in a way that makes it semantically interoperable with comparable ‘finds databases’ elsewhere in Europe, and

e. To test the data resulting from SuALT through exploratory research of the datasets for archaeological research purposes for cultural heritage and collection management work.

The project corresponds closely with the strategic plans of the NBA and responds to the growth of metal detecting in Finland. Internationally, it corresponds with the development of comparable schemes in other European countries and regions, such as Flanders (MetaaldEtectie en Archeologie – MEDEA, initiated in 2014), and Denmark and the Netherlands (Digitale Metaldetektorfund or DIgital MEtal detector finds – DIME, and Portable Antiquities in the Netherlands – PAN, both initiated in 2016). It takes inspiration from the Portable Antiquities Scheme (PAS) Finds Database in England and Wales. These all aspire to an ultimate goal of a pan-European research infrastructure, and will work together to seek a larger international collaborative research grant in the future. A contribution of our work in relation to the other European projects is to employ the Linked Data paradigm, which facilitates better interoperability with related datasets, additional data enrichment based on well-defined semantics and reasoning, and therefore better means for analysing and using the finds data in research and applications.

The expected scientific impacts are that the process of developing SuALT, including critically analysing comparable resources, user group research, and creating innovative solutions, will in themselves produce a rich body of interdisciplinary academic output. This will be disseminated in peer reviewed journals and at selected conferences across several disciplinary boundaries including Computer Science, Archaeology, and Cultural Heritage Studies. It also links in, at a crucial moment in the development of digital heritage management, with parallel resources elsewhere in Europe. This means that not only can a coordinated and international approach be taken in development, but that it is extremely timely, taking advantage of the opportunity to benefit from the experiences and perspectives of colleagues pursuing similar resources. SuALT ensures that Finnish cultural heritage management is at the forefront of digital heritage. The project also carries out a small-scale ‘test’ project using the database as it forms, and in this way contributes to the field of artefact studies. The contribution to future knowledge sits at a number of levels. There are technical challenges to create the linked database in a way that complements and is interoperable with existing national and international infrastructures. Solving these challenges generates contributions to understanding digital data management and service. The process of consulting users represents an important case study in formative evaluation of particular interest groups with regard to digital heritage and citizen science, as well as shedding further light on different perceptions and uses of cultural heritage. SuALT relates to the emerging trend of publishing open science data, facilitating the analysis and reuse of the data, exemplified by e.g. DataONE ( and Open Science Data Cloud (

We hypothesise that SuALT will result in a sustainable digital data resource that responds to different user needs, and which supports high-quality archaeological research drawing on data from Finland. SuALT also enables integration with comparative data from abroad. Outputs throughout the development process represent important contributions to research into digital heritage applications and semantic computing, meeting the needs of the scientific community. The selected Linked Data methodology is suitable for archaeology and cultural heritage management due to the need to combine and connect heterogeneous data collections in the field (e.g. museum collections, finds databases abroad) and other datasets, such as vocabularies of places, persons, and time periods, benefiting cultural heritage professionals. Publishing the finds database as open data using standardised metadata formats facilitates the data’s re-use, fostering new research by the scientific community but also the development of novel applications for professionals and citizens. Taking a strategic approach to the challenge of creating this resource, and treating it as a research project, rather than developing an ad hoc resource, ensures that the project’s legacy is a significant and long-term contribution to digital curation of public-generated archaeological data.

As its key societal impact, SuALT provides a vital interface for non-professionals to contribute to and benefit from Finland’s archaeological record, and to integrate this with comparable datasets from abroad. The project enhances cooperation between non-professionals and cultural heritage managers. Careful user research ensures that SuALT offers means of engagement and access to data and other information that is usable and meaningful to a wide range of users, from metal detectorists and amateur historians, through to professional curators, cultural heritage managers, and academic researchers, domestically and abroad. SuALT’s results are not limited to metal detection but have a wider impact: the same key challenges of engaging amateur collectors to collaborate with memory organization experts in citizen science are encountered in virtually all fields of collecting and maintaining tangible and intangible cultural heritage.

The process of developing SuALT provides an unprecedented opportunity to research the use of digital platforms to engage the public with archaeological heritage in Finland. Inspired by successful initiatives such as PAS and MEDEA, the potential for individuals to self-record their finds also echoes the emerging use of crowdsourcing for public archaeology initiatives. Thus, SuALT offers a significant opportunity to contribute to further understanding digital cultural heritage and its uses, including its role within society. It is likely that the coordination of SuALT with digital finds recording initiatives in other countries will lead to a transnational platform for finds recording, giving Finland an opportunity to be at the forefront of digital heritage-based citizen science research and development.

Thomas-SuALT Collaborative Research Infrastructure for Archaeological Finds and Public Engagement through_a.pdf

Poster [abstract]

Identifying poetry based on library catalogue metadata

Hege Roivainen

University of Helsinki,

Changes in printing reflect historical turning points: what has been printed, when, where and by whom are all derivatives of contemporary events and situations. An excessive need for war propaganda brings more pamphlets out of the printing presses; university towns produce dissertations, from which scientific development can be deduced; and strict oppression and censorship might allow only religious publications by government-approved publishers. The history of printing has been extensively studied and numerous monographs exist. However, most of the research has consisted of qualitative studies based on close reading, requiring a profound knowledge of the subject matter, yet still unable to verify the extent of new innovations. For example, close reading of library catalogues does not reveal, at least easily, the timeline of Luther’s publications, or what portion of books actually were octavo-sized and when the increase in this format occurred.

One of the sources for these kinds of studies are national library metadata catalogs, which contain information about physical book size, page counts, publishers, publication places and so forth. These catalogs have been studied using quantitative analysis. The advantage of national library catalogs is that they often are more or less complete, having records of practically everything published in a certain country or linguistic area in a certain time period. The computational approach to them has enabled researchers to connect historical turning points to their effect on printing, and the impact of a new concept has been measured against the number of re-publications, or the spread, of a book introducing a new idea. What is more, linking library metadata to the full text of the books has made it possible to analyze the change in the usage of words in massive corpora, while still limiting analysis to relevant books.

In all these cases, computational methods work better the more complete the corpus is. However, library catalogues often lack annotations for one reason or another: annotating resources might have been cut at a certain point in time, or the annotation rules may have varied between different libraries in cases where catalogues have been amalgamated, or the rules could have just changed.

One area that is particularly important for subcorpora research is genre. The genre field, when annotated for each of the metadata records, could be used to restrict the corpus to contain every one of the books that are needed and nothing more. From this subset there is a possibility of drawing timelines or graphs based on bibliographic metadata, or, where full texts exist, the language or contents of a complete corpus could be analysed. Despite the significance of the genre information, that particular annotation is often lacking.

In the English Short Title Catalogue (ESTC), genre information exists for approximately one fourth of the records. This should be enough to train a machine-learning model and deduce the genre information for the remaining records, rather than relying solely on the annotations of librarians. The metadata field containing genre information in the ESTC can contain more than one value. In most cases this means having a category and its subcategories as different values, but not always. Because of the complex definition of genre in the ESTC, this paper focuses on one genre only: poetry. Besides being a relatively common genre, poetry is also of interest to literary researchers. Having a nearly complete subset of English poetry would allow for large-scale quantitative poetry analysis.

The downside to library metadata catalogues is that they contain merely the metadata, not the complete unabridged texts, which would be beneficial for machine learning. I tackled this shortcoming by creating several models, each packed with similar features within its set. The main ingredient for these feature sets was a concatenation of the main title and the subtitle from the library metadata. From these concatenations I created one feature set containing easily calculable features known from the earliest stylometric research, such as word counts and sentence lengths. Another set I collected with a bag-of-words method, taking the frequencies of the most common words from a subset of poetry book titles. I also built one set for part-of-speech (POS) tags and another for POS trigrams. Some feature sets were extracted from the other metadata fields: physical book size, page count, topic, and whether the same author had published a poetry book all proved worthy in the classification.
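The kind of title-based feature extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual implementation: the vocabulary, feature choices, and function name are all invented for the example.

```python
from collections import Counter

def title_features(title, vocab):
    """Compute simple stylometric and bag-of-words features for a book title.

    `vocab` stands in for the most common words drawn from known poetry
    titles; both it and the feature choices are illustrative only.
    """
    words = title.lower().split()
    counts = Counter(words)
    features = {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }
    # Bag-of-words features: frequency of each common poetry-title word.
    for w in vocab:
        features[f"bow_{w}"] = counts[w]
    return features

vocab = ["poem", "verses", "ode", "elegy"]
feats = title_features("An Elegy and Other Verses", vocab)
```

Feature dictionaries like this could then be fed to any standard classifier; the abstract does not specify which learning algorithm was used.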

From these feature sets I handpicked the best performing features into one superset. The resulting model performed very well: despite the compactness of the metadata, poetry books could be tracked with a precision over 90% and a recall over 86%. I then made another run with the superset to seek out the poetry books which did not have the genre field annotated in the catalogue. Combining the results of this run with close reading revealed over 14,000 unannotated poetry books. I sampled one hundred each of poetry and non-poetry books to manually estimate the correctness of the predictions and uncovered an annotation bias in the catalogue. The bias seems to come from the fact that the genre information has been annotated more frequently for broadside poetry books than for other broadsides. Excluding broadsides from my samples, I got a recall of 94% and a precision of 98%.

My research strongly suggests that semi-supervised learning can be applied to library catalogues to fill in missing annotations, but this requires close attention to avoid possible pitfalls.

Roivainen-Identifying poetry based on library catalogue metadata-258_a.pdf

Poster [publication ready]

Open Digital Humanities: International Relations in PARTHENOS

Bente Maegaard

University of Copenhagen, CLARIN ERIC

One of the strong instruments for the promotion of Open Science in Digital Humanities is research infrastructures. PARTHENOS is a European research infrastructure project, built primarily upon collaboration between the two large research infrastructures in the humanities, CLARIN and DARIAH, plus a number of other initiatives. PARTHENOS aims at strengthening the cohesion of research in the broad sector of Linguistic Studies, Humanities, Cultural Heritage, History, Archaeology and related fields. This is the context in which we should see the efforts related to international liaisons. This effort takes its point of departure in the existing international relations, so the first action was to collect information and to analyse it along different dimensions. Secondly, we want to analyse the purpose and aims of international collaboration. There are many ideas about how the international network may be strengthened and exploited, so that higher quality is obtained and more data, tools and services are shared. The main task of the next year will be to first agree on a strategy and then implement it in collaboration with the rest of the project. By doing so, the PARTHENOS partners will be contributing even more to the European Open Science Policies.

Maegaard-Open Digital Humanities-120_a.pdf

Poster [abstract]

The New Face of Ethnography: Utilizing Cyberspace as an Alternative Study Site

Karen Lisa Deeming

University of California, Merced,

American adoption has a familiar mission to find families for children but becomes strange when turned on its head and exposed as an institution that instead finds children for families who are willing to pay any price for a child. Its evolution, from orphan trains to open adoptions, has answered questions about biological associations but has conflated the interconnection of identity with conflicting narratives of community, kinship and self. How do the experiences of the adoption constellation reconceptualize the national image of adoption as a win-win solution to a social problem? My research explores the language utilized in multiple adoption narratives to determine individual and universal feelings that adoptees, birth parents, and adoptive parents experience regarding the transfer of children in the United States and the long term emotional outcomes for these groups. My unique approach to ethnographic research includes a hybrid digital and humanistic approach using online and offline interactions to gather data.

As is the case with all methodology, online ethnography presents both benefits and problems. On the plus side, online communities break down the walls of networks, creating digitally mediated social spaces. The Internet provides a platform for social interactions where real and virtual worlds shift and conflate. Social interactions in cybernetic environments present another option for social researchers and offer significant advantages for data collection, collaboration, and maintenance of research relationships. For some research subjects, such as members of the adoption constellation, locating target groups presents challenges for domestic adoption researchers. Online groups such as Facebook pages dedicated to specific members of the adoption triad offer a resolution to this challenge, acting as self-sorted focus groups with participants eager to provide their narratives and experiences. Ethnography involves understanding how people experience their lives through observation and non-directed interaction, with a goal of observing participants’ behavior and reactions on their own terms; this can be achieved through the presumed anonymity of online interaction. Electronic ethnography provides valuable insights and data; however, on the negative side, the danger of groupthink in Facebook communities can both attract and generate homogeneous experiences regarding adoption issues. I argue that the benefits of online ethnography outweigh the problems and can provide important, previously unexpressed views to better analyze topics such as the adoption experience, as cyberspace remains a fluid yet stable alternate social space.

Deeming-The New Face of Ethnography-126_a.docx

Late-Breaking Work

Elias Lönnrot Letters Online

Kirsi Keravuori, Maria Niku

Finnish Literature Society

The correspondence of Elias Lönnrot (1802–1884, doctor, philologist and creator of the national epic Kalevala) comprises 2,500 letters or drafts written by Lönnrot and 3,500 letters received. Elias Lönnrot Letters Online, first published in April 2017, is the conclusion of several decades of research, of transcribing and digitizing letters, and of writing commentaries. The online edition is designed not only for those interested in the life and work of Lönnrot himself, but more generally for scholars and the general public interested in the work and mentality of the Finnish 19th-century nationalistic academic community, their language practices both in Swedish and in Finnish, and in the study of epistolary culture. The rich, versatile correspondence offers source material for research in biography, folklore studies and literary studies; for general history as well as medical history and the history of ideas; for the study of ego documents and networks; and for corpus linguistics and the history of language.

As of January 2018, the edition contains about 2,000 letters and drafts of letters sent by Lönnrot. These are mostly private letters. The official letters, such as the medical reports submitted by Lönnrot in his office as a physician, will be added during 2018. The final stage will involve finding a suitable way of publishing the approximately 3,500 letters that Lönnrot received.

The edition is built on the open-source publishing platform Omeka. Each letter and draft is published as facsimile images and an XML/TEI5 file, which contains metadata and a transcription. The letters are organised into collections according to recipient, with the exception of, for example, Lönnrot's family letters, which are published in a single collection. An open text search covers the metadata and transcriptions. This is a faceted search powered by Apache Solr, which allows limiting the initial search by collection, date, language, type of document and writing location. In addition, Omeka's own search can be used to find letters based on a handful of metadata fields.

The solutions adopted for the Lönnrot edition differ in some respects from the established practices of digital publishing of manuscripts in the humanities. In particular, the TEI encoding of the transcriptions is lighter than in many other scholarly editions. Lönnrot's own markings – underlinings, additions, deletions – and unclear and indecipherable sections in the texts are encoded, but place and personal names are not. This is partially due to the extensive amount of work such detailed encoding would require, partially because the open text search provides quick and easy access to the same information.

The guiding principle of Elias Lönnrot Letters is openness of data. All the data contained in the edition is made openly available.

Firstly, the XML/TEI5 files are available for download, and researchers and other users are free to modify them for their own purposes. The users can download the XML/TEI5 files of all the letters, or of a smaller section such as an individual collection. The feature is also integrated in the open text search, and can be used both for all the results produced by a search and a smaller section of the results limited by one or more facets. Thus, an individual researcher can download the XML files of the letters and study them for example with the linguistic tools provided by the Language Bank of Finland. Similarly, the raw data is available for processing and modifying by those researchers who use and develop digital humanities tools and methods to solve research questions.

Secondly, the letter transcriptions are made available for download as plain text. Data in this format is needed for qualitative analysis tools like Atlas. In addition, researchers in humanities do not all need XML files but will benefit from the ability to store relevant data in an easily readable format.

Thirdly, users of the edition can export the statistical data contained in the facet listing of each search result for processing and visualization with tools like Excel. Statistical data like this is significant in handling large masses of data, as it can reveal aspects that would remain hidden when examining individual documents. For example, it may be relevant to a researcher in what era and with whom Lönnrot primarily discussed a given theme. The statistical data of the facet search readily reveals such information, while compiling such statistics by manually going through thousands of letters would be an impossibly long process.
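As an illustration of how the downloadable XML/TEI5 files mentioned above might be processed, the sketch below parses a minimal, invented TEI fragment with Python's standard library. The element structure and letter title here are a generic TEI skeleton made up for the example; the edition's actual files may be organised differently.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical TEI fragment standing in for a downloaded letter file.
TEI = "http://www.tei-c.org/ns/1.0"
sample = f"""<TEI xmlns="{TEI}">
  <teiHeader><fileDesc><titleStmt>
    <title>Letter to J. L. Runeberg</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body><p>Hyvä veli, ...</p></body></text>
</TEI>"""

root = ET.fromstring(sample)
ns = {"tei": TEI}

# Pull the title from the header and the transcription paragraphs from the body.
title = root.find(".//tei:titleStmt/tei:title", ns).text
paragraphs = [p.text for p in root.findall(".//tei:body/tei:p", ns)]
```

Extracting plain text in this way is one route to the linguistic-tool workflows (e.g. the Language Bank of Finland) that the edition is meant to support.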

The easy availability of data in Elias Lönnrot Letters Online will hopefully foster collaboration and enrich research in general. The SKS is already collaborating with FIN-CLARIN and the Language Bank, which have received the XML/TEI5 files. As Lönnrot's letters form an exceptionally large collection of manuscripts written by one hand, a section of the letters together with their transcriptions was given to the international READ project, which is working to develop machine recognition of old handwritten texts. A third collaborating partner is the project "STRATAS – Interfacing structured and unstructured data in sociolinguistic research on language change".

Keravuori-Elias Lönnrot Letters Online-276.pdf

Late-Breaking Work

KuKa Digi -project

Tiina H. Airaksinen, Anna-Leena Korpijärvi

University of Helsinki

This poster presents a sample of the Cultural Studies BA program's Digital Leap project, called KuKa Digi. The Digital Leap is a university-wide project that aims to support digitalization in both learning and teaching in the new degree programs at the University of Helsinki. For more information on the University of Helsinki's Digital Leap program, please refer to: . The new Bachelor's Program in Cultural Studies was among the projects selected for the 2018-2019 round of the Digital Leap. The primary goal of the KuKa Digi project is to produce meaningful digital material for both teaching and learning purposes. The project aims to develop the program's courses, learning environments and materials in a more digital direction. Another goal is to produce an introductory MOOC course on Cultural Studies for university students, as well as students studying for their A-levels who may be planning to apply for the Cultural Studies BA program. Finally, we will write a research article to assess the use of digital environments in teaching and learning processes within the Cultural Studies BA program. The KuKa Digi project encourages students and teachers to co-operatively plan digital learning environments that are also useful in building up students' academic portfolios and enhancing their working life skills.

The core idea of the project is to create a digital platform or database for teachers, researchers and students in the field of Cultural Studies. Academic networking sites do exist, but they are not without issues. Many of them are either not accessible or not very useful for students who have not yet developed their academic careers very far. In addition, some of these sites are only partially free of charge. The digital platform will act as a place where students, teachers and researchers alike can have the opportunity to network, advertise their expertise and specialization, and come into contact with the media, cultural agencies, companies and much more. The general vision for this platform is that it will be user-friendly and flexible, acting as an "academic LinkedIn". The database will be available in Finnish, Swedish and English. It will include the current students, teachers and experts who are associated with the program. Furthermore, the platform will include a feature called the digital portfolio. This will be especially useful for our students, as it is intended to be a digital tool with which they can develop their own expertise within the field of Cultural Studies. Finally, the portfolio will act as a digital business card for the students. The project poster presented at the conference illustrates the ideas and concepts for the platform in more detail.

For more information on the project and its other goals, please refer to the project blog at:

Airaksinen-KuKa Digi -project-277.pdf

Late-Breaking Work

Topic modelling and qualitative textual analysis

Karoliina Isoaho, Daria Gritsenko

University of Helsinki,

The pursuit of big data is transforming qualitative textual analysis—a laborious activity that has conventionally been executed manually by researchers. Access to data of unprecedented scale and scope has created a need to both analyse large data sets efficiently and react to their emergence in a near-real-time manner (Mills, 2017). As a result, research practices are also changing. A growing number of scholars have experimented with using machine learning as the main or complementary method for text analysis. Even if the most audacious assumptions ‘on the superior forms of intelligence and erudition’ of big data analysis are today critically challenged by qualitative and mixed-method researchers (Mills, 2017: 2), it is imperative for scholars using qualitative methods to consider the role of computational techniques in their research (Janasik, Honkela and Bruun, 2009). Social scientists are especially intrigued by the potential of topic modelling (TM), a machine learning method for big data analysis (Blei, 2012), as a tool for analysis of textual data.

This research contributes to a critical discussion in social science methodologies: how topic modeling can concretely be incorporated into existing processes of qualitative textual analysis and interpretation. Some recent studies have paid attention to the methodological dimensions of TM vis-à-vis textual analysis. However, these developments remain sporadic, exemplifying a need for a systematic account of the conditions under which TM can be useful for social scientists engaged in textual analysis. This paper builds upon the existing discussions and takes a step further by comparing the assumptions, analytical procedures and conventional usage of qualitative textual analysis methods and TM. Our findings show that for content and classification methods, embedding TM into the research design can partially and, arguably, in some cases fully automate the analysis. Discourse and representation methods can be augmented with TM in a sequential mixed-method research design.

Summing up, we see avenues for TM both in embedded and sequential mixed-method research design. This is in line with previous work on mixed-method research that has challenged the traditional assumption of there being a clear division between qualitative and quantitative methods. Scholarly capacity to craft a robust research design depends on researchers’ familiarity with specific techniques, their epistemological assumptions, and good knowledge of the phenomena that are being investigated to facilitate the substantial interpretation of the results. We expect this research to help identify and address the critical points, thereby assisting researchers in the development of novel mixed-method designs that unlock the potential of TM in qualitative textual analysis without compromising methodological robustness.

Blei, D. M. (2012) ‘Probabilistic topic models’, Communications of the ACM, 55(4), p. 77.

Janasik, N., Honkela, T. and Bruun, H. (2009) ‘Text Mining in Qualitative Research’, Organizational Research Methods, 12(3), pp. 436–460.

Mills, K. A. (2017) ‘What are the threats and potentials of big data for qualitative research?’, Qualitative Research, p. 146879411774346.

Isoaho-Topic modelling and qualitative textual analysis-278.pdf

Late-Breaking Work

Local Letters to Newspapers - Digital History Project

Heikki Kokko

University of Tampere, The Centre of Excellence in the History of Experiences (HEX)

The Local Letters to Newspapers is a digital history project of the Academy of Finland Centre of Excellence in the History of Experiences HEX (2018–2025), hosted by the University of Tampere. The objective is to make available a new kind of digital research material on 19th- and early 20th-century Finnish society. The aim is to introduce a database of readers' letters submitted to the Finnish press that can be studied both qualitatively and quantitatively. The database will allow analyzing 19th- and 20th-century global reality through a case study of Finnish society. It will enable a wide range of research topics and open a path to various research approaches, especially the study of human experiences.

Kokko-Local Letters to Newspapers-279.pdf

Late-Breaking Work

Lessons Learned from Historical Pandemics: Using crowdsourcing 2.0 and Citizen Science to map the Spanish Flu's spatial and social network.

Søren Poder

Aarhus City Archives

By Søren K. Poder, MA in History, and Astrid Lykke Birkving, MA in Intellectual History

Aarhus City Archives | Redia a/s

In 1918 the world was struck by the most devastating disease in recorded history – today known as the Spanish Flu. In less than one year nearly two thirds of the world's population came down with influenza, of whom between forty and one hundred million people died.

The Spanish Flu of 1918 did not originate in Spain, but most likely on the North American east coast in February 1918. By the middle of March, the influenza had spread to most of the overcrowded American army camps, from where it was soon carried to the trenches in France and the rest of the world. This part of the story is well known. In contrast, the diffusion of the 1918 pandemic – and the seasonal epidemics, for that matter – on the regional and local level is still largely obscure. For instance, explanations of why epidemics evidently tend to follow significantly different paths in different urban areas that otherwise seem to share a common social, commercial and cultural profile tend to be more theoretical than based on evidence. For one sole reason – the lack of adequate data.

As part of the continuing scientific interest in historical epidemics, the purpose of this research project is to identify the social, economic and cultural preconditions that most likely determine a given type of locality's ability to spread or halt an epidemic's hierarchical diffusion.

Crowdsourcing 2.0

To this end, large amounts of data from a variety of different historical sources have to be collected and linked together. To do this we use traditional crowdsourcing techniques, where volunteers participate in transcribing different historical documents: death certificates, censuses, patient charts etc. Just as importantly, the collected transcriptions form the basis for a text recognition ML module that in time will be able to recognize specific entities in a document – persons, places, diagnoses, dates etc.

Late-Breaking Work

Analysing Swedish Parliamentary Voting Data

Jacobo Rouces, Nina Tahmasebi, Lars Borin, Stian Rødven Eide

University of Gothenburg,

We used publicly available data from voting sessions in the Swedish Parliament to represent each member of parliament (MP) as a vector in a space defined by their voting record between the years 2014 and 2017. We then applied matrix factorization techniques that enabled us to find insightful projections of this data. Namely, they allowed us to assess the level of clustering of MPs along party lines while at the same time identifying MPs whose voting records are closer to those of other parties. They also provided a data-driven multi-dimensional political compass that makes it possible to ascertain similarities and differences between MPs and political parties. Currently, the axes of the compass are unlabeled and therefore lack a clear interpretation, but we plan to apply language technology to the parliamentary discussions associated with the voting sessions in order to identify the topics associated with these axes.
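The general approach can be illustrated with a toy example: encode each MP's votes as a vector, center the matrix, and factorize it, using the leading components as compass axes. The voting data and party labels below are invented, and NumPy's SVD stands in for whatever factorization technique the authors actually used.

```python
import numpy as np

# Toy voting matrix: rows are MPs, columns are votes (1 = yes, -1 = no).
votes = np.array([
    [ 1,  1, -1, 1],   # MP from (fictional) party A
    [ 1,  1, -1, 1],   # MP from party A
    [-1, -1,  1, 1],   # MP from party B
    [-1, -1,  1, 1],   # MP from party B
])

# Center the matrix and factorize it; the scaled left singular vectors
# give each MP coordinates in a data-driven "political compass".
centered = votes - votes.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
compass = U[:, :2] * S[:2]   # project MPs onto the top two axes
```

On this toy data, MPs of the same party land at identical coordinates on the first axis, while the two parties are well separated, which is the kind of party-line clustering the abstract describes.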

Late-Breaking Work

Automated Cognate Discovery in the Context of Low-Resource Sami Languages

Eliel Soisalon-Soininen, Mika Hämäläinen

University of Helsinki

1 Introduction

The goal of our project is to automatically find candidates for etymologically related words, known as cognates, for different Sami languages. At first, we will focus on North Sami, South Sami and Skolt Sami nouns by comparing their inflectional forms with each other. The reason why we look at the inflections is that, in Uralic languages, it is common that there are changes in the word stem when the word is inflected in different cases. When finding cognates, the non-nominative stems might reveal more about a cognate relationship in some cases. For example, the South Sami word for arm, gïete, is closer to the partitive kättä of the Finnish word than to the nominative form käsi of the same word.

The fact that a great deal of previous work already exists related to etymologies of words in different Sami languages [2, 4, 8] provides us with an interesting test bed for developing our automatic methods. The results can easily be validated against databases such as Álgu [1], which incorporates results of different studies in Sami etymology in a machine-readable database.

With the help of a gold corpus, such as Álgu, we can perfect our method to function well in the case of the three aforementioned Sami languages. Later, we can expand the set of languages to other Uralic languages such as Erzya and Moksha. This is achievable as we are basing our method on the data and tools developed in the Giellatekno infrastructure [11] for Uralic languages. Giellatekno has a harmonized set of tools and dictionaries for around 20 different Uralic languages, allowing us to bootstrap more languages into our method.

2 Related Work

In historical linguistics, cognate sets have been traditionally identified using the comparative method, the manual identification of systematic sound correspondences across words in pairs of languages. Along with the rapid increase in digitally available language data, computational approaches to automate this process have become increasingly attractive.

Computationally, automatic cognate identification can be considered a problem of clustering similar strings together, according to pairwise similarity scores given by some distance metric. Another approach to the problem is pairwise classification of word pairs as cognates or non-cognates. Examples of common distance metrics for string comparison include edit distance, longest common subsequence, and Dice coefficient.

The string edit distance is often used as a baseline for word comparison, measuring word similarity simply as the number of character or phoneme insertions, deletions, and substitutions required to make one word equivalent to the other. However, in language change, certain sound correspondences are more likely than others. Several methods rely on such linguistic knowledge by converting sounds into sound classes according to phonetic similarity [?]. For example, [15] consider a pair of words to be cognates when they match in their first two consonant classes.
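As a concrete baseline, the plain edit distance can be computed with a short dynamic program. The sketch below is the generic Levenshtein algorithm, not the project's own implementation, applied to the South Sami and Finnish word forms quoted in the introduction.

```python
def edit_distance(a, b):
    """Levenshtein distance: the number of character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]

# South Sami 'gïete' vs. the Finnish partitive 'kättä' and nominative 'käsi'.
d_partitive = edit_distance("gïete", "kättä")    # 4
d_nominative = edit_distance("gïete", "käsi")    # 5
```

Even this crude metric reproduces the introduction's observation: the partitive stem is closer to the South Sami form than the nominative is.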

In addition to such heuristics, a common approach to automatic cognate identification is to use edit distance metrics with weightings based on previously identified regular sound correspondences. Such correspondences can also be learned automatically by aligning the characters of a set of initial cognate pairs [3, 7]. In addition to sound correspondences, [14] and [6] also utilise semantic information of word pairs, as cognates tend to have similar, though not necessarily equivalent, meaning. Another method heavily reliant on prior linguistic knowledge is the LexStat method [9], requiring a sound correspondence matrix and semantic alignment.

However, in the context of low-resource languages, prior linguistic knowledge such as initial cognate sets, semantic information, or phonetic transcriptions is rarely available. Therefore, cognate identification for low-resource languages calls for unsupervised approaches. For example, [10] address this issue by investigating edit distance metrics based on embedding characters into a vector space, where character similarity depends on the set of characters they co-occur with. In addition, [12] investigate several unsupervised approaches such as hidden Markov models and pointwise mutual information, while also combining these with heuristic methods for improved performance.

3 Corpus

The initial plan is to base our method on the nominal XML dictionaries for the three Sami languages available on the Giellatekno infrastructure. Apart from translations, these dictionaries also contain additional lexical information to a varying degree. The types of additional information that might benefit our research goals are cognate relationships, semantic tags, morphological information, derivations, and example sentences.

For each noun in the noun dictionaries, we produce a list of all its inflections in the different grammatical numbers and cases. This is done using UralicNLP [5], a Python library specialized in NLP for Uralic languages. UralicNLP uses finite-state transducers (FSTs) from the Giellatekno infrastructure to produce the different morphological forms.

We are also considering the possibility of including larger text corpora in these languages as part of our method for finding cognates. However, these languages have notoriously small corpora available, which might render them insufficient for our purposes.

4 Future Work

Our research is currently in its early stages. The immediate next task is to start implementing different methods based on the previous research. We will first start with edit distance approaches to see what kind of information they can reveal, and move towards more complex solutions from there.

A longer-term plan is to include more languages in the research. We are also interested in collaborating with linguists who could take a more qualitative look at the cognates found by our method. This will nourish interdisciplinary collaboration and the exchange of ideas between scholars of different backgrounds.

We are also committed to releasing the results produced by our method for a wider audience to use and profit from. This will be done by including the results in the XML dictionaries in the Giellatekno infrastructure and by releasing them in an open-access MediaWiki-based dictionary for Uralic languages [13] developed at the University of Helsinki.


1. Álgu-tietokanta. Saamelaiskielten etymologinen tietokanta (Nov 2006)

2. Aikio, A.: The Saami loanwords in Finnish and Karelian. Ph.D. thesis, University of Oulu, Faculty of Humanities (2009)

3. Ciobanu, A.M., Dinu, L.P.: Automatic detection of cognates using orthographic alignment. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). vol. 2, pp. 99–105 (2014)

4. Häkkinen, K.: Suomen kirjakielen saamelaiset lainat. In: Sámit, sánit, sátnehámit. Riepmočála Pekka Sammallahtii miessemánu 21, 161–182 (2007)

5. Hämäläinen, M.: UralicNLP (Jan 2018), doi: 10.5281/zenodo.1143638

6. Hauer, B., Kondrak, G.: Clustering semantically equivalent words into cognate sets in multilingual lists. In: Proceedings of 5th international joint conference on natural language processing. pp. 865–873 (2011)

7. Kondrak, G.: Identification of cognates and recurrent sound correspondences in word lists. TAL 50(2), 201–235 (2009)

8. Koponen, E.: Lappische Lehnwörter im Finnischen und Karelischen. Lapponica et Uralica. 100 Jahre finnisch-ugrischer Unterricht an der Universität Uppsala. Vorträge am Jubiläumssymposium 20.–23. April 1994, pp. 83–98 (1996)

9. List, J.M., Greenhill, S.J., Gray, R.D.: The potential of automatic word comparison for historical linguistics. PLoS ONE 12(1), e0170046 (2017)

10. McCoy, R.T., Frank, R.: Phonologically informed edit distance algorithms for word alignment with low-resource languages. Proceedings of

11. Moshagen, S.N., Pirinen, T.A., Trosterud, T.: Building an open-source development infrastructure for language technology projects. In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013); May 22-24; 2013; Oslo University; Norway. NEALT Proceedings Series 16. pp. 343–352. No. 85, Linköping University Electronic Press; Linköpings universitet (2013)

12. Rama, T., Wahle, J., Sofroniev, P., Jäger, G.: Fast and unsupervised methods for multilingual cognate clustering. arXiv preprint arXiv:1702.04938 (2017)

13. Rueter, J., Hämäläinen, M.: Synchronized mediawiki based analyzer dictionary development. In: Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages. pp. 1–7 (2017)

14. St Arnaud, A., Beck, D., Kondrak, G.: Identifying cognate sets across dictionaries of related languages. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 2519–2528 (2017)

15. Turchin, P., Peiros, I., Gell-Mann, M.: Analyzing genetic connections between languages by matching consonant classes. Vestnik RGGU. Seriya ”Filologiya. Voprosy yazykovogo rodstva” (5 (48)) (2010)

Soisalon-Soininen-Automated Cognate Discovery in the Context of Low-Resource Sami Languages-285.pdf

Late-Breaking Work

Dissertations from Uppsala University 1602-1855 on the internet

Anna Cecilia Fredriksson

Uppsala University, Uppsala University Library

At Uppsala University Library, a long-term project is under way that aims to make the dissertations (i.e. theses) submitted at Uppsala University in 1602–1855 easy to find and read on the Internet. The work includes metadata production, scanning and OCR processing, as well as publication of images of the dissertations as full-text searchable PDF files. So far, approximately 3,000 dissertations have been digitized and made accessible on the Internet via the DiVA portal, Uppsala University’s repository for research publications. All in all, there are about 12,000 dissertations, of about 20 pages each on average, to be scanned. This work is done by hand, due to the age of the material. The project aims to be completed in 2020.

Why did we prioritize dissertations?

Even before the project started, dissertations were valued research material, and the physical dissertations were frequently on loan. Their popularity was primarily due to the fact that studying university dissertations is generally an excellent way to study developments and changes in society. In the same way as doctoral theses do today, the older dissertations reflect what was going on in the country, at the University, and in the intellectual Western world as a whole at a certain period of time. Their great mass makes them especially suitable for comparative and longitudinal studies, and provides excellent chances for scholars to find material little used, or not used at all, in previous research.

Older Swedish dissertations, including those from today’s Finland, are also comparatively easy to find. In contrast to many other European libraries with an even longer history, collectors published bibliographies of Swedish dissertations as far back as 250 years ago. Our dissertations are also organized, bound and physically easily accessible. Last year the cataloguing of the Uppsala dissertations according to modern standards was completed in LIBRIS. That made them searchable by subject and by words in the title, which was not possible before. All this made the digitization process smoother than that of many other kinds of cultural heritage material. The digital publication of the dissertations naturally made access to them even easier for University staff and students as well as lifelong learners in Sweden and abroad.

How are the dissertations used today?

In actual research today, we see that the material is frequently consulted in all fields of history. Dissertations provide scholars in the fields of history of ideas and history of science with insight into the status of a certain subject matter in Sweden in various periods of time, often in relation to the contemporary discussion on the European continent. The same goes for studies in history of literature and history of religion. Many of the dissertations examine subjects that remain part of the public debate today, and are therefore of interest for scholars in the political and social sciences. The languages of the dissertations are studied by scholars of Semitic, Classical and Scandinavian languages, and the dissertations often contain the very first editions and translations of certain ancient manuscripts in Arabic and Runic script. There is also a social dimension of the dissertations worthy of attention, as dedications and gratulatory poems in the dissertations mirror social networks in the educated stratum of Sweden in various periods of time. Illustrations in the dissertations were often made by local artists or the students themselves, and the great mass of gratulatory poems mirrors the less well-known side of poetry in early modern Sweden.

Our users

The users of the physical items are primarily university scholars, mostly from our own University, but there is also a great deal of interest from abroad, not least from our neighboring country Finland and from the Baltic States, which were for some time part of the Swedish realm. Many projects are going on right now that include our dissertations as research material or have them as their primary source material – Swedish projects as well as international ones. As Sweden, as a part of learned Europe, more or less shared the values, objects and methods of the Western academic world as a whole, to study Swedish science and scholarship is to study an important part of Western science and scholarship.

As for who uses our digital dissertations, we in fact do not know. The great majority of the dissertations are written in Latin since, as in all countries of Europe and North America, Latin was the vehicle of academic discussion in the early modern age. In the first half of the 19th century, Swedish became more common in the Uppsala dissertations. Among the ones digitized and published so far, a great deal are in Swedish. As for the Latin ones, they too are clearly much used. Although knowledge of Latin is quite unusual in Sweden, foreign scholars in the various fields of history have often had Latin as part of their curriculum. Evidently, our users know at least enough Latin to recognize whether a passage treats the topic of their interest. They can also identify which documents are important to them and extract the most important information from them. If a document is central, it is possible to hire a translator.

But we believe that we also reach the lifelong learners, or so-called “ordinary people”. The older dissertations examine every conceivable subject and offer pleasant reading even for non-specialists, or for people who use the Internet for genealogical research. The full-text publication makes a dissertation show up, perhaps unexpectedly, when a person is looking for a certain topic or a certain word. Whoever the users are, the digital publication of the dissertations has been well received, far beyond expectations – even though we neither offer nor demand advanced technologies for the use of these dissertations, or perhaps precisely because we do not. The first three test years, with approximately 2,500 digitized dissertations published, resulted in close to one million visits and over 170,000 downloads, i.e. over 4,700 per month.

The digital publication and the new possibilities for research

The database in which the dissertations are stored and presented is the same database in which researchers, scholars and students of Uppsala University, and of other Swedish universities too, currently register their publications, with the option to publish them digitally. This clears a path for new possibilities for researchers to become aware of and study the texts. Most importantly, it enables users to find documents in their field spanning a period of 400 years in one search session. A great deal of the medical terms for diseases and body parts, chemical designations, and, of course, juridical and botanical terms are Latin and the same as were used 400 years ago, and can thus be used for localizing text passages on these topics. But the form of the text can be studied, too. Linguists would find it useful to make quantitative studies of the use of certain words or expressions, or simply to find the words of interest for further studies. The usefulness of full-text databases is well known to us all. But as a user, one often gets either a well-working search system or a great mass of important texts, and seldom both. This problem is solved here by the interconnection between the publication database DiVA and the Swedish National Research Library System LIBRIS. The combination makes it possible to use an advanced search system with high functionality, thus reducing the Internet problem of too many irrelevant hits. It gives direct access to the digital full text in DiVA, and the option to order the physical book if the scholar needs to see the original at our library. Not least important, there is qualified staff appointed to care for the system’s long-term maintenance and updates as part of their everyday tasks at the University Library. Also, the library is open for discussion with users.

The practical work within the project and related issues

As part of the digitization project, the images of the text pages are OCR-processed in order to create searchable full-text PDF files. The OCR process gives varying results depending on the age and the language of the text. The OCR processing of dissertations in Swedish and Latin from ca. 1800 onwards results in OCR texts with a high degree of accuracy, that is, between 80 and 90 per cent, whereas older dissertations in Latin and in languages written in other alphabets contain more inaccuracies. On this point we are not satisfied. Near-perfect OCR-read text, or proof-reading, is a basic requirement for the full use and potential of this material. However, in this respect we are dependent upon the technology available on the market, as this provides the best and safest product. These products were not developed for handling printing types of various sorts and sizes from the 17th and 18th centuries, and the development of these techniques, except when it comes to “Fraktur”, is slow or non-existent.

If you want to pursue further studies of the documents, you can download the documents for free to your own computer. There are free programs on the Internet that help you merge several documents of your choice into one document, in order for you to be able to search through a certain mass of text. If you are searching for something very particular, you could of course also perform a word search in Google. One of our wishes for the future is to make it possible for our users to search in several documents of their specific choice at one time, without them having to download the documents to their computer.

So, most important for us today within the dissertation project:

1) Better OCR for older texts

2) Easier ways to search in a large text mass of your own choice.

Future use and collaboration with scholars and researchers

The development of digital techniques for the further use of these texts is a future desideratum. We therefore aim to increase our collaboration with researchers who want to explore new methods to make more out of the texts. However, we always have to take into account the special demands that society places on the work we, as a state institution, are conducting – in contrast to the work conducted by e.g. Google Books or research projects with temporary funding.

We are expected to produce both images and metadata of a reasonably high quality – a product that the University can ‘stand for’. What we produce should have a lasting value – and ideally be possible to use for centuries to come.

What we produce should be compatible with other existing retrieval systems and library systems. Important, in my opinion, are reliability and citability. A great problem with research on born-digital material is, in my opinion, that it constantly changes, with respect both to its contents and to where to find it. This puts the fundamental principle of modern science, the possibility to verify results, out of the running. This is a challenge for Digital Humanities which, at the current pace of development, will surely be solved in the near future.

Fredriksson-Dissertations from Uppsala University 1602-1855 on the internet-282.pdf

Late-Breaking Work

Normalizing Early English Letters for Neologism Retrieval

Mika Hämäläinen, Tanja Säily, Eetu Mäkelä

University of Helsinki


Our project studies social aspects of innovative vocabulary use in early English letters. In this abstract we describe the current state of our method for detecting neologisms. The problem we are facing at the moment is that our corpus consists of non-normalized text. Therefore, spelling normalization is the first problem we need to solve before we can apply automatic methods to the whole corpus.


We use CEEC (Corpora of Early English Correspondence) [9] as the corpus for our research. The corpus consists of letters ranging from the 15th century to the 19th century and it represents a wide social spectrum, richly documented in the metadata associated with the corpus, including information on e.g. socioeconomic status, gender, age, domicile and the relationship between the writer and recipient.

Finding Neologisms

In order to find neologisms, we use the earliest attestation dates of words recorded in the Oxford English Dictionary (OED) [10]. Each lemma in the OED has information not only about its attestations, but also about variant spelling forms and inflections.

We proceed as follows in automatically finding neologism candidates. We get a list of all the individual words in the corpus and retrieve their earliest attestations from the OED. If we find a letter where a word has been used before the earliest attestation recorded in the OED, we are dealing with a possible neologism, such as the word "monotonous" in (1), which antedates the first attestation date given in the OED by two years (1774 vs. 1776).

(1) How I shall accent & express, after having been so long cramped with the monotonous impotence of a harpsichord! (Thomas Twining to Charles Burney, 1774; TWINING_017)
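The dating comparison behind this can be sketched as follows; the attestation table and data structures here are invented stand-ins for the OED lookup, not the project's actual code:

```python
# Invented stand-in for the OED's earliest-attestation data;
# only the "monotonous" entry reflects a date mentioned in the text.
OED_FIRST_ATTESTED = {"monotonous": 1776}

def neologism_candidates(corpus_tokens):
    """corpus_tokens: iterable of (lemma, letter_year) pairs.
    A token is a candidate if its letter antedates the OED date."""
    hits = []
    for lemma, year in corpus_tokens:
        first = OED_FIRST_ATTESTED.get(lemma)
        if first is not None and year < first:
            hits.append((lemma, year, first))
    return hits
```

On a toy token list, `("monotonous", 1774)` would be flagged as a candidate, while a 1800 occurrence would not.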

The problem, however, is that our corpus consists of texts written in different time periods, which means that there is a wide range of alternative spellings for words. Therefore, a great part of the corpus cannot be directly mapped to the OED.

Normalizing with the Existing Methods

Part of the CEEC (from the 16th century onwards) has been normalized with VARD2 [3] in a semi-automated manner; however, the automatic normalization is only applied to sufficiently frequent words, whereas neologisms are often rare words. We take these normalizations and extrapolate them over the whole corpus. We also used MorphAdorner [5] to produce normalizations for the words in the corpus. After this, we compared the newly normalized forms with those in the OED, taking into account the variant forms listed in the OED. NLTK's [4] lemmatizer was used to produce lemmas from the normalized inflected forms in order to map them to the OED. In doing so, we were able to map 65,848 word forms of the corpus to the OED. However, 85,362 word forms still remain without a mapping to the OED.
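The mapping step described above can be roughly sketched as follows; all names and data structures are hypothetical illustrations of the pipeline, not the authors' implementation:

```python
def map_corpus_to_oed(word_forms, normalizations, oed_index):
    """Map historical word forms to OED lemmas.

    normalizations: word form -> modern spelling (e.g. from VARD2
                    or MorphAdorner output)
    oed_index: modern spelling or listed variant -> OED lemma
    """
    mapped, unmapped = {}, []
    for w in word_forms:
        norm = normalizations.get(w, w)  # fall back to the form itself
        lemma = oed_index.get(norm)
        if lemma is not None:
            mapped[w] = lemma
        else:
            unmapped.append(w)
    return mapped, unmapped
```

Forms left in `unmapped` are exactly the residue that the approaches in the next section try to normalize.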

Different Approaches

For the remaining non-normalized words, we have tried a number of different approaches.

- Rules

- Machine translation

- Edit distance, semantics and pronunciation

The simplest of these is running the hand-written VARD2 normalization rules over the whole corpus. These are simple replacement rules that replace a sequence of characters with another one either at the beginning, end or middle of a word. An example of such a rule is replacing "yes" with "ies" at the end of a word.
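Such replacement rules are easy to mimic. In the sketch below, only the "yes" → "ies" rule comes from the text; the u/v rule is an illustrative addition, and the implementation is ours, not VARD2's:

```python
import re

# (pattern, replacement) pairs; the regex anchors restrict each rule
# to the end or the beginning of a word. Only the "yes" -> "ies" rule
# comes from the text; the u/v rule is an illustrative addition.
RULES = [
    (r"yes$", "ies"),   # "studyes" -> "studies"
    (r"^vn", "un"),     # "vnto" -> "unto" (early modern u/v variation)
]

def apply_rules(word):
    for pattern, repl in RULES:
        word = re.sub(pattern, repl, word)
    return word
```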

We have also trained a statistical machine translation model (with Moses [7]) and a neural machine translation model (with OpenNMT [6]). SMT has previously been used for the normalization task, for example in [11]. Both models are character-based, treating the known non-normalized and normalized word pairs as the two languages of the translation model. The language model used for the SMT model is built from the British National Corpus (BNC) [1].

One more approach we have tried is to compare the non-normalized words to the ones in the BNC by Levenshtein edit distance [8]. This results in long lists of normalization candidates, which we filter further by semantic similarity: we compare the two words appearing immediately before and after the non-normalized word with the corresponding context words of each normalization candidate, picking out the candidates with the largest number of shared contextual words. Finally, we filter this list by Soundex pronunciation using edit distance. A similar method relying on semantics and edit distance [2] has been used for normalization in the past.
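The filtering cascade can be sketched as follows. This is an illustrative reimplementation: difflib's `SequenceMatcher` ratio stands in for Levenshtein distance, the context sets stand in for BNC contexts, and the Soundex step is omitted for brevity:

```python
from difflib import SequenceMatcher

def rank_candidates(word, context, lexicon_contexts, min_sim=0.7):
    """Rank modern-spelling candidates for a historical word form.

    lexicon_contexts: candidate word -> set of words observed around it
    (in the paper's setup these would come from the BNC).
    """
    context = set(context)
    scored = []
    for cand, cand_ctx in lexicon_contexts.items():
        # String-similarity prefilter (stands in for Levenshtein).
        if SequenceMatcher(None, word, cand).ratio() >= min_sim:
            # Rank survivors by number of shared context words.
            scored.append((len(context & cand_ctx), cand))
    return [cand for overlap, cand in sorted(scored, reverse=True)]
```

With a toy lexicon, the historical form "speake" seen near "he did" ranks "speak" above unrelated words such as "bread".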

The Open Question

The methods described above succeed to varying degrees. However, none of them is reliable enough to be trusted above the rest. We are now in a situation where, most of the time, at least one of the approaches finds the correct normalization. The next unsolved question is how to pick the correct normalization from the list of alternatives in an accurate way.

Once the normalization has been solved, we are facing another problem which is mapping words to the OED correctly. For example, currently the verb "to moon" is mapped to the noun "mooning" recorded in the OED because it appeared in the present participle form in the corpus. This means that in the future, we have to come up with ways to tackle not only the problem of homonyms, but also the problem of polysemy. A word might have acquired a new meaning in one of our letters, but we cannot detect this word as a neologism candidate, because the word has existed in the language in a different meaning before.


1. The British National Corpus, version 3 (BNC XML Edition). Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium (2007)

2. Amoia, M., Martinez, J.M.: Using comparable collections of historical texts for building a diachronic dictionary for spelling normalization. In: Proceedings of the 7th workshop on language technology for cultural heritage, social sciences, and humanities. pp. 84–89 (2013)

3. Baron, A., Rayson, P.: VARD2: a tool for dealing with spelling variation in historical corpora (2008)

4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly Media (2009)

5. Burns, P.R.: MorphAdorner v2: A Java library for the morphological adornment of English language texts. Northwestern University, Evanston, IL (2013)

6. Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.M.: OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints

7. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. pp. 177–180. Association for Computational Linguistics (2007)

8. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady. vol. 10, pp. 707–710 (1966)

9. Nevalainen, T., Raumolin-Brunberg, H., Keränen, J., Nevala, M., Nurmi, A., Palander-Collin, M.: CEEC, Corpus of Early English Correspondence. Department of Modern Languages, University of Helsinki

10. OED: OED Online. Oxford University Press

11. Pettersson, E., Megyesi, B., Tiedemann, J.: An SMT approach to automatic annotation of historical text. In: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013; May 22-24; 2013; Oslo; Norway. NEALT Proceedings Series 18. pp. 54–69. No. 087, Linköping University Electronic Press (2013)

Hämäläinen-Normalizing Early English Letters for Neologism Retrieval-283.pdf

Late-Breaking Work

Triadic closure amplifies homophily in social networks

Aili Asikainen1, Gerardo Iñiguez2, Kimmo Kaski1, Mikko Kivelä1

1Aalto University, Finland; 2Next Games, Finland

Much of the structure in social networks can be explained by two seemingly separate network evolution mechanisms: triadic closure and homophily. While it is typical to analyse these mechanisms separately, empirical studies suggest that their dynamic interplay can be responsible for the striking homophily patterns seen in real social networks. By defining a network model with tunable amounts of homophily and triadic closure, we find that their interplay produces a myriad of effects, such as the amplification of latent homophily and memory in social networks (hysteresis). We use empirical network datasets to estimate how much of the observed homophily could actually be an amplification induced by triadic closure, and whether the networks have reached a stable state in terms of their homophily. Beyond their role in characterizing the origins of homophily, our results may be useful in determining the processes by which structural constraints and personal preferences determine the shape and evolution of society.

Asikainen-Triadic closure amplifies homophily in social networks-281.pdf
2:30pm - 3:45pmPlenary 4: Frans Mäyrä
Session Chair: Eetu Mäkelä
Game Culture Studies as Multidisciplinary (Digital) Cultural Studies. Can also be watched remotely from PII, PIV and P674.
Think Corner 
4:00pm - 5:30pmF-PII-2: Computational Linguistics 2
Session Chair: Risto Vilkko
4:00pm - 4:30pm
Long Paper (20+10min) [publication ready]

Verifying the Consistency of the Digitized Indo-European Sound Law System Generating the Data of the 120 Most Archaic Languages from Proto-Indo-European

Jouna Pyysalo1, Mans Hulden2, Aleksi Sahala1

1University of Helsinki; 2University of Colorado Boulder

Using state-of-the-art finite-state technology (FST), we automatically generate data for some 120 of the most archaic Indo-European (IE) languages from reconstructed Proto-Indo-European (PIE) by means of digitized sound laws. The accuracy rate of the automatic generation exceeds 99%, which also applies to the generation of new data that were not observed when the rules representing the sound laws were originally compiled. After testing and verifying the consistency of the sound law system with regard to the IE data and the PIE reconstruction, we report the following results:

a) The consistency of the digitized sound law system generating the data of the 120 most archaic Indo-European languages from Proto-Indo-European is verifiable.

b) The primary objective of Indo-European linguistics, a reconstruction theory of PIE in essence equivalent to the IE data (except for a limited set of open research problems), has been provably achieved.

The results are fully explicit, repeatable, and verifiable.

Pyysalo-Verifying the Consistency of the Digitized Indo-European Sound Law System Generating the Data of the_a.pdf

4:30pm - 4:45pm
Short Paper (10+5min) [publication ready]

Towards Topic Modeling Swedish Housing Policies: Using Linguistically Informed Topic Modeling to Explore Public Discourse

Anna Lindahl1, Love Börjeson2

1Gothenburg university; 2Graduate School of Education, Stanford University

This study examines how topic modeling can be applied to explore the public discourse of Swedish housing policies, as represented by documents from the Swedish parliament and Swedish news texts. This area is relevant to study because of the current housing crisis in Sweden.

Topic modeling is an unsupervised method for finding topics in large collections of data, which makes it suitable for examining public discourse. However, most studies that employ topic modeling make little use of linguistic information when preprocessing the data. Therefore, this work also investigates the effect of linguistically informed preprocessing on topic modeling. Through human evaluation, filtering the data based on part of speech is found to have the largest effect on topic quality. Non-lemmatized topics are rated higher than lemmatized topics. Topics from the filters based on dependency relations receive low ratings.

Lindahl-Towards Topic Modeling Swedish Housing Policies-256_a.pdf

4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

Embedded words in the historiography of technology and industry, 1931–2016

Johan Jarlbrink, Roger Mähler

University of Umeå, Sweden

From 1931 to 2016 the Swedish National Museum of Science and Technology published a yearbook, Dædalus. The 86 volumes display a great diversity of industrial heritage and cultures of technology. The first volumes centered on heavy industry, such as mining and paper plants located in northern and central Sweden. The last volumes were dedicated to technologies and products in people’s everyday lives – lipsticks, microwave ovens, and skateboards. Over the years Dædalus has covered topics reaching from individual inventors to world fairs, media technologies from print to computers, and agricultural developments from ancient farming tools to modern DNA analysis. The yearbook presents the history of industry, technology and science, but can also be read as a historiographical source reflecting shifting approaches to history over an 80-year period. Dædalus was recently digitized and can now be analyzed with the help of digital methods.

The aim of this paper is twofold: To explore the possibilities of word embedding models within a humanities framework, and to examine the Dædalus yearbook as a historiographical source with such a model. What we will present is work in progress with no definitive findings to show at the time of writing. Yet, we have a general idea of what we would like to accomplish. Analyzing the yearbook as a historiographical source means that we are interested in what kinds of histories it represents, its focus and bias. We follow Ben Schmidt’s (admittedly simplified) suggestion that word embedding models for textual analysis can be viewed and used as supervised topic model tools (Schmidt, 2015). If words are defined by the distribution of the vocabulary of their contexts we can calculate relations between words and explore fields of related words as well as binary relations in order to analyze their meaning. Simple – and yet fundamental – questions can be asked: What is “technology” in the context of the yearbook? What is “industry”? Of special interest in the case of industrial and technological history are binaries such as rural/urban, man/woman, industry/handicraft, production/consumption, and nature/culture. Which words are close to “man”, and which are close to “woman”? Which aspects of the history of technology and industry are related to “production” and which are related to “consumption”?

Word embedding is a comparatively new set of tools and techniques within data science (NLP), which have in common that the words in the vocabulary of a corpus (or several corpora) are assigned numerical representations through some computation (of a wide variety of possible ones). In most cases, this comes down not only to mapping the words to numerical vectors, but to doing so in such a way that the numerical values in the vectors reflect the contextual similarities between words. The computations are based on the distributional hypothesis stemming from Harris (1954), implying that “words which are similar in meaning occur in similar contexts” (Rubenstein & Goodenough, 1965). The words are embedded (positioned) in a high-dimensional space, each word represented by a vector in the space, i.e. a simple representational model based on linear algebra. The dimension of the space is defined by the size of the vectors, and the similarity between words then becomes a matter of computing the difference between vectors in this space, for instance the difference in (Euclidean) distance or the difference in direction between the vectors (cosine similarity). Within vector space models the latter is the most popular, under the assumption that related words tend to have similar directions. Arguably the most prominent and popular of these algorithms, and the one that we have used, is the skip-gram model Word2Vec (Mikolov et al., 2013). In short, this model uses a neural network to compute the word vectors as the result of training the network to predict the probabilities of all the words in a vocabulary being nearby (as defined by a window size) a certain word in focus.

An early evaluation shows that the model works well. Standard calculations often used to evaluate performance and accuracy indicate that we have implemented the model correctly – we can indeed get the correct answers to equations such as “Paris - France + Italy = Rome” (Mikolov et al., 2013). In our case we queried “most_similar(positive=['sverige','oslo'], negative=['stockholm'])”, and the most similar word was indeed “norge”. We have also explored simple word similarities in order to evaluate the model and get a better understanding of our corpus. What remains to be done is to identify relevant words (or groups of words) that can be used when we examine “topics” and binary dimensions in the corpus. We are also experimenting with different ways to cluster and visualize the data. Although some work remains, we will have results to present by the time of the conference.
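The analogy test mentioned above can be illustrated with vector arithmetic and cosine similarity in plain Python; the two-dimensional vectors below are invented for the sake of the example and do not come from the authors' trained model:

```python
import math

# Toy word vectors (invented; real models use hundreds of dimensions
# learned from the corpus).
vectors = {
    "stockholm": [0.9, 0.1],
    "sverige":   [0.8, 0.8],
    "oslo":      [0.1, 0.2],
    "norge":     [0.0, 0.9],
    "paris":     [0.5, 0.0],
}

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

def most_similar(positive, negative):
    # Sum the positive vectors, subtract the negative ones, then rank the
    # rest of the vocabulary by cosine similarity to the resulting vector.
    target = [0.0, 0.0]
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    candidates = {w: v for w, v in vectors.items()
                  if w not in positive and w not in negative}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(most_similar(["sverige", "oslo"], ["stockholm"]))  # prints norge
```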

Harris, Zellig (1954). Distributional structure. Word, 10(23):146–162.

Mikolov, Tomas, Chen, Kai, Corrado, Greg & Dean, Jeffrey (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781

Rubenstein, Herbert & Goodenough, John (1965). Contextual Correlates of Synonymy. Communications of the ACM, 8(10): 627-633.

Schmidt, Ben (2015). Word Embeddings for the Digital Humanities. Blog post.

Jarlbrink-Embedded words in the historiography of technology and industry, 1931–2016-223_a.pdf

5:00pm - 5:15pm
Short Paper (10+5min) [abstract]

Revisiting the authorship of Henry VIII’s Assertio septem sacramentorum through computational authorship attribution

Marjo Kaartinen, Aleksi Vesanto, Anni Hella

University of Turku

Undoubtedly, one of the great unsolved mysteries of Tudor history has for centuries been the authorship of Henry VIII’s famous treatise Assertio septem sacramentorum adversus Martinum Lutherum (1521). The question of its authorship already intrigued contemporaries in the 1520s. With Assertio, Henry VIII gained from the Pope the title Defender of the Faith, which British monarchs still use. Because of the exceptional importance of the text, the question of its authorship is far from irrelevant to the study of history.

For various reasons and motivations of their own, many doubted the king’s authorship, and the discussion has continued to the present day. A number of possible authors have been named, Thomas More and John Fisher foremost among them. There is no clear consensus about the authorship in general – nor is there clear agreement on the extent of the King’s role in the writing process in cases where joint authorship is suggested. The most commonly shared conclusion is indeed that the King was more or less helped in the writing process and that the authorship of the work was thus shared at least to some degree: that is, even if Henry VIII was active in the writing of Assertio, he was not the sole author but was helped by someone or by a group of theological scholars.

In the case of Assertio, the Academy of Finland-funded consortium Profiling Premodern Authors (PROPREAU) has tackled the difficult Latin source situation and put effort into developing more efficient machine learning methods for authorship attribution in cases where large training corpora are not available. This paper will present the latest discoveries in the development of such tools and will report on the results. These will give historians tools for opening up a myriad of questions we have hitherto been unable to answer. It is of great significance for the whole discipline of history to be able to attribute authors to texts that are anonymous or of disputed origin.

Select Bibliography:

Betteridge, Thomas: Writing Faith and Telling Tales: Literature, Politics, and Religion in the Work of Thomas More. University of Notre Dame Press 2013.

Brown, J. Mainwaring: Henry VIII.’s Book, “Assertio Septem Sacramentorum,” and the Royal Title of “Defender of the Faith”. Transactions of the Royal Historical Society 1880, 243–261.

Nitti, Silvana: Auctoritas: l’Assertio di Enrico VIII contro Lutero. Studi e testi del Rinascimento europeo. Edizioni di storia e letteratura 2005.

Kaartinen-Revisiting the authorship of Henry VIII’s Assertio septem sacramentorum through computational a_a.docx
4:00pm - 5:30pmF-PIV-2: Digital History
Session Chair: Mikko Tolonen
4:00pm - 4:30pm
Long Paper (20+10min) [publication ready]

Historical Networks and Identity Formation: Digital Representation of Statistical and Geo-Data to Mobilize Knowledge. Case Study of Norwegian Migration to the USA (1870-1920)

Jana Sverdljuk

National Library of Norway

This article is the result of a collaborative interdisciplinary workshop that involved expertise from the social sciences, history and digital humanities. It showed how computer-mediated ways of researching the historical networks and identity formation of Norwegian-Americans substantially complemented historical and social-science methods. Using the open API of the National Archives of Norway, we combined statistical, geo- and text data to produce an interactive temporal visualization of Norwegian regional origins on a map of the USA. Spatial visualization allowed us to highlight space, time and changing regional belonging as fundamental values for understanding the social and cultural dimensions of migrants’ lives. We claim that data visualizations of space and time have performative materiality (Drucker 2013). They leave room for researchers to come up with their own narrative about the studied phenomenon (Perez and Granger 2015), and they make us reflect on the relationship between a phenomenon and its representation (Klein 2014). This digital method supplements classical sociological and socio-constructivist methods and therefore has knowledge-mobilizing effects. In the article, we show what potential this visualization has for the particular field of emigration studies when it enters into a dialogue with existing historical research in the field.

Sverdljuk-Historical Networks and Identity Formation-155_a.pdf

4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

Spheres of “public” in eighteenth-century Britain

Mark J. Hill1, Antti Kanner1, Jani Marjanen1, Ville Vaara1, Eetu Mäkelä1, Leo Lahti2, Mikko Tolonen1

1University of Helsinki; 2University of Turku

The eighteenth century saw a transformation in the practices of public discourse. With the emergence of clubs, associations, and, in particular, coffee houses, civic exchange intensified from the late seventeenth century. At the same time print media was transformed: book printing proliferated; new genres emerged (especially novels and small histories); works printed in smaller formats made reading more convenient (including in public); and periodicals – generally printed onto single folio half-sheets – emerged as a separate category of printed work which was written specifically for public consumption, and with the intention of influencing public discourse (such periodicals were intended to be both ephemeral and shared, often read, and then discussed, publicly each day). This paper studies how these changes may be recognized in language by quantitatively studying the word “public” and its semantic context in Eighteenth Century Collections Online (ECCO).

While there are many descriptions of the transformation of public discourse (both contemporary and historical), there has been limited research into the language revolving (and evolving) around “public” in the eighteenth century. Jürgen Habermas (2003: 2-3) famously argues that the emergence of words such as “Öffentlichkeit” in German and “publicity” in English is indicative of a change in the public sphere more generally. The conceptual history of “Öffentlichkeit” has been further studied in depth by Lucian Hölscher (1978), but a systematic study of the semantic context of “public” in British eighteenth-century material is missing. Studies that have covered this topic, such as Gunn (1989), base their findings on a very limited set of source material. In contrast, this study, by using a large-scale digitized corpus, aims to supplement earlier studies that focus on individual speech acts or particular collections of sources, and to provide a more comprehensive account of how the language of “public” changed in the eighteenth century.

The historical subject matter means that the study is based on the ECCO corpus. While ECCO is in many ways an invaluable resource, a key goal of this study is to be methodologically sound from the perspective of corpus linguistics and intellectual history, while developing insights which are relevant more generally to sociologists and historians. In this regard, ECCO does come with its own particular problems, both in terms of content and of size.

With regard to content: OCR mistakes remain problematic; its heterogeneity in genres can skew investigations; and the unpredictable nature of duplicate texts introduced by numerous reprints of certain volumes must be taken into account. However, many of these problems can be mitigated in different ways. For example, in specific cases we compare findings with the much smaller ECCO-TCP (an OCR-corrected subset of ECCO). We have further used the English Short Title Catalogue (ESTC) to connect textual findings with relevant metadata contained in the catalogue. By merging ESTC metadata with ECCO, one can more easily use existing historical knowledge (for example, issues around reprints and multiple editions) to engage with the corpus.

With regard to size: the corpus itself is too big to run automatic parsers on. We have therefore extracted a separate, and smaller, corpus (with the help of ESTC metadata) for more complex and demanding analyses. The results of these analyses were then replicated in a much simpler and cruder form on the whole dataset to gauge whether they corroborate the initial observations.

The size constraints provide their own advantages, however. The smaller subsections were chosen to represent pamphlets and other similar short documents by extracting all documents with fewer than 10406 characters. Compared to other specific genres or text types, this proved a successful method for defining a meaningful subcorpus, while at the same time limiting the effects of reprints and including a relatively large number of individual writers in the analysis. The subjects covered by pamphlets also tend to be historically topical, and as the texts are short, inspecting single occurrences in their original context is much more efficient: features such as main theme, context, and the writer’s intentions reveal themselves far more quickly than in larger works. Thus, issues around distant and close reading are more easily overcome. In addition, we are able to compare semantic change between the larger corpus and the more rapidly shifting topical and political debates found in pamphlets, which offers its own historical insights.
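The pamphlet-subcorpus extraction described above amounts, in its simplest form, to a length filter; the document records and field names below are invented for illustration:

```python
# Keep only documents below the pamphlet length threshold used in the study.
PAMPHLET_MAX_CHARS = 10406

def pamphlet_subcorpus(docs, max_chars=PAMPHLET_MAX_CHARS):
    return [d for d in docs if len(d["text"]) < max_chars]

documents = [
    {"estc_id": "T000001", "text": "A short topical pamphlet on public credit."},
    {"estc_id": "T000002", "text": "x" * 50000},  # a long work, excluded
]
print([d["estc_id"] for d in pamphlet_subcorpus(documents)])  # prints ['T000001']
```

In practice the selection would be driven by ESTC metadata rather than a hard-coded list.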

In terms of specific linguistic approaches, analysis started with examinations of contextual distributions of “public” by year. Then, by changing the parameters of this analysis (for example, by defining the context as a set of syntactic dependencies engaged by public, or as collocation structures of a wider lexical environment) different aspects of the use of “public” can be brought to the foreground.

As syntactic constraints govern the possible combinations of words in shorter ranges of context, narrower context windows contain a great deal of syntactic information in addition to collocational information. Because of this syntactic restrictedness of close-range combinations, the semantic relatedness of words with similar short-range context distributions is one of degree of mutual interchangeability and, as such, of metaphorical relatedness (Heylen, Peirsman, Geeraerts, Speelman 2008). Wider context windows, such as paragraphs, are free from syntactic constraints, so the semantic relatedness between two words with similar wide-range context distributions carries information from frequent contiguity in context and can be described as more metonymical than metaphorical in nature, as is visible in applications based on term-document matrices, such as topic modelling or Latent Semantic Analysis (cf. Blei, Ng and Jordan (2003) and Dumais (2005)).
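The contrast between narrow and wide context windows can be sketched with a simple co-occurrence counter; the window sizes and toy sentence are illustrative only, not the study's actual parameters:

```python
from collections import Counter

def cooccurrences(tokens, target, window):
    # Count words occurring within `window` positions of the target word.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

text = "the public good and the public opinion of the public".split()
narrow = cooccurrences(text, "public", window=1)  # mostly syntactic neighbours
wide = cooccurrences(text, "public", window=4)    # broader lexical environment
print(narrow.most_common(1))  # prints [('the', 3)]
```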

The syntactic dependencies were counted by analysing the pamphlet subcorpus with the Stanford lexical parser (Chen and Manning 2014). The results show changes in the tendency to use “public” as an adjectival attribute and in compound positions. Since in English the overwhelmingly most frequent position for both adjectival and compounding attributes is preceding the head word, this analysis could be adequately replicated using bigrams on the whole dataset. Lexical environments have been analysed by clustering second-order collocations (cf. Bertels and Speelman (2014)) and replicated by using random sampling from the whole dataset to produce the second-order vectors.

The study of all bigrams involving “public” (such as “public opinion”, “public finances”, “public religion”) in ECCO provides for a broader analysis of the use of “public” in eighteenth-century discourse that not only focuses on particular compounds, but gives a better idea of the domains in which “public” was used. It points towards a declining trend in the relative frequency of religious bigrams during the course of the eighteenth century and a rise in the relative frequency of secular bigrams, both political and economic. This allows us to present three arguments. First, it is argued that this is indicative of an overall shift in the language around “public” as the concept’s focus changed and it began to be used in new domains. This expansion of the discourses or domains in which “public” was used is confirmed in the analyses of the wider lexical environment. Second, we also notice that some collocates of “public”, such as “public opinion” and “public good”, gained a stronger rhetorical appeal. They became tropes in their own right and gained a future orientation in political discourse in the latter half of the eighteenth century (Koselleck 1972). Third, by combining the results of the distributional semantics of “public” in ECCO with information extracted from the ESTC, one can recognize how different groups used the language relating to “public” in different ways. For example, authors writing on religious topics tended to use “public” differently from authors associated with the enlightenment in Scotland or France.
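The bigram analysis described above reduces, at its core, to counting adjacent word pairs that begin with “public”; the mini-corpus below is invented for illustration:

```python
from collections import Counter

def public_bigrams(tokens):
    # Count every adjacent pair whose first element is "public".
    return Counter((a, b) for a, b in zip(tokens, tokens[1:]) if a == "public")

tokens = ("the public good requires that public opinion and "
          "public worship be distinguished from public opinion").split()
counts = public_bigrams(tokens)
print(counts[("public", "opinion")])  # prints 2
```

Tracking such counts per decade, normalised by corpus size, yields relative-frequency trends of the kind discussed above.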

There are two important upshots to this study: the methodological and the historical. With regard to the former, the paper works as a convincing case study which could be used as an example, or workflow, for studying other words that are pivotal to large structural change. With regard to the latter, the work is of particular historical relevance to recent discussions in eighteenth-century intellectual history. In particular, the study contributes to the critical discussion of Habermas that has been taking place in the English-speaking world since the translation of his Structural Transformation of the Public Sphere in 1989, while also informing more traditional historical analyses which have not been able to draw tools from the digital humanities (Hill 2017).


Bertels, Ann and Dirk Speelman (2014). “Clustering for semantic purposes. Exploration of semantic similarity in a technical corpus.” Terminology 20:2, pp. 279–303. John Benjamins Publishing Company.

Blei, David, Andrew Y. Ng and Michael I. Jordan (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3, pp. 993–1022.

Chen, Danqi and Christopher D Manning (2014). “A Fast and Accurate Dependency Parser using Neural Networks.” Proceedings of EMNLP 2014.

Dumais, Susan T. (2005). Latent Semantic Analysis. Annual Review of Information Science and Technology. 38: 188–230.

Gunn, J.A.W. (1989). “Public opinion.” Political Innovation and Conceptual Change (Edited by Terence Ball, James Farr & Russell L. Hanson). Cambridge: Cambridge University Press.

Habermas, Jürgen (2003 [1962]). The Structural Transformation of the Public Sphere: An Inquiry into a Category of Bourgeois Society. Cambridge: Polity.

Heylen, Kris, Yves Peirsman, Dirk Geeraerts and Dirk Speelman (2008). “Modelling Word Similarity: An Evaluation of Automatic Synonymy Extraction Algorithms.” Proceedings of LREC 2008.

Hill, Mark J. (2017), “Invisible interpretations: reflections on the digital humanities and intellectual history.” Global Intellectual History 1.2, pp. 130-150.

Hölscher, Lucian (1978), “‘Öffentlichkeit.’” Otto Brunner et al. (Hrsg.) Geschichtliche Grundbegriffe. Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. Band 4, Stuttgart, Klett-Cotta, pp. 413–467.

Koselleck, Reinhart (1972), “‘Einleitung.’” Otto Brunner, Werner Conze & Reinhart Koselleck (hrsg.), Geschichtliche Grundbegriffe. Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. Band I, Stuttgart, Klett-Cotta, pp. XIII–XXVII.

Hill-Spheres of “public” in eighteenth-century Britain-178_a.pdf

4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

Charting the ‘Culture’ of Cultural Treaties: Digital Humanities approaches to the history of international ideas

Benjamin G. Martin

Uppsala University

Cultural treaties are the bilateral or sometimes multilateral agreements among states that promote and regulate cooperation and exchange in the fields of life we call cultural or intellectual. Pioneered by France just after World War I, this type of treaty represents a distinctive technology of modern international relations, a tool in the toolkit of public diplomacy, a vector of “soft power.” One goal of a comparative examination of these treaties is to locate them in the history of public diplomacy and in the broader history of culture and power in the international arena. But these treaties can also serve as sources for the study of what the historian David Armitage has called “the intellectual history of the international.” In this project, I use digital humanities methods to approach cultural treaties as a historical source with which to explore the emergence of a global concept of culture in the twentieth century. Specifically, the project will investigate the hypothesis that the culture concept, in contrast to earlier ideas of civilization, played a key role in the consolidation of the post-World War II international order.

I approach the topic by charting how concepts of culture were given form in the system of international treaties between 1919 (when the first such treaty was signed) and 1972 (when UNESCO’s Convention on cultural heritage marked the “arrival” of a global embrace of the culture concept), studying them with the large-scale, quantitative methods of the digital humanities, as well as with the tools of textual and conceptual analysis associated with the study of intellectual history. In my paper for DH Nordic 2018, I will outline the topic, goals, and methods of the project, focusing on the ways we (that is, my colleagues at Umeå University’s HUMlab and I) seek to apply DH approaches to this study of global intellectual history.

The project uses computer-assisted quantitative analysis to analyze and visualize how cultural treaties contributed to the spread of cultural concepts and to the development of transnational cultural networks. We explore the source material offered by these treaties by approaching it as two distinct data sets. First, to chart the emergence of an international system of cultural treaties, we use quantitative analysis of the basic information, or “metadata” (countries, date, topic), from the complete set of treaties on cultural matters between 1919 and 1972, approximately 1250 documents. Our source for this information is the World Treaty Index. This data can also help identify historical patterns in the emergence of a global network of bilateral cultural treaties. Once mapped, these networks will allow me to pose interesting questions by comparing them to any number of other transnational systems. How, for example, does the map of cultural agreements compare to that of trade treaties, military alliances, or to the transnational flows of cultural goods, capital, or migrants?

Second, to identify the development of concepts, we will observe the changing use of key terms through quantitative analysis of the treaty texts. By treating a large group of cultural treaties as several distinct text corpora and, perhaps, as a single text corpus, we will be able to explore the treaties using textometry and topic modeling. The treaty texts (digital versions of most of which can be found online) will be limited to four subsets: a) Britain, France, and Italy, 1919-1972; b) India, 1947-1972; c) the German Reich (1919-1945) and the two German successor states (1949-1972); and d) UNESCO’s multilateral conventions (1945-1972). This selection is designed to approach a global perspective while taking into account practical factors, such as language and accessibility. Our use of text analysis seeks (a) to offer insight into the changing usage and meanings of concepts like “culture” and “civilization”; (b) to identify which key areas of cultural activity were regulated by the treaties over time and by world region; and (c) to clarify whether “culture” was used in a broad, anthropological sense, or in a narrower sense to refer to the realm of arts, music, and literature. This aspect of the project raises interesting challenges, for example regarding how best to handle a multilingual text corpus (with texts in English, French, and German, at least).

In these ways, the project seeks to contribute to our understanding of how the concept of culture that guides today’s international society developed. It also explores how digital tools can help us ask (and eventually answer) questions in the field of global intellectual history.

Martin-Charting the ’Culture’ of Cultural Treaties-139_a.pdf

5:00pm - 5:15pm
Short Paper (10+5min) [abstract]

Facilitating Digital History in Finland: What can we learn from the past?

Mats Fridlund, Mila Oiva, Petri Paju

Aalto University

The paper discusses the findings of the project “From Roadmap to Roadshow: A collective demonstration & information project to strengthen Finnish digital history”, which develops the discipline of history in Finland collaboratively and received funding from the Kone Foundation. The long paper proposed for DHN2018 will discuss what we have learned about the present-day conditions of digital history in Finland, how digital humanities is facilitated today in Finland and abroad, and what suggestions we could give for strengthening the conditions for doing digital history research in Finland.

In the first phase of the project we conducted a survey among Finnish historians and identified several critical issues that require further development: creating better, up-to-date information channels for digital history resources and events; providing relevant education, skills, and teaching for historians; and helping historians and information technology specialists to meet and collaborate better and more systematically than before. Many historians also had issues with the concept of digital history and difficulties with such an identity.

In order to situate Finnish digital history in its domestic and international contexts, we have studied the roots of computational history research in Finland, which date back to the 1960s, and the best practices of how digital history is currently done internationally. We have visited selected digital humanities centers in Europe and the US which we identified as having “done something right”. Based on these studies, visits and interviews, we will propose steps to be taken to further strengthen the digital history research community in Finland.

Fridlund-Facilitating Digital History in Finland-220_a.pdf
4:00pm - 5:30pmF-P674-2: Between the Manual and the Automatic
Session Chair: Eero Hyvönen
4:00pm - 4:15pm
Short Paper (10+5min) [publication ready]

In search of Soviet wartime interpreters: triangulating manual and digital archive work

Svetlana Probirskaja

University of Helsinki

This paper demonstrates the methodological stages of searching for Soviet wartime interpreters of Finnish in the digital archival resource of the Russian Ministry of Defence called Pamyat Naroda (Memory of the People) 1941–1945. Since wartime interpreters do not have their own search category in the archive, other means are needed to detect them. The main argument of this paper is that conventional manual work must be done and some preliminary information obtained before entering the digital archive, especially when dealing with a marginal subject such as wartime interpreters.

Probirskaja-In search of Soviet wartime interpreters-142_a.pdf

4:15pm - 4:30pm
Distinguished Short Paper (10+5min) [abstract]

Digital Humanities Meets Literary Studies: the Challenges for Estonian Scholarship

Piret Viires1, Marin Laak2

1Tallinn University; 2Estonian Literary Museum

In recent years, the application of DH as a method of computerised analysis and the extensive digitisation of literary texts, making them accessible as open data and organising them into large text corpora, have made the relations between literature and information technology a hot topic.

New directions in literary history link together literary analysis, computer technology and computational linguistics, offering new possibilities for studying the authors’ style and language, analysing texts and visualising results.

Alongside such mainstream uses, DH still contains several other important directions for literary studies. The aim of this paper is to test the limits and possibilities of DH as a concept and to determine its suitability for literary research in the digital age. Our discussion is based, first, on the twenty-year-long experience of digitally representing Estonian literary and cultural heritage and, second, on the synchronous study of digitally born literary forms; we shall also offer more representative examples.

We shall also discuss the concept of DH from the viewpoint of literary studies; for example, we examine the ways of positioning digitally created literature (both “electronic literature” and the literature born in social media) under this renewed concept. This problem was topical in the early 2000s, but in the following decade it was replaced by the broader ideas of intermedia and transmedia, which treated literary texts only as one medium among many others. What are the specific features of digital literature, what are its accompanying effects, and how has the role of the reader as recipient changed in the digital environment? These theoretical questions are also indirectly relevant for making the literature created in the era of printed books accessible as e-books or open data.

Digitising older literature is the responsibility of memory institutions (libraries, archives, museums). Extensive digitising of texts at memory institutions seems to have been done to make reading more convenient – books can be read even on smartphones. Digitising works of fiction as part of projects for digitising cultural heritage has been carried out for more than twenty years. What is the relation of these virtual bookshelves to the digital humanities? We need to discover whether and how both digitally born literature and the digitised literature that was born in the era of printing affect literary theory. Our paper will also focus on mapping the different directions, practices and applications of DH in present-day literary theory. The topical question is how to bridge the gap between the research possibilities offered by present-day DH and the ever-increasing resources of texts produced by memory institutions. We encounter several problems. Literary scholars are used to working with texts, analysing them as undivided works of poetry, prose or drama. Using DH methods requires treating literary works or texts as data, which can be analysed and processed with computer programmes (data mining, visualisation tools, etc.). These activities require the posing of new and totally different research questions in literary studies. Susan Schreibman, Ray Siemens and John Unsworth, the editors of A New Companion to Digital Humanities (2016), discuss the problems of DH and point out in their Foreword that it is still questioned whether DH should be considered a separate discipline or, rather, a set of different interlinked methods. In our paper we emphasise the diversity of DH as an academic field of research and discuss the other possibilities it can offer for literary research in addition to computational analyses of texts.

In Estonia, research on electronic new media and the application of digital technology in the field of literary studies can be traced back to the second half of the 1990s. The analysis of social, cultural and creative effects (see Schreibman, Siemens, Unsworth 2016: xvii–xviii), as well as constant cooperation with the social sciences in research on Internet usage, have played an important role in Estonian literary studies.

Viires-Digital Humanities Meets Literary Studies-266_a.pdf
Viires-Digital Humanities Meets Literary Studies-266_c.pdf

4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

Digital humanities and environmental reporting in television during the Cold War: Methodological issues of exploring materials of the Estonian, Finnish, Swedish, Danish, and British broadcasting companies

Simo Laakkonen

University of Turku, Degree Programme on Cultural Production and Landscape Studies

So far, environmental history studies have relied on traditional historical archival and other related source materials. Despite the increasing availability of new digitized materials, studies in this field have not reacted to these emerging opportunities in any particular way. The aim of the proposed paper is to discuss the possibilities and limitations embodied in the new digitized source materials in different European countries. The proposed paper is an outcome of a research project that explores the early days of television prior to the Earth Day in 1970 and frames this exploration from an environmental perspective. The focus of the project is reporting on environmental pollution and protection during the Cold War. In order to realize this study, the quantity and quality of related digitized and non-digitized source materials provided by the national broadcasting companies of Estonia (ETV), Finland (YLE), Sweden (SVT), Denmark (DR), and the United Kingdom (BBC) were examined. The main outcome of this international comparative study is that the quantity and quality of the available materials vary greatly, even surprisingly, between the examined countries, which belonged to different political spheres (Warsaw Pact, neutral, NATO) during the Cold War.

Laakkonen-Digital humanities and environmental reporting in television during the Cold War Methodological_a.docx

4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

Prosodic clashes between music and language – challenges of corpus-use and openness in the study of song texts

Heini Arjava

University of Helsinki

In my talk I will discuss the relationship between linguistic and musical rhythm, and the connections to digital humanities and open science that arise in their study. My ongoing corpus research examines the relationship between linguistic and musical segment length in songs, focusing on instances where the language has to adapt prosodically to the rhythmic frame provided by pre-existing music. More precisely, the study addresses the question of how syllable length and note length interact in music. To what extent can non-conformity between linguistic and musical segment length, clashes, be acceptable in song lyrics, and what other prosodic features, such as stress, may influence the occurrence of clashes in segment length?

Addressing these questions with a corpus-based approach leads to questions of information retrieval from complicated corpora that combine two media (music and language), and of the openness and accessibility of music sources. In this abstract I first describe my research questions and the song corpus used in my study in section 1, and then discuss their relationship with the use, analysis and availability of corpora, and with issues of open science, in section 2.

1. Research setting and corpus

My study approaches the comparison of musical and linguistic rhythm with both qualitative and statistical methods. It is based on a self-collected song corpus in Finnish, a language in which syllable length has a versatile relationship with stress (cf. Hakulinen et al. 2004). Primary stress in Finnish is weight-insensitive and always falls on the first syllable of a word, and syllables of any length, long or short, can be stressed or unstressed. Finnish sound segment length is also phonemic, that is, it creates distinctions of meaning. Syllable length in Finnish is therefore of particular interest in a study of musical segment length, because length deviations play an evident role in language perception.

Music and text can be combined into a composition in a number of ways, but my study focuses on the situations in which language is most dependent on music. There are usually three alternative orders in which music and language can be combined into songs: First, text and music may be written simultaneously and influence the musical and linguistic choices of the writer at the same time (Language <–> Music). Secondly, the text can precede the music, as when composers set existing poetry (Language –> Music). And finally, the melody may exist first, as when new versions of songs are created by translating or otherwise rewriting them to familiar tunes (Music –> Language).

My research is concerned with this third relationship, because it poses the strongest constraints on the language user. The language (text) must conform to the music’s already existing rhythmic frame that is in many respects inflexible, and in such cases, it is difficult to vary the rhythmic elements of the text, because the musical space restricts the rhythmic tools available for the language user. This in turn may lead to non-neutral linguistic output. Thus the crucial question arises: How does language adapt its rhythm to music?

My corpus contains songs that clearly and transparently represent the relationship of music being created first and providing the rhythmic frame, and language having to adjust to that frame. The pilot corpus consists of 15 songs and approximately 1500 prosodically annotated syllables of song texts in Finnish, translated or otherwise adapted from different languages, or written to instrumental or traditional music. The genres include chansons, drinking songs, Christmas songs and hymns, which originate from different eras and languages, namely English, French, German, Swedish, and Italian.

One data point in the table format of the corpus is a Finnish syllable, whose prosodic properties I compare with the rhythm of the respective notes (musical length and stress). The most basic instance of a clash between segment lengths is a short syllable ((C)V in Finnish) falling on a long note (i.e. a note longer than a basic half-beat). Both theoretical and empirical evidence will be used to determine which length values create the clearest cases of prosodic clashes.
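The basic clash type described above can be sketched as a small filter over the annotated syllable table. This is only an illustrative sketch, not the author's actual pipeline: the field layout, the syllable heuristic and the half-beat threshold are assumptions made for the example.

```python
# Hypothetical sketch: flag a prosodic clash where a short (C)V syllable
# falls on a note longer than a basic half-beat. Field names and the
# syllable-shape heuristic are illustrative assumptions, not the study's
# real annotation scheme.

def is_short_syllable(syllable: str) -> bool:
    """Treat (C)V syllables (at most one consonant plus one short vowel) as short."""
    vowels = set("aeiouyäö")
    return (len(syllable) <= 2
            and syllable[-1] in vowels
            and all(ch in vowels for ch in syllable[1:]))

def has_clash(syllable: str, note_beats: float, half_beat: float = 0.5) -> bool:
    """Flag the basic clash type: a short syllable set on a long note."""
    return is_short_syllable(syllable) and note_beats > half_beat

# Toy rows of (syllable, note length in beats), standing in for corpus data.
rows = [
    ("jou", 1.0),  # long syllable on a long note: no clash
    ("lu", 1.0),   # short syllable on a long note: clash
    ("on", 0.5),   # closed syllable on a half-beat note: no clash
]
clashes = [(syl, beats) for syl, beats in rows if has_clash(syl, beats)]
```

In the toy data only ("lu", 1.0) is flagged; in a real corpus the length categories would of course come from the prosodic annotation rather than from string shape.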

A crucial presupposition when problematising the relationship between a musical form and the text written to it is the notion that a song is not poetry per se (I will return to this conception in section 2). The conventions of Western art music allow for a far greater range of length distinctions than language: the syllable lengths usually fall into binary or ternary categories (e.g. short and long syllables), whereas in music notes can be elongated infinitely. A translated song in which all rhythmic restrictions come from the music may follow the lines of poetic traditions, but must deviate from them if the limits of space within music do not allow for full flexibility. It is therefore an intermediate form of verbal art.

2. Challenges for digital humanities and open science

The corpus-based approach to language and music poses problematic questions for digital humanities. The first of these is, of course, whether useful music-linguistic corpora can be found at all at present. Existing written and spoken corpora of the major European languages contain millions of words, often annotated in great linguistic detail (cf. Korp of Kielipankki for Finnish, which offers detailed contextual, morphological and syntactic analysis). For music as well, digital music scores can be found “in a huge number” (Ponce de León et al. 2008: 560). Corpora of song texts with both linguistic and musical information seem to be more difficult to find.

One problem of music-linguistic studies is that its sources are less open and shareable than those of written or spoken language. The copyright questions of art are in general a more sensitive issue than, for instance, those of newspaper articles or internet conversations, and the reluctance of the owners of song texts and melodies may have made it difficult to create open corpora of contemporary music.

But even with ownership problems aside (as with older or traditional music), building a music-linguistic corpus remains a difficult task to accomplish. A truly useful corpus of music for linguistic purposes would include metadata on both media, language and music. Thus even an automatically analysed metric corpus of poetry, like Anatoli Starostin’s Treeton for metrical analysis of Russian poems (Pilshchikov & Starostin 2011) or the rhythmic Metricalizer for determining meter by stress patterns in German poems (Bobenhausen 2011), does not answer the questions of rhythm in a song text, which exists in an altogether extra-linguistic medium, music. Vocal music is metrical, but not in the strict sense of poetic conventions, with which it shares the isochronic base. Automated analysis of a song text without its music notation tells nothing about its real metrical structure.

On a technical level, one set of tools necessary for researchers of music is tools for quick visualization of music passages (notation tools, sound recognition). Such software can be found and used freely on the internet and is useful for depiction purposes. Mining information from music requires more effort, but has been done in various projects, for instance for melody information retrieval (Ponce de León et al. 2008) or for metrical detection of notes (Temperley 2001). But again, these tools seem rarely to combine linguistic and musical meter simultaneously.

By raising these questions I hope to bring attention to the challenges of studying texts in the musical domain, that is, not simply music or poetry separately. The crux of the issue is that for the linguistic analysis of song texts we need actual textual data in which the musical domain appears as annotated metadata. Means exist to analyse text automatically, and to analyse musical patterns with sound recognition or otherwise, but combining the two raises the analysis to a more complicated level.


Blumenfeld, Lev. 2016. End-weight effects in verse and language. In: Studia Metrica Poet. Vol. 3.1 pp. 7–32.

Bobenhausen, Klemens. 2011. The Metricalizer – Automated Metrical Markup of German Poetry. In: Küper, C. (ed.), Current trends in metrical analysis, pp. 119-131. Frankfurt am Main; New York: Peter Lang.

Hayes, Bruce. 1995. Metrical Stress Theory: principles and case studies. Chicago: The University of Chicago Press.

Hakulinen, et al. (eds.). 2004. Iso suomen kielioppi, pp.44–48. Helsinki: Suomalaisen Kirjallisuuden Seura.

Jeannin, M. 2008. Organizational Structures in Language and Music. In: The World of Music,50(1), pp. 5–16.

Kiparsky, Paul. 2006. A modular metrics for folk verse. In: B. Elan Dresher & Nila Friedberg (eds.), Formal approaches to poetry: recent developments in metrics, pp.7–52. Berlin: Mouton de Gruyter.

Lerdahl, Fred & Jackendoff, Ray. 1983. A generative theory of tonal music. Cambridge (MA): MIT.

Lotz, John. 1960. Metric typology. In: Thomas Sebeok (ed.), Style in language. Massachusetts: The M.I.T. Press.

Palmer, Caroline & Kelly, Michael H. 1992. Linguistic Prosody and Musical Meter in Song. In: Journal of Memory and Language 31, pp. 525–542.

Pilshchikov, Igor & Starostin, Anatoli. 2011. Automated Analysis of Poetic Texts and the Problem of Verse Meter. In: Küper, C. (ed.), Current trends in metrical analysis, pp. 133–140. Frankfurt am Main; New York: Peter Lang.

Ponce de León, Pedro J., Iñesta, José M. & Rizo, David. 2008. Mining Digital Music Score Collections: Melody Extraction and Genre Recognition. In: Peng-Yeng Yin (ed.), Pattern Recognition Techniques, Technology and Applications, pp. 626–. Vienna: I-Tech.

Temperley, D. 2001. The Cognition Of Basic Musical Structures. Cambridge, Mass: MIT Press.

Arjava-Prosodic clashes between music and language – challenges-219_a.pdf
Arjava-Prosodic clashes between music and language – challenges-219_c.pdf

5:00pm - 5:15pm
Distinguished Short Paper (10+5min) [abstract]

Finnish aesthetics in scientific databases

Darius Pacauskas, Ossi Naukkarinen

Aalto University School of Arts, Design and Architecture

Major academic databases such as Web of Science and Scopus are dominated by publications written in English, often by scholars affiliated with American and British universities. As such databases are repeatedly used as a basis for assessing and analyzing the activities and impact of universities and even individual scholars, there is a risk that everything published in other, especially minor, languages will be sidetracked. Standard data-mining procedures do not notice them. Yet, especially in the humanities, other languages and cultures play an important role, and scholars publish in various languages.

The aim of this research project is to critically look into how Finnish aesthetics is represented in scientific databases. What kind of picture of Finnish aesthetics can we draw if we rely on the metadata from commonly used databases?

We will address this general issue through one example. We will compare metadata from two different databases, in two different languages, English and Finnish, and form a picture of two different interpretations of an academic field, aesthetics - or estetiikka in Finnish. To achieve this we will employ citation analysis, as well as text summarization techniques, in order to understand the differences between the largest world scientific database, Scopus, and the largest Finnish one, Elektra. Moreover, we will identify the most influential Finnish aestheticians and analyze their publication records in order to understand to what extent the scientific databases can represent Finnish aesthetics. Through this, we will present 1) two different maps containing actors and works recognized in the field, and 2) an overview of the main topics from the two databases.

For these goals, we will collect metadata and references for each relevant article from both the Scopus and Elektra databases. Relevant articles will be located by using the keyword “aesthetics” or its Finnish equivalent “estetiikka”, as well as by identifying scientific journals focusing on aesthetics. We will perform citation analysis, based on Scopus data, to explore which publications are cited in which countries. This comparison will allow us to see which works are most prominent in different countries, and in which countries those works were produced, e.g. which works are acknowledged by Finnish aestheticians according to the international database. In addition, the comparison will allow us to understand how Finnish aesthetics differs from that of other countries.

Later, we will perform citation analysis with data gathered from the Finnish scientific database Elektra. The results will indicate the distribution between cited Anglo-American texts and those written in Finland or in the Finnish language. Thus we can see which language families Finnish aestheticians rely on in their works. Further, we will apply text summarization techniques to see the differences in the topics the two databases discuss. Furthermore, we will compile a list of the names of the most influential Finnish aestheticians and their works (as provided by the databases), and perform searches within the two databases to see how many of their works are covered.
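The core of the comparison described above is a tally of cited works per database. A minimal sketch, assuming exported metadata records with a references field (the record layout and the toy entries below are invented for illustration, not real Scopus or Elektra data):

```python
# Hypothetical sketch of the cross-database citation tally: count how often
# each cited work appears in records retrieved with the keyword
# "aesthetics" (Scopus) versus "estetiikka" (Elektra). The dict layout and
# the sample records are toy stand-ins for exported database metadata.
from collections import Counter

def citation_counts(records):
    """Tally cited works across a set of article metadata records."""
    counts = Counter()
    for rec in records:
        counts.update(rec.get("references", []))
    return counts

scopus = [
    {"title": "On aesthetics", "references": ["Kant 1790", "Dickie 1974"]},
    {"title": "Everyday aesthetics", "references": ["Kant 1790"]},
]
elektra = [
    {"title": "Estetiikan historia", "references": ["Kant 1790", "Krohn 1897"]},
]

# Works cited in the Finnish database but absent from the international one.
only_finnish = set(citation_counts(elektra)) - set(citation_counts(scopus))
```

In practice the interesting output is exactly this difference set: works visible through Elektra that a Scopus-only analysis would miss entirely.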

As an additional contribution, we will develop an interactive web-based tool to present the results of this research. Such a tool will give aesthetics researchers an opportunity to explore the field of Finnish aesthetics through our established lenses and to comment on possible gaps in the pictures offered by the databases. It is possible that the databases give only a very partial picture of the field, in which case new tools should be developed in co-operation with researchers. A similar situation may also hold in other subfields of the humanities where non-English activity is common.

Pacauskas-Finnish aesthetics in scientific databases-143_a.pdf
4:00pm - 5:30pmF-TC-2: Games as Culture
Session Chair: Frans Mäyrä
Think Corner 
4:00pm - 4:15pm
Short Paper (10+5min) [abstract]

The Science of Sub-creation: Transmedial World Building in Fantasy-Based MMORPGs

Rebecca Anderson

University of Waterloo, The Games Institute, First Person Scholar

My paper examines how virtual communities are created by fandoms in massively multiplayer online role-playing games, and it explores what kinds of self-construction emerge in these digital locales and how such self-construction reciprocally affects the living culture of the game. I assert that the universe of a fantasy-based MMORPG necessitates participatory culture: experiencing the story means participating in the culture of the story’s world; these experiences reciprocally affect the living culture of the game’s universe. The participation and investment of readers, viewers, and players in this world constitute what Carolyn Marvin calls a textual community, or a group that “organize[s] around a presumptively shared, but distinctly practiced, epistemology of texts and interpretive procedures” (12). In other words, the textual community produces a shared discourse, one that informs and interrogates what it means to be a fan in both analogue and digital environments.

My paper uses J.R.R. Tolkien’s Middle-earth as a case study to explore the creation and continuation of a fantastic universe, in this case Middle-earth, across mediums: a transmedial creation informed by its textual community. Building on the work of Mark J.P. Wolf, Colin B. Harvey, Celia Pearce, Matthew P. Miller, and Edward Castronova, my work reveals that the “worldness” of a transmedia universe, or the degree to which it exists as a complete and consistent cosmos, plays a core role in the production, acceptance, and continuation of its ontology among and across the fan communities respective to the mediums in which it operates. My paper argues that Tolkien’s literary texts and these associated adaptations are multi-participant sites in which participants negotiate their sense of self within a larger textual community. These multi-participant sites form the basis from which to investigate the larger social implications of selfhood and fan participation.

My theoretical framework provides the means by which to situate the critical aesthetics relative to how this fictional universe draws participants in. Engaging with Gordon Calleja’s discussions on immersion and Luis O. Arata’s thoughts on interactivity, I demonstrate how the transmedial storyworld of Middle-earth not only constructs a sense of space but that it is precisely this sense of space that engages the reader, viewer or gamer. To situate the sense of self incurred between and because of narrative and storyworld environment, I draw from Andreas Gregersen’s work on embodiment and interface, as well as from Shawn P. Wilbur’s work on identity in virtual communities. Anne Balsamo and Rebecca Borgstrom each offer a theorization of the role-playing specific to the multiplayer environments of game-based adaptations, while William H. Huber’s work contextualizes the production of space in epic fantasy narratives. Together, my theoretical framework highlights how the spread of a transmedial fantastic narrative impacts the connection patterns across the textual community of a particular storyworld, as well as foregrounds how the narrative environment shapes the degree of participant engagement in and with the space of that storyworld.

This proposal is for a long paper presentation; however, I'm able to condense if necessary to fit a short paper presentation.

Anderson-The Science of Sub-creation-148_a.pdf

4:15pm - 4:30pm
Distinguished Short Paper (10+5min) [abstract]

Layers of History in Digital Games

Derek Fewster

University of Helsinki

The past five years have seen a huge increase in historical game studies. Quite a few texts have tried to approach how history is presented and used in games, considering everything from philosophical points to more practical views related to historical culture and the many manifestations of heritage politics. The popularity of recent games like Assassin’s Creed, The Witcher and Elder Scrolls also manifests the current importance of deconstructing the messages and choices the games present. Their impact on the modern understanding of history, and on the general idea of time and change, is yet to be seen in its full effect.

The paper at hand is an attempt to structure the many layers or horizons of historicity in digital games into a single taxonomic system for researchers. The suggestion considers the various consciousnesses of time and narrative models that modern games work with. Several distinct horizons of time, both of design and of related real life, are interwoven to form the end product. The field of historical game studies could find this tool quite useful in its urgent need to systematize how digital culture is reshaping our minds and pasts.

The model considers aspects like memory culture, uses of period art and apocalyptic events, narrative structures, in-game events and real-world discourses as parts of how a perception of time and history is created or adapted. The suggested “layering of time” is applicable to a wide range of digital games.

Fewster-Layers of History in Digital Games-265_a.docx

4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

Critical Play, Hybrid Design and the Performance of Cultural Heritage Game/Stories

Lissa Holloway-Attaway

University of Skövde

In my talk, I propose to discuss the critical relationship between games designed and developed for cultural heritage and emergent Digital Humanities (DH) initiatives that focus on (re-)inscribing and reflecting on the shifting boundaries of human agency and its attendant relations. In particular, I will highlight theoretical and practical humanistic models (for development and as objects of scholarly research) that are conceived in tension with more computational emphases and influences. I examine how digital heritage games move us from an understanding of digital humanities as a “tool” or “text” oriented discipline to one where we identify critical practices that actively engage and promote convergent, hybrid and ontologically complex techno-human subjects to enrich our field of inquiry as DH scholars.

Drawing on principles such as embodiment, affect, and performativity, and analyzing transmedial storytelling and mixed reality games designed for heritage settings (and developed in my university research group), I argue for these games as an exemplary medium for enriching interdisciplinary digital humanities practices using methods currently called upon by recent DH scholarship. In these fully hybrid contexts where human/technology boundaries are richly intermingled, we recognize the importance of theoretical approaches for interpretation that are performative, not mechanistic (Drucker, in Gold, 2011): that is, we look at emergent experiences, driven by human intervention, not affirmed by technological development and technical interface affordances. Such hybridity, driven by human/humanities approaches, is explored more fully, for example, in Digital_Humanities by Burdick et al. (2012) and by N. Katherine Hayles in How We Think: Digital Media and Contemporary Technogenesis (2012). Collectively these scholars reveal how transformative and emerging disciplines can work together to re-think the role of the organic-technical beings at the center (and found at the margins and in-between subjectivities) within new forward-thinking DH studies. Currently, Hayles and others, like Matthew Gold (2012), offer frameworks for more interdisciplinary Digital Humanities methods (including Comparative Media and Culture Studies approaches) that are richly informed by investigations into the changing role and function of the user of technologies and media and the human/social contexts for use. Hayles, for example, explicitly claims that in Digital Humanities humans “think through, with, and alongside media” (1). In essence, our thinking and being, our digitization and our human-ness are mutually productive and intertwined.
Furthermore, we are multisensory in our access to knowing and we develop an understanding of the physical world in new ways that reorient our agencies and affects, redistributing them for other encounters with cultural and digital/material objects that are now ubiquitous and normalized.

Ross Parry, museum studies scholar, supports a similar model for inquiry and future advancement, based on the premise that digital tool use is now fully implemented and accepted in museum contexts, and so now we must deepen and develop our inquiries and practice (Parry, 2013). He claims that digital technologies have become normative in museums and that currently we find ourselves, then, in the age of the postdigital. Here critical scrutiny is key and necessary to mark this advanced state of change. For Parry this is an opportune, yet delicate juncture that requires a radical deepening of our understanding of the museums’ relationship to digital tools:

Postdigitality in the museum necessitates a rethinking of upon what museological and digital heritage research is predicated and on how its inquiry progresses. Plainly put, we have a space now (a duty even) to reframe our intellectual inquiry of digital in the museum to accommodate the postdigital condition. [Parry, 36]

For Parry, as with current DH calls for development, we must now focus on the contextualized practices in which these technologies will inevitably engage designers and users and promote robust theoretical and practical applications.

I argue that games, and in particular digital games designed for heritage experiences, are unique training grounds for such postdigital future development. They provide rich contexts for DH scholars working to deepen their understanding of performative and active interventions and intra-actions beyond texts and tools. As digital games have been adopted and ubiquitously assimilated in museums and heritage sites, we have opportunities to study the experiences of users as they performatively engage postdigital museum sites through rich forms of hybrid play. In such games, nuanced forms of interdisciplinary communication and storytelling happen in deeply integrated and embedded user/technology relationships. In heritage settings, interpretation is key to understanding histories from multiple user-driven perspectives, and it happens in acts of dynamic emergence, not as the result of mechanistic affordance. As such, DH designers and developers have much to learn from a rich body of games and heritage research, particularly that focused on critical and rhetorical design for play, Mixed Reality (MR) approaches, and users’ bodies as integral to narrative design (Anderson et al., 2010; Bogost, 2010; Flanagan, 2013; Mortara et al., 2014; Rouse et al., 2015; Sicart, 2011). MR provides a uniquely layered approach working across physical and digital artifacts and spaces, encouraging polysemic experiences that can support curators’ and historians’ desires to tell ever more complex and connected stories for museum and heritage site visitors, even involving visitors’ own voices in new ways. In combination, critical game design approaches and MR technologies, within the museum context, help re-center historical experience on the visitor’s body, voice, and agency, shifting emphasis away from material objects, also seen as static texts or sites for one-way, broadcast information.
Re-centering the design on users’ embodied experience with critical play in mind, and in MR settings, offers rich scholarship for DH studies and provides a variety of heritage, museum, entertainment, and participatory design examples to enrich the field of study for open, future and forward thinking.

Drawing on examples from heritage games developed within my university research group and in the heritage design network I co-founded, and implemented in museum and heritage sites, I will work to expose these connections. From transmedial children’s books focused on Nordic folktales, to playful AR experiences that expose the history of architectural achievements, as well as the meta reflections on the telling of those achievements in archival documentations (such as the development of the Brooklyn Bridge in the 19th C) I will provide an overview of how digital heritage games, in combination with new hybrid DH initiatives can be used for future development and research. This includes research around new digital literacies, collaborative and co-design approaches (with users) and experimental storytelling and narrative approaches for locative engagement in open-world settings, dependent on input from user/visitors.


Anderson, E. F., McLoughlin, L., Liarokapis, F., Peters, C., Petridis, P., de Freitas, S. Developing Serious Games for Cultural Heritage: A State-of-the-Art Review. In: Virtual Reality 14 (4). (2010)

Burdick, A., Drucker, J., Lunenfeld, P., Presner, T., Schnapp, J. Digital_Humanities. MIT Press, Cambridge, MA (2012)

Bogost, I. Persuasive Games: The Expressive Power of Videogames. MIT Press, Cambridge MA (2010)

Flanagan, M. Critical Play: Radical Game Design. MIT Press, Cambridge MA (2013)

Gold, M. K. Debates in the Digital Humanities. University of Minnesota Press, Minneapolis, MN (2012)

Hayles, K. N. How We Think: Digital Media and Contemporary Technogenesis. Chicago, University of Chicago Press, Chicago Il (2012)

Parry, R. The End of the Beginning: Normativity in the Postdigital Museum. In: Museum Worlds: Advances in Research, vol. 1, pgs. 24-39. Berghahn Books (2013)

Mortara, M., Catalano, C.E., Bellotti, F., Fiucci, G., Houry-Panchetti, M., Petridis, P. Learning Cultural Heritage by Serious Games. In: Journal of Cultural Heritage, vol. 15, no. 3, pp. 318-325. (2014)

Rouse, R., Engberg, M., JafariNaimi, N., Bolter, J. D. (Guest Eds.) Special Section: Understanding Mixed Reality. In: Digital Creativity, vol. 26, issue 3-4, pp. 175-227. (2015)

Sicart, M. The Ethics of Computer Games. MIT Press, Cambridge MA (2011)

Holloway-Attaway-Critical Play, Hybrid Design and the Performance of Cultural Heritage GameStories-252_a.pdf

4:45pm - 5:00pm
Short Paper (10+5min) [publication ready]

Researching Let’s Play gaming videos as gamevironments

Xenia Zeiler

University of Helsinki

Let’s Plays, as a specific form of gaming videos, are a rather new phenomenon, and it is not surprising that they are still relatively under-researched. So far, only a few publications focus on the theme. The specifics of Let’s Play gaming videos make them an unparalleled object of research in the vicinity of games – in the so-called gamevironments. The theoretical and methodological approach of the same name, literally merging the terms “games/gaming” and “environments”, was first proposed and discussed by Radde-Antweiler, Waltemathe and Zeiler (2014), who argue for broadening the study of video games, gaming and culture beyond media-centred approaches in order to better highlight recipient perspectives and actor-centred research. Gamevironments thus puts the spotlight on actors in their mediatized – and specifically gametized – life.

Zeiler-Researching Let’s Play gaming videos as gamevironments-185_a.pdf

5:00pm - 5:15pm
Short Paper (10+5min) [abstract]

The plague transformed: City of Hunger as mutation of narrative and form

Jennifer J Dellner

Ocean County College, United States of America

This short paper proposes and argues the hypothesis that Minna Sundberg’s interactive game in development, City of Hunger, an offshoot or spin-off of her well-respected digital comic, Stand Still Stay Silent, can be understood in terms of the ecology of the comic as a mutation of it; as such, her appropriation of a classic game genre and her storyline’s emphasis on the mechanical over the natural suggest promising avenues for understanding the uses of interactivity in the interpretation of narrative. In the game, the plague-illness of the comic’s ecology may or may not be gone, but conflict (vs. cooperation) becomes the primary mode of interaction for characters and reader-players alike. In order to produce the narrative, the reader-player will have to do battle as the characters do. Sundberg herself signals that her new genre is indivisible from the different ecology of the game world’s narrative: “City of Hunger will be a 2d narrative rpg with a turn-based battle system, mechanically inspired by your older final fantasy games, the Tales of-series and similar classical rpg's.” There will be a world of “rogue humans, mechanoids and mysterious alien beings to fight” (2017). While it remains to be seen how the game develops, its emphasis on machine-beings and aliens in a classic game environment (a “shadow of the past”) strongly suggests that the use of interactivity within each narrative has an interpretive and not merely performative dimension.

Dellner-The plague transformed-274_a.pdf

5:15pm - 5:30pm
Short Paper (10+5min) [abstract]

Names as a Part of Game Design

Lasse Hämäläinen

University of Helsinki

Video games often consist of several separate spaces of play. These are called, depending on the speaker and the type of game, for example levels, maps, tracks or worlds. In this paper, the term level is used. As there are usually many levels in a game, they need some kind of identifying element. In some games, levels only have ordinal numbers (Level 1, Level 2 etc.), but in others, they (also) have names.

Names are an important part of game design for at least three reasons. Firstly, giving names to places makes the imaginary world feel richer and deeper (Schell 2014: 351), improving the gameplay experience. Secondly, a name gives the player a first impression of the level (Rogers 2014: 220), helping him/her to perceive the level’s structure. And thirdly, level names are needed for discussing the levels. Members of a gaming community often want to share their experiences and emotions of gameplay. When doing so, it is important to contextualize the events: in which level did X happen?

Even though some game design scholars recognize the importance of names, there are very few studies of them. This presentation aims to fill this gap. I have analyzed level names in Playforia Minigolf, an online minigolf game designed in Finland in 2002. The data include the names of all 2,072 levels in the game. The analysis focuses especially on the principles of naming, or in other words, on what kind of connection there is between a name and the level’s characteristics.

The presentation also examines the change of naming practices during the game’s 15-year history. The oldest names mostly describe the levels in a simple, neutral manner, while the newest names are far more ambiguous and rarely have anything to do with the level’s characteristics. This change was probably caused by the change of level designers. The first levels of the game were designed by its developers, game design professionals, but over time, the responsibility for designing levels has passed to the game’s most passionate hobbyists. This result might be interesting for game studies, and especially for research on modding and modifications (see e.g. Unger 2012).


Playforia (2002). Minigolf. Finland: Apaja Creative Solutions Oy.

Rogers, Scott (2014). Level Up! The Guide to Great Video Game Design. Chichester: Wiley.

Schell, Jesse (2014). The Art of Game Design: A Book of Lenses. CRC Press.

Unger, Alexander (2012). Modding as a Part of Gaming Culture. – Fromme, Johannes & Alexander Unger (eds.): Computer Games and New Media Cultures. A Handbook of Digital Games Studies, 509–523.

Hämäläinen-Names as a Part of Game Design-118_a.pdf
Hämäläinen-Names as a Part of Game Design-118_c.pdf
5:30pm - 8:00pmDHN2018 closing party
Think Corner

Contact and Legal Notice · Contact Address:
Conference: DHN 2018
Conference Software - ConfTool Pro 2.6.122
© 2001 - 2018 by Dr. H. Weinreich, Hamburg, Germany