Session Overview
Location: Think Corner
9:00am - 9:15am | Introduction to the Digital & Critical Friday |
9:15am - 10:30am | Plenary 3: Caroline Bassett. Session Chair: Johanna Sumiala. ‘In that we travel there’ – but is that enough?: DH and Technological Utopianism. Also watchable remotely from PII, PIV and P674.
11:00am - 12:00pm | F-TC-1: Data, Activism and Transgression. Session Chair: Marianne Ping Huang
11:00am - 11:30am
Long Paper (20+10min) [abstract]
Shaping data futures: Towards non-data-centric data activism
1Consumer Society Research Centre, University of Helsinki, Finland; 2HIIT, Aalto University

The social science debate that attends to the exploitative forces of the quantification of aspects of life previously experienced in qualitative form, recognising the ubiquitous forms of datafied power and domination, is by now an established perspective from which to question datafication and algorithmic control (Ruckenstein and Schüll, 2017). Drawing from critical political economy and neo-Foucauldian analyses, researchers have explored the effects of datafication (Mayer-Schönberger and Cukier, 2013; Van Dijck, 2014) on the economy, public life, and self-understanding. Studies alert us to threats to privacy posed by “dataveillance” (Raley, 2013; Van Dijck, 2014): forms of surveillance distributed across multiple interested parties, including government agencies, insurance payers, operators, data aggregators, analytics companies, and individuals who provide the information either knowingly or unintentionally when going online, using self-tracking devices, loyalty programs, and credit cards. These “data traces” add to the data accumulated in databases, and personal data – any data related to a person or resulting from actions by a person – becomes utilized for business and societal purposes in an increasingly systematic manner (Van Dijck and Poell, 2016; Zuboff, 2015).

In this paper, we take an “activist stance”, aiming to contribute to the current criticism of datafication with the more participatory and collaborative approach offered by “data activism” (Baack, 2015; Milan and van der Velden, 2016), and by the civic and political engagement spurred by datafication. The various data-driven initiatives currently under development suggest that the problematic aspects of datafication, including the tension between data openness and data ownership (Neff, 2013), the asymmetries in data usage and distribution (Wilbanks and Topol, 2016; Kish and Topol, 2015) and the inadequacy of existing informed consent and privacy protections (Sharon, 2016), are by now not only well recognized, but are also generating new forms of civic and political engagement and activism. This calls for more debate on what these new forms of data activism are and how scholars in the humanities and social science communities can assess them.

By relying on approaches developed within the field of Techno-Anthropology (Børsen and Botin, 2013; Ruckenstein and Pantzar, 2015), which seeks to translate and mediate knowledge concerning complex technoscientific projects and aims, we positioned ourselves as “outside insiders” with regard to a data-centric initiative called MyData. In 2014, we became observers of and participants in MyData, which promotes the understanding that people benefit when they can control data gathering and analysis by public organizations and businesses, becoming more active data citizens and consumers. The high-level MyData vision, described in the MyData white paper written primarily by researchers at the Helsinki Institute for Information Technology and the Tampere University of Technology (Poikola et al., 2015), outlines an alternative future that transforms the ‘organisation-centric system’ into ‘a human-centric system’ that treats personal data as a resource that the individual can access, control, benefit and learn from.
The paper discusses “our” data activism and the activism of technology developers, promoting and relying on two different kinds of “social imaginaries” (Taylor, 2004). By doing so, we open a perspective on data activism that highlights the ideological and political underpinnings of contested social imaginaries and aims. Current data-driven initiatives tend to proceed with a social imaginary that treats data arrangements as solutions, or corrective measures addressing unsatisfactory developments. They advance the logic of an innovation culture, relying on the development of new technology structures and computationally intensive tools. This means that the data-driven initiatives rely on an engineering attitude that does not question the power of technological innovation for creating better societal solutions or, more broadly, the role of datafication in societal development. The main focus is on the correct positioning of technology: undesirable or harmful developments need to be reversed, or redirected towards ethically fairer and more responsible practices.

Since we do not possess strong technology skills, or proficiency in legal and regulatory matters, which would have aligned us with innovation-driven data activism, our position in the technology-driven data activism scene is structurally fairly weak. Our data activism is informed by a sensitivity to questions of cultural change and the critical stance representative of social scientific inquiry, questioning the optimistic and future-oriented social imaginary of technology developers. As will be discussed in our presentation, this means that our data activism is incompatible with that of technology developers in a profound sense, which explains why our activist role was repeatedly reduced to viewing a stream of diagrams on PowerPoint slides depicting databases and data flows. In terms of designing future data transfers and data flows, our social imaginary remained oddly irrelevant, intensifying the feeling that we were observing a moving target and that our task was simply to keep up, while the engineers were busy doing the real work of activists: developing approaches that give users more control over their personal data, such as the Kantara Initiative’s User-Managed Access (UMA) protocol, experimenting with blockchain technologies for digital identities such as Sovrin, and learning about “Vendor Relationship Management” systems (see Belli et al., 2017).

From the outsider position, we started to craft a narrative about the MyData initiative that aligns with our social imaginary. We wanted to push the conversation further, beyond the usual technological, legal and policy frameworks, and suggest that with its techno-optimism the current MyData work might actually weaken data activism and public support for it. We turned to literary and scholarly sources with the aim of opening a critical, but hopefully also productive, conversation about MyData in order to offer ideas on how to promote socially more robust data activism. A seminal text that shares the aims of the MyData initiative is Autonomous Technology: Technics-out-of-Control as a Theme in Political Thought (1978) by Langdon Winner. Winner perceives the relationship between human and technology in terms of Kantian autonomy: via an analysis of the interrelations of independence and dependence. The core ideas of the MyData vision have particular resonance with the way Winner (1978) considers “reverse adaptation”, wherein the human adapts to the power of the system and not the other way around.
In this paper, we first describe the MyData vision as it has been presented by the activists, and situate it in the framework of technology critique and current critiques of digital culture and economy. Here, we demonstrate that the outside position can, in fact, become a resource for re-articulating data activism. After this, we detail some further developments in the MyData scene and possibilities for dialogue and collaboration that have opened up during our data activism journey. We end the discussion by noting that truly promoting societally beneficial data arrangements requires work to circumvent the individualistic and data-centric biases of initiatives such as MyData. We promote non-data-centric data activism that meshes critical thinking into the mundane realities of everyday practices and calls for historically informed and collectively oriented alternatives and action.

Overall, our goal is to demonstrate that with a focus on ordinary people, professionals and communities of practice, ethnographic methods and practice-based analysis can deepen understandings of datafication by revealing how data and its technologies are taken up, valued, enacted, and sometimes repurposed in ways that either do not comply with imposed data regimes, or mobilize data in inventive ways (Nafus and Sherman, 2014). By learning about everyday data worlds and actual material data practices, we can strengthen the understanding of how data technologies could become part of promoting and enacting more responsible data futures. Paradoxically, in order to arrive at an understanding of how data initiatives support societally beneficial developments, non-data-centric data activism is called for. By aiming at non-data-centric data activism, we can continue to argue against triumphant data stories and technological solutionism in ways that are critical, but do not deny the possible value of digital data in future making. We will not try to protect ourselves against data forces but act imaginatively with and within them to develop new concepts, frameworks and collaborations in order to better steer them.

References
Baack, S. (2015). Datafication and empowerment: How the open data movement re-articulates notions of democracy, participation, and journalism. Big Data & Society, Oct.
Belli, L., Schwartz, M., & Louzada, L. (2017). Selling your soul while negotiating the conditions: From notice and consent to data control by design. Health and Technology, 1-15.
Børsen, T., & Botin, L. (Eds.) (2013). What Is Techno-Anthropology? Aalborg, Denmark: Aalborg University Press.
Kish, L. J., & Topol, E. J. (2015). Unpatients: Why patients should own their medical data. Nature Biotechnology, 33(9), 921-924.
Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Boston: Houghton Mifflin Harcourt.
McQuillan, D. (2016). Algorithmic paranoia and the convivial alternative. Big Data & Society, 3(2).
McStay, A. (2013). Privacy and Philosophy: New Media and Affective Protocol. New York: Peter Lang.
Milan, S., & van der Velden, L. (2016). The alternative epistemologies of data activism. Digital Culture & Society, 2(2), 57-74.
Nafus, D., & Sherman, J. (2014). This one does not go up to 11: The Quantified Self movement as an alternative big data practice. International Journal of Communication, 8, 1784-1794.
Poikola, A., Kuikkaniemi, K., & Kuittinen, O. (2014). My Data – Johdatus ihmiskeskeiseen henkilötiedon hyödyntämiseen [‘My Data – Introduction to Human-centred Utilisation of Personal Data’]. Helsinki: Finnish Ministry of Transport and Communications.
Poikola, A., Kuikkaniemi, K., & Honko, H. (2015). MyData – A Nordic Model for Human-Centered Personal Data Management and Processing. Helsinki: Finnish Ministry of Transport and Communications.
Raley, R. (2013). Dataveillance and countervailance. In Gitelman, L. (Ed.), “Raw Data” Is an Oxymoron. Cambridge, MA: MIT Press.
Ruckenstein, M., & Pantzar, M. (2015). Datafied life: Techno-anthropology as a site for exploration and experimentation. Techné: Research in Philosophy & Technology, 19(2), 191-210.
Ruckenstein, M., & Schüll, N. D. (2017). The datafication of health. Annual Review of Anthropology, 46.
Sharon, T. (2016). Self-tracking for health and the Quantified Self: Re-articulating autonomy, solidarity, and authenticity in an age of personalized healthcare. Philosophy & Technology, 1-29.
Taylor, C. (2004). Modern Social Imaginaries. Duke University Press.
Van Dijck, J. (2014). Datafication, dataism and dataveillance: Big data between scientific paradigm and ideology. Surveillance & Society, 12(2), 197-208.
Van Dijck, J., & Poell, T. (2016). Understanding the promises and premises of online health platforms. Big Data & Society, 3(1), 1-11.
Wilbanks, J. T., & Topol, E. J. (2016). Stop the privatization of health data. Nature, 535, 345-348.
Winner, L. (1978). Autonomous Technology: Technics-out-of-Control as a Theme in Political Thought. Cambridge, MA, & London: The MIT Press.
Zuboff, S. (2015). Big Other: Surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30, 75-89.

11:30am - 11:45am
Short Paper (10+5min) [publication ready]
Digitalisation of Consumption and Digital Humanities – Development Trajectories and Challenges for the Future
University of Helsinki, Ruralia Institute

Digitalisation transforms practically all areas of modern life: everything that can be digitalised, will be. Everyday routines and consumption practices in particular are in continual change, and new digital products and services are introduced at an accelerating pace. The purpose of this article is twofold: first, to explore the influence of digitalisation on consumption, and second, to canvass reasons for these digitalisation-driven transformations and possible future progressions. The transformations are explored through recent consumer studies, while the future development is based on interpretations of digitalisation. Our article recounts how the digitalisation of consumption has resulted in new forms of e-commerce, changing consumer roles and digital virtual consumption. Reasons for these changes and expected near-future progressions are drawn from data-driven, platform-based and disruption-generated visions. Challenges of combining consumption research and the digital humanities approach are discussed in the concluding section of the article.

11:45am - 12:00pm
Short Paper (10+5min) [abstract]
It’s your data, but my algorithms
Aalto University, School of Arts, Design and Architecture

The world is increasingly digital, but the understanding of how the digital affects everyday life is still often confused. Digitalisation is sometimes optimistically thought of as a rescue from hardships, be they economic or even educational. On the other hand, digitalisation is seen negatively as something one simply cannot avoid. Digital technologies have replaced many previous tools used in work as well as in leisure. Furthermore, digital technologies introduce an agency of their own into human processes, as noted by David Berry. By manipulating data through algorithms and communicating not only with humans but with other devices as well, digital technology presents new kinds of challenges for society and the individual. These digital systems and data flows get their instructions from the code that runs on them. The underlying code is itself neither objective nor value-free: it carries its own biases as well as the objectives of programmers, software companies and larger cultural viewpoints. As such, digital technology affects the ways we structure and comprehend, or are even able to comprehend, the world around us.

This article looks at the surrounding digitality through an artistic research project. Using code not as a functional tool but in a postmodern way, as a material for expression, the research focuses on how code as art can express the digital condition that might otherwise be difficult to put into words or comprehend in everyday life. The art project consists of a drawing robot controlled by an EEG headband that the visitor can wear. The headband allows the visitor to control the robot through the EEG readings read by the headband. As such, the visitor might get a feeling of being able to control the robot, but at the same time the robot interprets the data through its algorithms and thus controls the visitor’s data.

The aim of this research project is to give perspectives on the everydayness of digitality. It questions how we comprehend the digital in everyday life and asks how we should embody digitality in the future. The benefits of artistic research lie in the way it can broaden the conceptions of how we know and, as such, deepen one’s understanding of the complexities of the world. Furthermore, artistic research can open the research subject to alternative interpretations. As such, this research project aims at the same time to deepen the discussion of digitalisation and to broaden it to alternative understandings. The alternative ways of seeing a phenomenon, like digitality, are essential to the ways the future is developed. The proposed research consists of both a theoretical text and the interactive artwork, which would be present at the conference.
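As a rough illustration of the kind of pipeline the artwork describes (entirely hypothetical; the abstract does not specify the installation's hardware or algorithms), an EEG signal stream might be mapped to pen movements like this:

import math, random

def eeg_stream():
    """Stand-in for headband readings: attention values in [0, 1]."""
    while True:
        yield random.random()

def to_pen_command(attention, t):
    """Map an attention value to a pen position on a circular path.
    The radius grows with attention, so the visitor steers the drawing
    only indirectly, through the algorithm's interpretation of the data."""
    radius = 10 + 90 * attention
    return (radius * math.cos(t), radius * math.sin(t))

stream = eeg_stream()
for step in range(5):
    x, y = to_pen_command(next(stream), t=step * 0.1)
    print(f"move pen to ({x:.1f}, {y:.1f})")

The point of such a mapping, in the terms of the abstract, is precisely that the code layer between reading and drawing is neither neutral nor transparent to the visitor.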
12:45pm - 2:30pm | Poster Slam (lunch continues), Poster Exhibition & Coffee. Session Chair: Annika Rockenberger
Poster [abstract]
Sharing letters and art as digital cultural heritage, co-operation and basic research
Svenska litteratursällskapet i Finland

Albert Edelfelts brev (edelfelt.fi) is a web publication developed at the Society of Swedish Literature in Finland. In co-operation with the Finnish National Gallery, we publish letters of the Finnish artist Albert Edelfelt (1854–1905) combined with pictures of his artworks. In 2016, Albert Edelfelts brev received the State Award for the dissemination of information. The co-operation between institutions and basic research on the material have enabled a unique reconstruction of Edelfelt’s artistry and his time, in the service of researchers and other users. I will present how we have done it and how we plan to further develop the website.

The website Albert Edelfelts brev was launched in September 2014, with a sample of Edelfelt’s letters and paintings. Our intention is to publish all the letters Albert Edelfelt wrote to his mother Alexandra (1833–1901). The collection consists of 1 310 letters that range over 30 years and cover most of Edelfelt’s adult life. The letters are in the care of the Society of Swedish Literature in Finland. We also have at our disposal close to 7 000 pictures of Edelfelt’s paintings and sketches in the care of the Finnish National Gallery. In the context of digital humanities, the volume of the material at hand is manageable. However, for researchers who think that they might have use for the material, but are unsure of exactly where or what to look for, it might be labour-intensive to go through all the letters and pictures. We have combined professional expertise and basic research on the material with digital solutions to make it as easy as possible to make use of what the content can offer.

As editor of the web publication, I spend a considerable part of my work on basic research: identifying people and pinpointing paintings and places that Edelfelt mentions in his letters. By linking the content of a letter to artworks, persons, places and subjects/reference words, users can easily navigate the material. Each letter, artwork and person has a page of its own. Places and subjects are also searchable and listed. The letters are available as facsimile pictures of the handwritten pages. Each letter has a permanent web resource identifier (URN:NBN). In order to make it easier for users to decide whether a letter is of interest, we have tagged subjects using reference words from ALLÄRS (a common thesaurus in Swedish). We have also written abstracts of the content, divided them into separate “events” and tagged mentioned artworks, people and places to these events.

Each artwork of Edelfelt’s has a page of its own. Here, users find a picture of the artwork (if available) and earlier sketches of the artwork (if available). By looking at the pictures, they can see how the working process of the painting developed. Users can also follow the process through what Edelfelt writes in his letters. All the events from the letter abstracts that are tagged to the specific artwork are listed in chronological order on the artwork page. Persons tagged in the letter abstracts also have pages of their own. On a person page, users find basic facts and links to other webpages with information about the person. Any events from the letter abstracts mentioning the person are listed as well. In other words, through a one-click solution users can find an overview of everything Edelfelt’s letters have to say about a specific person.
Tagging persons to events has also made it possible to build graphs of a person’s social network, based on how many times other persons are tagged to the same events as the specific person. There is a link to these graphs on every person page.

Apart from researchers who have a direct interest in the material, we have also wanted to open up the cultural heritage to a broader public and group of users. Each month the editorial staff writes a blog post on SLS-bloggen (http://www.sls.fi/sv/blogg). Albert Edelfelts brev also has a profile on Facebook (https://www.facebook.com/albertedelfeltsbrev/) where we post excerpts of letters on the same date as Edelfelt wrote the original letter. By doing so we hope to give the public an insight into the life of Edelfelt and the material, and involve them in the progress of the project.

The web publication has open access. The mix of different sources and the co-operation with other heritage institutions have led to a mix of licenses for how users can copy and redistribute the published material. The Finnish National Gallery (FNG) owns the copyright to its pictures in the publication, and users have to get permission from FNG to copy and redistribute that material. The artwork pages contain descriptions of the paintings written by the art historian Bertel Hintze, who published a catalogue of Edelfelt’s art in 1942. These texts are licensed under Creative Commons Attribution-NoDerivs 4.0 International (CC BY-ND 4.0). Edelfelt’s letters, as well as the texts and metadata produced by the editorial staff at the Society of Swedish Literature in Finland, have a Creative Commons CC0 1.0 Universal license. Data with a Creative Commons license is also freely available as open data through a REST API (http://edelfelt.sls.fi/apiinfo/). In the future, we would like to find a common practice for the user rights; if possible, such that all the material would have the same license.

We intend to invite other institutions with artworks by Edelfelt to co-operate, offering the same kind of partnership as the web publication has with the Finnish National Gallery. Thus, we are striving towards as complete a site as possible for the artworks of Edelfelt. Albert Edelfelt is of national interest and his letters, which he mostly wrote during his stays abroad, contain information of international interest. Therefore, we plan to offer the metadata and at least some of the source material in Finnish and English translations. So far, the letters are only available as facsimile. The development of transcription programs for handwritten texts has made it probable that we could in the future include transcriptions of the letters in the web publication. Linguists especially have an interest in searchable letter transcriptions for their research, and the transcriptions would also be helpful for users who might have problems reading the handwritten text.
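As a sketch of the network idea (hypothetical; the abstract does not describe the actual implementation behind edelfelt.fi), a person’s social network can be derived by counting how often pairs of persons are tagged to the same events:

from collections import Counter
from itertools import combinations

# Each event carries the set of persons tagged to it (invented sample data).
events = [
    {"Alexandra Edelfelt", "Berndt Lindholm"},
    {"Alexandra Edelfelt", "Berndt Lindholm", "Adolf von Becker"},
    {"Alexandra Edelfelt", "Adolf von Becker"},
]

# Weighted edges: number of shared events per pair of persons.
edges = Counter()
for persons in events:
    for pair in combinations(sorted(persons), 2):
        edges[pair] += 1

def ego_network(person):
    """The weighted edges touching one person, as shown on a person page."""
    return {pair: w for pair, w in edges.items() if person in pair}

print(ego_network("Alexandra Edelfelt"))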
Poster [abstract]
Metadata Analysis and Text Reuse Detection: Reassessing public discourse in Finland through newspapers and journals 1771–1917
1University of Turku; 2University of Helsinki

During the period 1771–1917, newspapers developed into a mass medium in the Grand Duchy of Finland. This happened within two different imperial configurations (Sweden until 1809, Russia 1809–1917) and in two main languages, Swedish and Finnish. The Computational History and the Transformation of Public Discourse in Finland, 1640–1910 (COMHIS) project studies the transformation of public discourse in Europe and in Finland via an innovative combination of original data, state-of-the-art quantitative methods that have not previously been applied in this context, and an open source collaboration model. In this study the project combines the statistical analysis of newspaper metadata with the analysis of text reuse within the papers to trace the expansion of, and exchange within, Finnish newspapers published in the long nineteenth century.

The analysis is based on the metadata and content of digitized Finnish newspapers published by the National Library of Finland. The dataset includes the full text of all newspapers and most periodicals published in Finland between 1771 and 1920. The analysis of metadata builds on data harmonization and enrichment, extracting information on columns, type sets, publication frequencies and circulation records from the full-text files or outside sources. Our analysis of text reuse is based on a modified version of the Basic Local Alignment Search Tool (BLAST) algorithm, which detects similar sequences and was initially developed for fast alignment of biomolecular sequences, such as DNA chains. We have further modified the algorithm in order to identify text reuse patterns. BLAST is robust to deviations in the text content, and as such is able to effectively circumvent errors or differences arising from optical character recognition (OCR).

By relating metadata on publication places, language, number of issues, number of words, size of papers, and publishers, and comparing that to the existing scholarship on newspaper history and censorship, the study provides a more accurate bird’s-eye view of newspaper publishing in Finland after 1771. By pinpointing key moments in the development of journalism, the study suggests that while the discussions in the public sphere were inherently bilingual, the technological and journalistic developments advanced at different speeds in Swedish- and Finnish-language forums. It further assesses the development of the press in comparison with book production and periodicals, pointing towards a specialization of newspapers as a medium in the post-1860 period. Of special interest is that the growth and specialization of the newspaper medium was much indebted to newspapers being established all over the country and thus becoming forums for local debates. The existence of a medium encompassing the whole country was crucial to the birth of a national imaginary. Yet the national public sphere was not without regional intellectual asymmetries. This study traces these asymmetries by analysing text reuse in the whole newspaper corpus. It shows which papers and which cities functioned as “senders” and “receivers” in the public discourse of this period. It is furthermore essential that newspapers and periodicals had several functions throughout the period, and the role of the public sphere cannot be taken for granted.
The analysis of text reuse further paints a picture of virality in newspaper publishing that was indicative of modern journalistic practices, but it also reveals the rapidly expanding capacity of the press. These findings can be further contrasted with other features commonly associated with the birth of modern journalism, such as publication frequency, page sizes and the typesetting of the papers. All algorithms, software, and the text reuse database will be made openly available online, and can be located through the project’s repositories (https://comhis.github.io/ and https://github.com/avjves/textreuse-blast). The results of the text reuse detection carried out with BLAST are stored in a database and will also be made available for exploration by other researchers.
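To make the seed-and-extend idea behind BLAST-style text reuse detection concrete, here is a deliberately simplified sketch (a toy illustration, not the project’s modified BLAST): shared character n-grams act as seeds, which are extended while tolerating a bounded number of mismatches, so that OCR errors do not break a match.

def ngram_index(text, n=10):
    """Map each character n-gram of a text to its start positions."""
    index = {}
    for i in range(len(text) - n + 1):
        index.setdefault(text[i:i + n], []).append(i)
    return index

def extend(a, b, i, j, budget=3):
    """Extend right from a[i], b[j] while the mismatch budget lasts."""
    while i < len(a) and j < len(b):
        if a[i] != b[j]:
            if budget == 0:
                break
            budget -= 1
        i, j = i + 1, j + 1
    return i, j

def reuse_candidates(a, b, n=10, min_len=40):
    """Return passages of text a that plausibly reappear in text b."""
    index_b = ngram_index(b, n)
    hits = set()
    for i in range(len(a) - n + 1):
        for j in index_b.get(a[i:i + n], []):
            end_a, _ = extend(a, b, i + n, j + n)
            if end_a - i >= min_len:
                hits.add(a[i:end_a])
    return hits

The project’s actual pipeline additionally clusters hits across the whole corpus and stores the results in the database mentioned above.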
Poster [abstract]
Oceanic Exchanges: Tracing Global Information Networks In Historical Newspaper Repositories, 1840-1914
University of Turku

Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840-1914 (OcEx) is an international and interdisciplinary project, funded through the Digging into Data – Transatlantic Platform programme, focused on studying the global spread of news in nineteenth-century newspapers. The project combines digitized newspapers from Europe, the US, Mexico, Australia, New Zealand, and the British and Dutch colonies of that time around the world. The project examines patterns of information flow, the spread of text reuse, and global conceptual changes across national, cultural and linguistic boundaries in nineteenth-century newspapers. The project links the different newspaper corpora, scattered across different national libraries and collections, described by various kinds of metadata and printed in several languages, into one whole. The project proposes to present a poster at the Nordic Digital Humanities Conference 2018. The project started in June 2017, and the aim of the poster is to present its current status.

The research group members come from Finland, the US, the Netherlands, Germany, Mexico, and the UK. OcEx’s participating institutions are Loughborough University, Northeastern University, North Carolina State University, Universität Stuttgart, Universidad Nacional Autónoma de México, University College London, University of Nebraska-Lincoln, University of Turku, and Utrecht University. The project’s 90 million newspaper pages come from Australia’s Trove Newspapers, the British Newspaper Archive, Chronicling America (US), Europeana Newspapers, Hemeroteca Nacional Digital de México, the National Library of Finland, the National Library of the Netherlands (KB), the National Library of Wales, New Zealand’s PapersPast, and a strategic collaboration with Cengage Publishing, one of the leading commercial custodians of digitized newspapers.

Objectives
Our team will hone computational tools, some developed in prior research by project partners and some novel, into a suite of openly available tools, data, and analyses that trace a broad range of language-related phenomena (including text reuse, translational shifts, and discursive changes). Analysing such parameters enables us to characterize “reception cultures”, “dissemination cultures”, and “reference cultures” in terms of asymmetrical flow patterns, or to analyse the relationships between reporting targeted at immigrant communities and their surrounding host countries. OcEx will leverage existing relationships and agreements between its teams and data providers to connect disparate digital newspaper collections, opening new questions about historical globalism and modeling consortial approaches to transnational newspaper research. OcEx will take up challenging questions of historical information flow, including:
1. Which stories spread between nations and how quickly?
2. Which texts were translated and resonated across languages?
3. How did textual copying (reprinting) operate internationally compared to conceptual copying (idea spread)?
4. How did the migration of texts facilitate the circulation of knowledge, ideas, and concepts, and how were these ideas transformed as they moved from one Atlantic context to another?
5. How did geopolitical realities (e.g. economic integration, technology, migration, geopolitical power) influence the directionality of these transnational exchanges?
6. How does reporting in immigrant and ethnic communities differ from reporting in surrounding host countries?
7. Does the national organization of digitized newspaper archives artificially foreclose globally-oriented research questions and outcomes?

Methodology
OcEx will develop a semantically interoperable knowledge structure, or ontology, for expressing thematic and textual connections among historical newspaper archives. Even with standards in place, digitization projects pursue differing approaches that pose challenges to integration or to particular levels of analysis. In most projects, for instance, generic identification of items within newspapers has not been pursued. In order to build an ontology, this project will build on knowledge acquired by participating academic partners, such as the TimeCapsule project at Utrecht University, as well as analytical software that has been tested and used by team members, such as viral text analysis. OcEx does not aim to create a totalizing research infrastructure but rather to expose the conditions under which researchers can work across collections, helping to guide similar future projects seeking to bridge national collections. This ontology will be established through comparative investigations of phenomena illustrating textual links: reprinting and topic dissemination. We have divided the tasks into six work packages:

WP1: Management
➢ create an international network of researchers to discuss issues of using and accessing newspaper repository data and combine expertise toward better development and management of such data;
➢ assemble a project advisory board, consisting of representatives of public and private data custodians and other critical stakeholders.

WP2: Assessment of Data and Metadata
➢ investigate and develop classifier models of the visual features of newspaper content and genres;
➢ create a corpus of annotations on clusters/passages that records relationships among textual versions.

WP3: Creating a Networked Ontology for Research
➢ create an ontology of genres, forms, and elements of texts to support that annotation;
➢ select and develop best practices based on available technology (the semantic web standard RDF, linked data, SKOS, and XML markup standards such as TEI).

WP4: Textual Migration and Viral Texts
➢ analyze text reuse across archives using statistical language models to detect clusters of reprinted passages;
➢ perform analyses of aggregate information flows within and across countries, regions, and publications;
➢ develop adaptive visualization methods for results.

WP5: Conceptual Migration and Translation Shifts
➢ perform scalable multilingual topic model inference across corpora to discern translations, shared topics, topic shifts, and concept drift within and across languages, using distributional analysis and (hierarchical) polylingual topic models;
➢ analyze the migration and translation of ideas over regional and linguistic borders;
➢ develop adaptive visualization methods for the results.

WP6: Tools of Delivery/Dissemination
➢ validate test results in scholarly contexts/test sessions at academic institutions;
➢ conduct analysis of the sensitivity of results to the availability of corpora in different languages and levels of access;
➢ share findings (data structures/availability/compatibility, user experiences) with institutional partners;
➢ package code, annotated data (where possible), and the ontology for public release.
Poster [abstract]
ArchiMob: A multidialectal corpus of Swiss German oral history interviews
1University of Helsinki, Department of Digital Humanities; 2University of Zurich, CorpusLab, URPP Language and Space

Although dialect usage is prevalent in the German-speaking part of Switzerland, digital resources for dialectological and computational linguistic research are difficult to obtain. In this paper, we present a freely available corpus of spontaneous speech in various Swiss German dialects. It consists of transcriptions of video interviews with contemporary witnesses of the Second World War period in Switzerland. These recordings were produced by an association of Swiss historians called Archimob about 20 years ago. More than 500 informants, stemming from all linguistic regions of Switzerland (German, French and Italian) and representing both genders, different social backgrounds, and different political views, were interviewed. Each interview is 1 to 2 hours long. In collaboration with the University of Zurich, we have selected, processed and analyzed a subset of 43 interviews in different Swiss German dialects.

The goal of this contribution is twofold. First, we describe how the documents were transcribed, segmented and aligned with the audio source, and how we make the data available through specifically adapted corpus query engines. We also provide an additional normalization layer in order to reduce the different types of variation (dialectal, speaker-specific and transcriber-specific) present in the transcriptions. We formalize normalization as a machine translation task, obtaining up to 90% accuracy (Scherrer & Ljubešić 2016). Second, we show through some examples how the ArchiMob resource can shed new light on research questions from digital humanities in general, and dialectology and history in particular:
• Thanks to the normalization layer, dialect differences can be identified and compared with existing dialectological knowledge.
• Using language modelling, another technique borrowed from language technology, we can compute distances between texts. These distance measures allow us to identify the dialect of unknown utterances (Zampieri et al. 2017), localize transcriber effects and obtain a generic picture of the Swiss German dialect landscape (see the sketch below).
• Departing from the purely formal analysis of the transcriptions for dialectological purposes, we can apply methods such as collocation analysis to investigate the content of the interviews. By identifying the key concepts and events referred to in the interviews, we can assess how the different informants perceive and describe the same time period.
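A minimal sketch of the language-modelling idea in the second point above (an assumed simplification; the actual ArchiMob experiments use more refined models): score an unknown utterance under a smoothed character n-gram model trained per dialect, and pick the dialect whose model yields the lowest cross-entropy.

import math
from collections import Counter

def char_ngrams(text, n=3):
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]

class CharLM:
    """Additively smoothed character n-gram model."""
    def __init__(self, text, n=3, alpha=0.1):
        self.n, self.alpha = n, alpha
        self.ngrams = Counter(char_ngrams(text, n))
        self.contexts = Counter(g[:-1] for g in self.ngrams.elements())
        self.vocab = len(set(text)) + 1

    def logprob(self, gram):
        num = self.ngrams[gram] + self.alpha
        den = self.contexts[gram[:-1]] + self.alpha * self.vocab
        return math.log(num / den)

    def cross_entropy(self, text):
        grams = char_ngrams(text, self.n)
        return -sum(self.logprob(g) for g in grams) / max(len(grams), 1)

def nearest_dialect(utterance, samples):
    """samples: dict mapping dialect name -> training text."""
    models = {d: CharLM(t) for d, t in samples.items()}
    return min(models, key=lambda d: models[d].cross_entropy(utterance))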
Poster [abstract]
Serious gaming to support stakeholder participation and analysis in Nordic climate adaptation research
1Linköping University; 2University of Helsinki

Introduction
While climate change adaptation research in the Nordic context has advanced significantly in recent years, we still lack a thorough discussion on maladaptation, i.e. the unintended negative outcomes that result from implemented adaptation measures. In order to identify and assess examples of maladaptation in the agricultural sector, we developed a novel methodology integrating visualization, participatory methods and serious gaming. This enables research and policy analysis of trade-offs between mitigation and adaptation options, as well as between alternative adaptation options, together with stakeholders in the agricultural sector. Stakeholders from the agricultural sector in Sweden and Finland have been engaged in the exploration of potential maladaptive outcomes of climate adaptation measures by means of a serious game on maladaptation in Nordic agriculture, and have discussed their relevance and related trade-offs.

The Game
The Maladaptation Game is designed as a single-player game. It is web-based and allows a moderator to collect the settings and results for each player involved in a session, store these for analysis, and display the results on a ‘moderator screen’. The game is designed for agricultural stakeholders in the Nordic countries, and requires some prior understanding of the challenges that climate change can impose on Nordic agriculture as well as the scope and function of adaptation measures to address these challenges. The gameplay consists of four challenges, each involving multiple steps. At the start of the game, the player is equipped with a limited number of coins, which decrease for each measure that is selected. As such, the player has to consider the implications in terms of risk and potential negative effects of a selected measure, as well as the costs of each of these measures. The player faces four different climate-related challenges – increased precipitation, drought, increased occurrence of pests and weeds, and a prolonged growing season – that are all relevant to Nordic agriculture. The player selects one challenge at a time. Each challenge has to be addressed, and once a challenge has been concluded, the player cannot return and revise the selection. When entering a challenge (e.g. precipitation), possible adaptation measures that can be taken to address it in an agricultural context are displayed as illustrated cards on the game interface. Each card can be turned to reveal more information, i.e. a descriptive text and the related costs. The player can explore all cards before selecting one. The selected adaptation measure then leads to a potential maladaptive outcome, which is again displayed as an illustrated card with an explanatory text on the back. The player has to decide whether to reject or accept this potential negative outcome. If the maladaptive outcome is rejected, the player returns to the previous view, where all adaptation measures for the current challenge are displayed, and can select another measure and again decide whether to accept or reject the potential negative outcome presented for it. In order to complete a challenge, one adaptation measure with its related negative outcome has to be accepted.
After completing a challenge, the player returns to the entry page where, in addition to the overview of all challenges, a small scoreboard summarizes the selections made and displays the updated number of coins as well as a score of maladaptation points. These points represent the negative maladaptation score for the selected measures and are a value that the player does not know prior to making the decision. The game continues until selections have been made for all four challenges. At the end of the game, the player has an updated scoreboard with three main elements: the summary of the selections made for each challenge, the remaining number of coins, and the total sum of the negative maladaptation score. The scoreboards of all players involved in a session now appear on the moderator screen. This setup allows the individual player to compare his or her pathways and results with other players. The key feature of the game is hence the stimulation of discussions and reflections concerning adaptation measures and their potential negative outcomes, both with regard to adding knowledge about adaptation measures and their impact, and with regard to the threshold of when an outcome is considered maladaptive, i.e. what trade-offs are made within agricultural climate adaptation.

Preliminary conclusions from the visualization-supported gaming workshops
During autumn 2016, eight gaming workshops were held in Sweden and Finland. These workshops were designed as visualization-supported focus groups, allowing for some general reflections, but also for individual interaction with the web-based game. Stakeholders included farmers, agricultural extension officers, and representatives of branch organizations as well as agricultural authorities at the national and regional level. Focus group discussions were recorded and transcribed in order to analyze the empirical results with a focus on agricultural adaptation and potential maladaptive outcomes. Preliminary conclusions from these workshops point towards several issues relating both to the content and to the functionality of the game. While, as a general conclusion, the stakeholders were able to get acquainted with the game quickly and interact without larger difficulties, a few individual participants were negative towards the general idea of engaging with a game to discuss these issues. The level of interactivity that the game allows, where players can test and explore before making a decision, enabled reflections and discussions also during gameplay. Stakeholders frequently tested and returned to some of the possible choices before deciding on their final setting. Since the game demands the acceptance of a potential negative outcome, several stakeholders described their impression of the game as a ‘pest or cholera’ situation, i.e. a choice between two evils. In terms of empirical results, the workshops generated a large number of issues regarding the definition of maladaptive outcomes and their thresholds in relation to contextual aspects, such as temporal and spatial scales, as well as reflections regarding the relevance and applicability of the proposed adaptation measures and negative outcomes.
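As a schematic illustration of the gameplay loop described above (a hypothetical sketch with invented numbers; the actual game is a web application), the core state can be modelled as follows:

from dataclasses import dataclass, field

@dataclass
class Measure:
    name: str
    cost: int
    maladaptation_points: int   # hidden from the player until accepted

@dataclass
class GameState:
    coins: int = 20             # assumed starting budget
    completed: dict = field(default_factory=dict)

    def accept(self, challenge: str, measure: Measure) -> None:
        """Accepting a measure concludes the challenge irrevocably."""
        assert challenge not in self.completed, "challenge already concluded"
        self.coins -= measure.cost
        self.completed[challenge] = measure

    def scoreboard(self) -> dict:
        return {
            "selections": {c: m.name for c, m in self.completed.items()},
            "coins_left": self.coins,
            "maladaptation": sum(m.maladaptation_points
                                 for m in self.completed.values()),
        }

# Example: concluding one of the four challenges.
game = GameState()
game.accept("precipitation", Measure("improved drainage", cost=5,
                                     maladaptation_points=2))
print(game.scoreboard())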
Poster [abstract]
Challenges in textual criticism and editorial transparency
Svenska litteratursällskapet i Finland

Henry Parlands Skrifter (HPS) is a digital critical edition of the works and correspondence of the modernist author Henry Parland (1908–1930). The poster presents the strategies chosen for communicating the results of the process of textual criticism in a digital environment. How can we make the foundations for editorial decisions transparent and easily accessible to a reader? Textual criticism is, by one of several definitions, “the scientific study of a text with the intention of producing a reliable edition” (Nationalencyklopedin, “textkritik”; our translation). When possible, the texts of the HPS edition are based on original prints whose publication was initiated by the author during his lifetime. However, rendering a reliable text largely requires a return to original manuscripts, as only a fraction of Parland’s works were published before the author’s death at the age of 22 in 1930. Posthumous publications often lack reliability due to the editorial practices, and sometimes primarily aesthetic solutions to textual problems, of later editors.

The main structure of the Parland digital edition is related to Zacharias Topelius Skrifter (topelius.sls.fi) and similar editions (e.g. grundtvigsværker.dk). However, the Parland edition has foregone the system of a, theoretically, unlimited number of columns in favour of only two fields for text: a field for the reading text, which holds a central position on the webpage, and a smaller, optional field containing, in different tabs, editorial commentary, facsimiles and transcriptions of manuscripts and original prints. The benefit of this approach is easier navigation. If readers wish to view several fields at once, they may do so by using several browser windows, which is explained in the user’s guide.

The texts of the edition are transcribed in XML and encoded following the TEI (Text Encoding Initiative) Guidelines P5. Manuscripts, or original prints, and edited reading texts are rendered in different files (see further below). All manuscripts and original prints used in the edition are presented as high-resolution facsimiles. The reader thus has access to the different versions of the text in full, as a complement to the editorial commentary. Parland’s manuscripts often contain several layers of changes (additions, deletions, substitutions): those made by the author himself during the initial process of writing or during a later revision, and those made by posthumous editors selecting and preparing manuscripts for publication. The editor is thus required to analyse the manuscripts in order to include only changes made by the author in the text of the edition. The posthumous changes are included in the transcriptions of the manuscripts and encoded using the same TEI elements as the author’s changes, with the addition of attributes indicating the other hand and pen (@hand and @medium). In the digital edition these changes, as well as other posthumous markings and notes, are displayed in a separate colour. A tooltip displays the identity of the other hand. One of the benefits of this solution is transparency towards the reader, through visualization of the editor’s interpretation of all sections of the manuscript. The use of standard TEI elements and attributes facilitates possible reuse of the XML documents for purposes outside of the edition.
For the Parland project, there were also practical benefits concerning technical solutions and workflow in using mark-up that had already, though to a somewhat smaller extent, been used by the Zacharias Topelius edition. The downside of using the same elements for both authorial and posthumous changes is that the XML file does not easily lend itself to a visualization of the author’s version. Although this surely would not be impossible with an appropriately designed stylesheet, we have deemed it more practical to keep manuscripts and edited reading texts in separate files. All posthumous interventions and the associated mark-up are removed from the edited text, which has the added practical benefit of making the XML document more easily readable to a human editor. However, the information value of the separate files is more limited than that of a single file would be. The file with the edited text still contains the complete author’s version, according to the critical analysis of the editor. Editorial changes to the author’s text are grouped together with the original wording in the TEI element choice, and the changes are visualized in the digital edition: the changed section is highlighted and the original wording displayed in a tooltip. Thus, the combination of facsimile, transcription and edited text in the digital edition visualizes the editor’s source(s), interpretation and changes to the text.

Sources
Nationalencyklopedin, “textkritik”. http://www.ne.se/uppslagsverk/encyklopedi/lång/textkritik (accessed 2017-10-19).
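A minimal illustration of the mark-up strategy described above (the sample content and hand identifier are invented; element usage follows TEI P5, and whether HPS pairs orig/reg or sic/corr inside choice is an assumption here):

<!-- manuscript file: an authorial deletion and a posthumous pencil addition -->
<l>jag <del>skrev</del> <add>diktade</add> om bensin
   <add hand="#posthumousEditor" medium="pencil">1929</add></l>

<!-- edited reading text: editorial change grouped with the original wording -->
<l>vi <choice><orig>reser</orig><reg>reste</reg></choice> till staden</l>

A stylesheet can then render the posthumous @hand/@medium material in a separate colour with a tooltip, while the choice pairs drive the highlighted-change tooltips in the reading text.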
Poster [publication ready]
Digitizing the Icelandic-Danish Blöndal Dictionary
The Árni Magnússon Institute for Icelandic Studies, Iceland

The Icelandic-Danish dictionary compiled by Sigfús Blöndal in the early 20th century is being digitized. It is the largest dictionary ever published in Icelandic, containing in total more than 150,000 entries. The digitization work started with a pilot project in 2016, resulting in a comprehensive plan for how to carry out the task. The paper describes the ongoing work and the methods and tools applied, as well as the aim and rationale of the project. We opted for OCR rather than double-keying, which has become common for similar projects; first results suggest the outcome is satisfactory, as the final version will be proofread. The entries are annotated with XML entities, using a workbench built for the project. We apply automatic annotation for the most consistent entities, but other annotation is carried out manually. The data is then exported into a relational database, proofread and finally published. The publication date is set for spring 2020.
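For illustration, an annotated entry might look roughly like this (entirely hypothetical element names and content; the project’s actual schema is not given in the abstract):

<entry>
  <headword>hestur</headword>
  <gram>m.</gram>                       <!-- consistent entity: automatable -->
  <sense n="1">
    <trans xml:lang="da">hest</trans>   <!-- Danish equivalent -->
  </sense>
</entry>

Under this assumption, consistently formatted entities such as headwords and grammatical labels lend themselves to automatic annotation, while less regular parts of an entry are marked up manually.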
Poster [abstract]
Network visualization for historical corpus linguistics: externally-defined variables as node attributes
University of Oslo

In my poster presentation, I will explore whether and how network visualization can benefit philological and historical-linguistic research. This will be implemented by examining the usability of network visualization for the study of early medieval Latin scribes’ language competences. Thus, the scope is mainly methodological, but the proposed methodological choices will be illustrated by applying them to a real data set. Four linguistic variables extracted corpus-linguistically from a treebank will be examined: spelling correctness, classical Latin prepositions, the genitive plural form, and the <ae> diphthong. All four are continuous, which is typical of linguistic variables. The variables represent different domains of language competence of scribes who, by that time, learnt written Latin practically as a second language. More linguistic features will be included in the analysis if my ongoing project proceeds as planned. The primary objective of the study is thus to find out whether the network visualization approach has demonstrable advantages over ordinary cross-tabulations as far as support for philological and historical-linguistic argumentation is concerned. The main means of visualization will be the gradient colour palette in Gephi, a widely used open-source network analysis and visualization software package.

As an inevitable part of the described enterprise, it is necessary to clarify the scientific premises for using a network environment to display externally-defined values of linguistic variables. It is obvious that in order to be utilized for research purposes, network visualization must be as objective and replicable as possible. By way of definition, I emphasize that the proposed study will not deal with linguistic networks proper, i.e. networks which are directly induced or synthesized from a linguistic data set and represent abstract relations between linguistic units. Consequently, no network metric will be calculated, even though that might be interesting as such. What will be visualized are the distributions of linguistic variables that do not arise from the network itself, but are derived externally from a medium-sized treebank by exploiting its lemmatic, morphological and, hopefully, also syntactic annotation layers. These linguistic variables will be visualized as attributes of the nodes in the trimodal “social” network which consists of the documents, persons, and places that underlie the treebank. These documents, persons, and places are encoded as metadata in the treebank. The nodes are connected to each other by unweighted edges. The number of document nodes is 1,040, scribe nodes 220, and writing-place nodes 84. In most cases, the definition of the 220 scribe nodes is straightforward, given that the scribes scrupulously signed what they wrote; only eight documents are exceptions. The place nodes are more challenging: although 78% of the documents were written in the city of Lucca, the disambiguation and re-grouping of small localities of which little is known was time-consuming, and the results were not always fully satisfying. The nodes will be set on the map background by utilizing Gephi’s Geo Layout and Force Atlas 2 algorithms. The linguistic features that will be visualized reflect the language change that took place in late Latin and early medieval Latin, roughly the 3rd to 9th centuries AD.
The features are operationalized as variables which quantify the variation of those features in the treebank. This quantification is based on the numerical output of a plethora of corpus-linguistic queries which extract from the treebank all constructions or forms that meet the relevant criteria. The variables indicate the relative frequency of the examined features in each document, scribe, and writing place. For the scribes and writing places, the percentages are calculated by counting the occurrences within all the documents written by that scribe or in that place, respectively. The resulting linguistic variables are continuous, hence the practicality of the gradient colouring. In order to ground the colouring in the statistical dispersion of the variable values and to conserve maximal visual effect, I customize the Gephi default red-yellow-blue palette so that the maximal yellow, which stands for the middle of the colour scale, marks the mean of the distribution of each variable. Likewise, the thresholds of the maximal red and maximal blue are set equally far from the mean; I chose that distance to be two standard deviations. In this way, only around 2.5% of the nodes with the lowest and highest values at each end of the distribution are maximally saturated with red or blue, while the rest, around 95%, of the nodes feature a gradient colour, including the maximal yellow in between. Following this rule, I will illustrate the variables both separately and as a sum variable. The images will be available on the poster. The sum variable will be calculated by aggregating the standardized simple variables.

The preliminary conclusions include the observation that network visualization, as such, is not a sufficient basis for philological or historical-linguistic argumentation, but if used along with a statistical approach, it can support argumentation by drawing attention to unexpected patterns and, on the other hand, to irregularities. However, it is the geographical layout of the graphs that adds the most value compared with traditional approaches: it helps in perceiving patterns that would otherwise have gone unnoticed.

The treebank on which the analyses are based is the Late Latin Charter Treebank (version 2, LLCT2), which consists of 1,040 early medieval Latin documentary texts (c. 480,000 words). The documents were written in historical Tuscia (Tuscany), Italy, between AD 714 and 897, and are mainly sale or purchase contracts or donations, accompanied by a few judgements as well as lists and memoranda. LLCT2 is still under construction, and only its first half is so far provided with the syntactically annotated layer, making that half a treebank proper (i.e. LLCT, version 1). The lemmatization and morphological annotation are based on the Ancient Greek and Latin Dependency Treebank (AGLDT) style, which can be deduced from the Guidelines for the Syntactic Annotation of Latin Treebanks. Korkiakangas & Passarotti (2011) define a number of additions and modifications to these general guidelines, which were designed for Classical Latin. For a more detailed description of LLCT2 and the underlying text editions, see Korkiakangas (in press). Documents are privileged material for examining the spoken/written interface of early medieval Latin, in which the distance between the spoken and written codes had grown considerable by Late Antiquity. The LLCT2 documents have precise dating and location metadata, and they survive as originals.
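As a sketch of the scale anchoring just described (an assumed helper outside Gephi, for illustration only):

import statistics

def colour_positions(values):
    """Map values to [0, 1] so the mean lands mid-scale (maximal
    yellow) and values beyond two standard deviations saturate at
    the extremes (maximal red at 0.0, maximal blue at 1.0)."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    lo, hi = mu - 2 * sd, mu + 2 * sd
    return [min(max((v - lo) / (hi - lo), 0.0), 1.0) for v in values]

def sum_variable(columns):
    """Aggregate standardized (z-scored) simple variables, as for
    the sum variable mentioned above."""
    zs = []
    for col in columns:
        mu, sd = statistics.mean(col), statistics.stdev(col)
        zs.append([(v - mu) / sd for v in col])
    return [sum(row) for row in zip(*zs)]

For an approximately normal variable, this anchoring leaves roughly 2.5% of nodes fully red and 2.5% fully blue, matching the figures quoted above.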
Bibliography Adams, J.N. Social Variation and the Latin Language. Cambridge University Press (Cambridge), 2013. Araújo, T. and Banisch, S. Multidimensional Analysis of Linguistic Networks. In Mehler, A., Lücking, A., Banisch, S., Blanchard, P. and Job, B. (eds) Towards a Theoretical Framework for Analyzing Complex Linguistic Networks. Springer (Berlin, Heidelberg), 2016, 107-131. Bamman, D., Passarotti, M., Crane, G. and Raynaud, S. Guidelines for the Syntactic Annotation of Latin Treebanks (v. 1.3), 2007. http://nlp.perseus.tufts.edu/syntax/treebank/ldt/1.5/docs/guidelines.pdf. Barzel, B. and Barabási, A.-L. Universality in network dynamics. Nature Physics. 2013;9:673-681. Bergs, A. Social Networks and Historical Sociolinguistics: Studies in Morphosyntactic Variation in the Paston Letters. Walter de Gruyter (Berlin), 2005. Ferrer i Cancho, R. Network theory. In Hogan, P.C. (ed.) The Cambridge Encyclopedia of the Language Sciences. Cambridge University Press (Cambridge), 2010, 555–557. Korkiakangas, T. (in press) Spelling Variation in Historical Text Corpora: The Case of Early Medieval Documentary Latin. Digital Scholarship in the Humanities. Korkiakangas, T. and Lassila, M. Abbreviations, fragmentary words, formulaic language: treebanking medieval charter material. In Mambrini, F., Sporleder, C. and Passarotti, M. (eds) Proceedings of the Third Workshop on Annotation of Corpora for Research in the Humanities (ACRH-3), Sofia, December 13, 2013. Bulgarian Academy of Sciences (Sofia), 2013, 61-72. Korkiakangas, T. and Passarotti, M. Challenges in Annotating Medieval Latin Charters. Journal of Language Technology and Computational Linguistics. 2011;26(2):103-114. Poster [abstract]
Approaching a digital scholarly edition through metadata Svenska litteratursällskapet i Finland r.f. This poster presents a flowchart with an overview of the database structure in the digital critical edition of Zacharias Topelius Skrifter (ZTS). It shows how the entity relations make it possible for the user to approach the edition from other angles than the texts, using informative metadata through indexing systems. Through this data, a historian can easily capture, for example, events, meetings between people or editions of books, as they are presented in Zacharias Topelius' (1818–1898) texts. Presented here are both already available features and features in progress. ZTS comprises eight digital volumes to date, the first published in 2010. This includes the equivalent of about 8 500 pages of text by Topelius, 600 pages of introduction by the editors and 13 000 annotations. The published volumes cover poetry, short stories, correspondence, children's textbooks, historical-geographical works and university lectures on history and geography. The edition is freely accessible at topelius.sls.fi. Genres still to be published include children's books, novels, journalism, academic texts, diaries and religious texts. DATABASE STRUCTURE The ZTS database structure consists of six connected databases: people, places, bibliography, manuscripts, letters and a chronology. So far, the people database contains about 10 000 unique persons, with the possibility of linking them to a family or group level (250 records). It has separate sections for mythological persons (500 records) and fictive characters (250 records). The geographic database has 6 000 registered places. The bibliographic database has 6 000 editions distributed over 3 500 different works, and the manuscript database has 1 400 texts on 350 physical manuscripts. The letter database has 4 000 registered letters to and from Topelius, distributed over 2 000 correspondences. The chronology of Topelius' life has 7 000 marked events. The indexing of objects started in 2005, using the FileMaker system. New records are continuously added, and work on finding new ways to use, link and present the data is in constant progress. Users can freely access the information in database records that link to the published volumes. The bibliographic database is the most complex one. Its structure follows the Functional Requirements for Bibliographic Records (FRBR) model, which means that we distinguish between the abstract work and the published manifestations (editions) of that work. The FRBR focuses on the content relationship and continuum between the levels; anything regarded as a separate work starts as a new abstract record, from which its own editions are created. Within ZTS, the abstract level has a practical significance in cases where it is impossible to determine which exact edition Topelius is referring to. Also taken into consideration is that, for example, articles and short stories can have their own independent editions as well as being included in other editions (e.g. a magazine, an anthology). This requires two different manifestation levels subordinate to the abstract level: the regular editions and the texts included in other editions, where records of the latter type must always link to records of the former. The manuscript database has a content relationship to the bibliographic database through the abstract entity of a work.
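To make the two manifestation levels concrete, here is a minimal R sketch of the FRBR-style one-to-many relations just described. The field names and example titles are hypothetical illustrations, not the actual FileMaker schema.

# Abstract works and their manifestations; each edition links to
# exactly one abstract work (one-to-many).
works <- data.frame(
  work_id = c("W1", "W2"),
  title   = c("Fältskärns berättelser", "En resa i Finland")
)
editions <- data.frame(
  edition_id = c("E1", "E2", "E3"),
  work_id    = c("W1", "W1", "W2"),
  year       = c(1853, 1884, 1872)
)
# Texts included in other editions form the second manifestation level
# and must always link to a regular edition (the carrying manifestation).
included_texts <- data.frame(
  text_id    = "T1",
  work_id    = "W2",
  edition_id = "E3"
)

# All manifestations of a work, retrieved via the abstract level:
merge(works, editions, by = "work_id")

A query for a work thus retrieves all of its manifestations through a single join on the abstract-level identifier.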
A manuscript text can be regarded as an independent edition of a work in this context (a manuscript that was never published can easily have a future edition added in the bibliographic database). The manuscript text itself might share physical paper with another manuscript text. Therefore, the description of the physical manuscript is created on a separate level in the manuscript database, to which the manuscript text is connected. The letter database follows the FRBR model: an upper level presents the whole correspondence between Topelius and another person, and a subordinate level describes each physical letter within the correspondence. It is possible to attach additional corresponding persons to occasional letters. The people database connects to the letter database and the bibliographic database, creating one-to-many relationships. Any writer or author has to be in the people database in order to have their information inserted into these two databases. Within the people database there is also a family or group level, where family members can be grouped, but unlike in the letter database, this is not a superordinate level. The geographic database follows a one-level structure. Places in letters and manuscripts can be linked from the geographic database. The chronology database contains manually added key events from Topelius' life, as well as short diary entries made by him in various calendars during his life. It also has automatically gathered records from other databases, based on marked dates when Topelius' works were published or when he wrote a letter or a manuscript. The dates of birth and/or death of family members and close friends can be linked from the people database. POSSIBILITIES FOR THE USER Approaching a digital scholarly edition of over 8 500 pages can be a daunting task, and many will likely use the edition more as an object of study than as texts to read. For a user not familiar with the content of the different volumes but still looking for specific information, advanced searches and indexing systems offer a faster path into the relevant text passages. The information in the ZTS database records provides a picture of Finland in the 19th century as it appears in Topelius' works and life. A future feature for users is access to this data through an API (Application Programming Interface). This will create opportunities for users to take advantage of the data in any way they want: to create a 19th-century bookshelf, an app for the most popular 19th-century names or a map of popular student hangouts in 1830s Helsinki. Through the indexes formed by the linked data from the texts, the user can find all the occurrences of a person, a place or a book in the whole edition. One record can build a set of ontological relations, and the user can follow a theme while moving between texts. A search for a person will provide the user with information about where Topelius mentions this person, whether in a letter, in his diaries or in a textbook for schoolchildren, or whether he possibly meets or interacts with the person. Furthermore, the user can see if this person was the author, publisher or perhaps translator of a book mentioned by Topelius in his texts, or if the editors of ZTS have used the book as a source for editorial comments. The user will also be able to get a list of letters the person wrote to or received from Topelius.
The geographic index can help the user create a geographic ontology with an overview of Topelius' whereabouts through the annotated mentions of places in Topelius' diaries, letters and manuscripts. The chronology creates a base for a timeline that will not only give the user key events from Topelius' life but also link to the other database records. Encoded dates in the XML files (letters, diaries, lectures, manuscripts etc.) can lead the user directly to the relevant text passages. The relation between the bibliographic database and the manuscript database creates a complete bibliography of everything Topelius wrote, including all known manuscripts and editions that relate to a specific work. So far, there are 900 registered independent works by Topelius in the bibliographic database; these works are realized in 300 published editions (manifestations) and 2 900 text versions included in those manifestations or in other independent manifestations. The manuscript database consists of 1 400 manuscript texts. The FRBR model offers different ways of structuring the layout of a bibliography according to the user's needs, either through the titles of the abstract works with subordinate manifestations, or directly through the separate manifestations. The bibliography can be limited to show only editions published during Topelius' lifetime, or to include later editions as well. Furthermore, the bibliography points the user to the published texts and manuscripts of a specific work in the ZTS edition and to text passages where the author himself discusses the work in question. The level of detail in the records is high. For example, we register different name forms and spellings (Warschau vs Warszawa). Such information is included in the index search function and thereby eliminates problems for the end user trying to find information. Topelius often uses many different forms and abbreviations, and performing an advanced search in the texts would seldom give a comprehensive result in these cases. The letter database includes reference words describing the contents of the correspondences. Thus, the possibilities for searching in the material are expanded beyond the wordings of the original texts. Poster [publication ready]
A Tool for Exploring Large Amounts of Found Audio Data KTH Royal Institute of Technology We demonstrate a method and a set of open-source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration will present first versions of a varied set of functionalities and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently. Poster [publication ready]
The PARTHENOS Infrastructure PIN SCrl PARTHENOS is built around two ERICs from the Humanities and Arts sector, DARIAH and CLARIN, along with ARIADNE, EHRI, CENDARI, CHARISMA and IPERION-CH, and will deliver guidelines, standards, methods, pooled services and tools to be used by its partners and the whole research community. Four broad research communities are addressed: History; Linguistic Studies; Archaeology, Heritage and Applied Disciplines; and the Social Sciences. By identifying their common needs, PARTHENOS will support cross-disciplinary research and provide innovative solutions. By applying the FAIR data principles to structure the work on common policies and standards, the project has produced tools that assist researchers in finding and applying the appropriate ones for their areas of interest. A virtual research environment will enable the discovery and use of data and tools, and further support is provided by a set of online training modules. Poster [abstract]
Using rolling.classify on the Sagas of Icelanders: Collaborative Authorship in Bjarnar saga Hítdælakappa Russian Academy of Sciences, Institute of Slavonic Studies This poster will present the results of applying the rolling.classify function of the Stylo package for R to a source of unknown authorship and extremely poor textual history: Bjarnar saga Hítdælakappa, one of the medieval Sagas of Icelanders. This case study sets aside the authorship-attribution goal usual for Stylo and concentrates on the composition of the main witness of Bjarnar saga, ms. AM 551 d α, 4to (17th c.), which was the source for most of the existing copies of Bjarnar saga. It aims not only to find and visualise new arguments for the working hypothesis about the composition of AM 551 d α, 4to, but also to touch upon the main questions that arise for a student of philology daring to use Stylo on Old Icelandic saga material, i.e. what Stylo tells us, what it does not, and how one can use it while exploring the history of a text that exists in only one source. It has been noticed that Bjarnar saga shows signs of a stylistic change between the first 10 chapters and the rest of the saga: the characters suddenly change their behaviour (Sigurður Nordal 1938, lxxix; Andersson 1967, 137-140), and the narrative becomes less coherent and, as it seems, acquires a new logic of construction (Finlay 1990-1993, 165-171). A more detailed narrative analysis of the saga showed that there is a difference in the usage of some narrative techniques between the first and the second parts, for example in the narrator's handling of point of view and the extent of narratorial intervention in the saga text (Glebova 2017, 45-57). Thus, the question is: what is the relationship between the first 10 chapters and the rest of Bjarnar saga? Is the change entirely compositional and motivated by the narrative strategy of the medieval compiler, or is it actually the result of a compilation of two texts by two different authors? As often happens with sagas, the problem is aggravated by the poor preservation of Bjarnar saga. There is not much to compare and work with; most of the saga's witnesses are copies of one 17th-c. manuscript, AM 551 d α, 4to (Boer 1893, xii-xiv; Sigurður Nordal 1938, xcv-xcvii; Simon 1966 (I), 19-149). This manuscript also has its flaws, as it contains two lacunae, one at the very beginning of the saga (ch. 1-5,5 in ÍF III) and another in the middle (between ch. 14-15 in ÍF III). The second lacuna is unreconstructable, while the first is usually filled with a fragment from the saga's short redaction, preserved in copies of a 15th-c. kings' saga compilation, the Separate Saga of St Olaf in Bœjarbók (Finlay 2000, xlvi), and that fragment actually ends right at the 10th chapter of the longer version. It seems that the text of the shorter version is a variant of the longer one (Glebova 2017, 13-17), and it contains a remark that there was more to the story but it was shortened; the precise relationship between the short and long redactions, however, is impossible to reconstruct due to the lacuna in AM 551 d α, 4to. The existence of the short version with this particular length and content is very important for the study of the composition of Bjarnar saga in AM 551 d α, 4to, as it raises the possibility that the first 10 chapters of AM 551 d α, 4to existed separately at some point in the textual history of Bjarnar saga, or at least that these chapters were seen by the medieval compilers as something solid and complete.
This would be the last word of traditional philology concerning this case: the state of the sources does not allow saying more. Thus, is there anything else that could shed light on the question of whether these chapters existed separately or were written by the same hand? In this study it was decided to try the sequential stylometric analysis available in the Stylo package for R (Eder, Kestemont, Rybicki 2013) as the rolling.classify function (Eder 2015). As we are interested in different parts of the same text, rolling stylometry seems preferable to cluster analysis, which takes the whole text as one entity and compares it to the reference corpus; in rolling stylometry, by contrast, the text is divided into smaller segments, which allows a deeper investigation of the stylistic variation within the text itself (Rybicki, Eder, Hoover 2016, 126). For the analysis, a corpus was compiled from the two parts of Bjarnar saga and several other Old Icelandic sagas; the whole corpus was taken from sagadb.org in normalised Modern Icelandic orthography. Several tests were conducted, first with one of the parts as the test set and then with the other, with sample sizes from 5,000 words down to 2,000. The preliminary results show that there is a stylistic division in the saga, as the style of the first part is not present in the second one and vice versa. This would be an additional argument for the idea that the first 10 chapters existed separately and were added by the compiler during the construction of the saga. One could argue that this is not an authorial but a generic division, as the first part is set in Norway and deals extensively with St. Olaf; the change of genre could result in a change of style. However, Stylo counts the most frequent words, which are not genre-specific (og, að, etc.); thus, collaborative authorship could still have taken place. This would be an important result in the context of the overall composition of the longer version of Bjarnar saga, as its structure shows traces of very careful planning and also of mirror composition (Glebova 2017, 18-33): could it be that the structure of one of the parts (maybe the first one) influenced the other? Whatever the case, while sewing together the existing material, the medieval compiler made an effort to create a solid text, and this effort is worth studying with more attention.
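For reference, here is a minimal sketch of such a rolling.classify run, following the general usage documented for the Stylo package; the parameter values are illustrative, and the corpus is assumed to have been prepared in the reference_set and test_set directories the package expects.

library(stylo)

# The training texts (the candidate 'styles') go in ./reference_set and
# the text to be sliced (here, Bjarnar saga) in ./test_set.
rolling.classify(
  classification.method = "svm",              # alternatives: "delta", "nsc"
  mfw = 100,                                  # most frequent words as features
  slice.size = 5000,                          # words per segment
  slice.overlap = 4500,                       # i.e. a step of 500 words
  training.set.sampling = "normal.sampling",
  write.png.file = TRUE                       # saves the rolling plot
)

Each slice is assigned to the stylistically closest training class, and the resulting plot shows where along the saga the attribution changes.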
Bibliography: Andersson, Theodor M. (1967). The Icelandic Family Saga: An Analytic Reading. Cambridge, MA. Boer, Richard C. (1893). Bjarnar saga Hítdælakappa. Halle. Eder, M. (2015). “Rolling Stylometry.” Digital Scholarship in the Humanities 31(3): 457–469. Eder, M., Kestemont, M. and Rybicki, J. (2013). “Stylometry with R: A Suite of Tools.” Digital Humanities 2013: Conference Abstracts. University of Nebraska–Lincoln: 487–489. Finlay, A. (1990-1993). “Nið, Adultery and Feud in Bjarnar saga Hítdælakappa.” Saga-Book of the Viking Society 23: 158-178. Finlay, A. (2000). The Saga of Bjorn, Champion of the Men of Hitardale. Enfield Lock. Glebova, D. (2017). A Case of an Odd Saga: Structure in Bjarnar saga Hítdælakappa. MA thesis, University of Iceland, Reykjavík (http://hdl.handle.net/1946/27130). Rybicki, J., Eder, M. and Hoover, David L. (2016). “Computational Stylistics and Text Analysis.” In Doing Digital Humanities: Practice, Training, Research, edited by Constance Crompton, Richard J. Lane and Ray Siemens. London, New York: 123-144. Sigurður Nordal and Guðni Jónsson (eds.) (1938). “Bjarnar saga Hítdælakappa.” In Borgfirðinga sögur, Íslenzk fornrit 3, 111-211. Reykjavík. Simon, John LeC. (1966). A Critical Edition of Bjarnar saga Hítdælakappa. Vol. 1-2. Unpublished PhD thesis, University of London. Poster [abstract]
The Bank of Finnish Terminology in Arts and Sciences – a new form of academic collaboration and publishing University of Helsinki This presentation concerns the multidisciplinary research infrastructure project "Bank of Finnish Terminology in Arts and Sciences" (BFT) as an innovative form of academic collaboration and publishing. The BFT, which was launched in 2012, aims to build a permanent and continuously updated terminological database for all fields of research in Finland. Content for the BFT is created by niche-sourcing, where participation is limited to a particular group of experts in the participating subject fields. The project maintains a wiki-based website which offers an open and collaborative platform for terminological work and a discussion forum available to all registered users. The BFT thus opens up not only the results but the whole academic procedure, in which knowledge is constantly produced, evaluated, discussed and updated in an ongoing process. The BFT also provides an inclusive arena for all interested people – students, journalists, translators and enthusiasts – to participate in the discussions relating to concepts and terms in Finnish research. Based on the knowledge and experience accumulated during the BFT project, we will reflect on the benefits, challenges and future prospects of this innovative and globally unique approach. Furthermore, we will consider the possibilities and opportunities opening up especially in terms of digital humanities. Poster [publication ready]
The Swedish Language Bank 2018: Research Resources for Text, Speech, & Society 1University of Gothenburg; 2KTH Royal Institute of Technology; 3The Institute for Language and Folklore We present an expanded version of the Swedish research resource the Swedish Language Bank. The Language Bank, which has supported national and international research for over four decades, will now add two branches, one focusing on speech and one on societal aspects of language, to its existing organization, which targets text. Poster [abstract]
Handwritten Text Recognition and 19th Century Court Records National Archives of Finland This paper will demonstrate how the READ project is developing new technologies that allow computers to automatically process and search handwritten historical documents. These technologies are brought together in the Transkribus platform, which can be downloaded free of charge at https://transkribus.eu/Transkribus/. Transkribus enables scholars with no in-depth technological knowledge to freely access and exploit algorithms which can automatically process handwritten text. Although there is already a rather sound workflow in place, the platform needs human input in order to ensure the quality of the recognition. The technology must be trained by being shown examples of images of documents and their accurate transcriptions. This helps it to learn the patterns which make up characters and words. The training data is used to create a Handwritten Text Recognition model which is specific to a particular collection of documents. The more training data there is, the more accurate the Handwritten Text Recognition can become. Once a Handwritten Text Recognition model has been created, it can be applied to other pages from the same collection of documents. The machine analyses the image of the handwriting and then produces textual information about the words and their position on the page, providing best guesses and alternative suggestions for each word, with measures of confidence. This process allows Transkribus to provide automatic transcription and full-text search of a document collection at high levels of accuracy. For the quality of the text recognition, the amount of training material is paramount. Current tests suggest that models for a specific style of handwriting can reach a Character Error Rate of less than 5%. Transcripts with a Character Error Rate of 10% or below can generally be understood by humans and used for adequate keyword searches. A low Character Error Rate also makes it relatively quick and easy for human transcribers to correct the output of the Handwritten Text Recognition engine. These corrections can then be fed back into the model in order to make it more accurate. These levels also compare favorably with Optical Character Recognition, where 95-98% accuracy is possible for early prints. The case study of this paper is Finnish court records from the 19th century. The notification records, which contain cases concerning guardianships, titles and marriage settlements, form an enormous collection of over 600 000 pages. Although the material is in digital form, its usability is still poor due to the lack of indices or finding aids. With the help of Handwritten Text Recognition, the National Archives have the chance to provide the material in computer-readable form, which allows users to search and use the records in a whole new way.
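To make the accuracy figures concrete, here is a minimal R sketch computing a Character Error Rate as the Levenshtein distance between a recognised line and its ground-truth transcript, normalised by the transcript length; the example strings are invented.

# Character Error Rate: edit distance between the recognised text and
# a ground-truth transcript, divided by the transcript length.
cer <- function(recognised, truth) {
  drop(adist(recognised, truth)) / nchar(truth)
}

truth      <- "anmäldes förmyndarskap för omyndiga barnen"   # invented example
recognised <- "anmaldes förmyndarskap for omyndiga barnen"
cer(recognised, truth)   # two substitutions in 42 characters, i.e. ~0.05

Poster [publication ready]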
An approach to unsupervised ontology term tagging of dependency-parsed text using a Self-Organizing Map (SOM) University of Helsinki Tagging ontology-based terms on existing text content is a task that often requires human effort. Each ontology may have its own structure and schema for describing terms, which makes automation non-trivial. I suggest a machine-learning estimation technique for term tagging which can learn semantic tagging from a set of sample ontologies with given textual examples, and extend its use to analyzing a large text corpus by comparing the syntactic features found in the text. The tagging technique is based on dependency-parsed text input and an unsupervised machine learning model, the Self-Organizing Map (SOM).
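As a rough illustration of the idea, the following R sketch, using the kohonen package, trains a SOM on vectors of syntactic features and maps unseen fragments onto their best-matching units; the random feature matrix is a placeholder, and the extraction of features from dependency parses is not shown.

library(kohonen)

set.seed(42)
# Placeholder: rows = text fragments, columns = syntactic features,
# e.g. relative frequencies of dependency relations.
features <- matrix(runif(300 * 10), nrow = 300, ncol = 10)

grid <- somgrid(xdim = 6, ydim = 6, topo = "hexagonal")
model <- som(scale(features), grid = grid, rlen = 200)

# Unseen fragments are assigned to their best-matching units; ontology
# terms associated with a unit can then be proposed as candidate tags.
# (A full implementation would reuse the training data's scaling.)
new_features <- matrix(runif(5 * 10), nrow = 5, ncol = 10)
map(model, scale(new_features))$unit.classif

Poster [abstract]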
Comparing Topic Model Stability Between Finnish, Swedish and French University of Helsinki 1 Abstract In recent years, topic modelling has gained increasing attention in the humanities. Unfortunately, little has been done to determine whether the output produced by this range of probabilistic algorithms reveals signal or merely produces noise, or how well it performs on languages other than English. In this paper, we set out to compare topic models of parallel corpora in Finnish, Swedish, and French, and propose a method to determine how well topic modelling algorithms perform on those languages. 2 Context Topic modelling (TM) is a well-known (following the work of (4; 5)) yet badly understood range of algorithms within the humanities. While a variety of studies within the humanities make use of topic models to answer historical questions (see (2) for a thorough survey), there is no tried and true method to ascertain that the probabilistic algorithm reveals signal and is not merely responding to noise. The rule of thumb is generally that if the results are interesting and confirm a prior intuition of a domain expert, they are considered correct -- in the sense that they are a valid entry point into an enormous dataset, and that the proper work of historical research is then to be carried out manually on a subset selected by the algorithm. As pointed out in previous work (7; 3), this, combined with the fact that many humanistic corpora are on the small side, means that "the threshold for the utility of topic modelling across DH projects is as yet highly unclear." Similarly, topic instability "may lead to research being based on incorrect foundational assumptions regarding the presence or clustering of conceptual fields on a body of work or source material" (3). Whilst topic modelling techniques are considered language-independent, i.e. they "use[] no manually constructed dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphologies, or the like" (6), they encode key assumptions about the statistical properties of language. These assumptions are often developed with English in mind and generalised to other languages without much consideration. We maintain that these algorithms are not language-independent, but language-agnostic at best, and that accounting for discrepancies in how different languages are processed by the same algorithms is necessary basic research for more applied, context-oriented research -- especially for the historical development of public discourses in multilingual societies, or for phenomena where structures of discourse flow over language borders. Indeed, some languages rely heavily on compounding -- the creation of a word through the combination of two or more stems -- in word formation, while others use determiners to combine simple words. If one considers white space as the delimitation between words (as is usually done with languages using the Latin alphabet), the first tendency results in a richer vocabulary than the second, hence influencing TM algorithms that follow the bag-of-words approach. Similarly, differences in grammar -- for example, French adjectives must agree in gender and number with the noun they modify, something that does not exist in languages like English -- reinforce those discrepancies.
Nonetheless, most of this happens in the fuzzy and non-standard preprocessing stage of topic modelling, and the argument could be made that the language neutrality of TM algorithms rests more on their being underspecified with regard to how to pre-process the language. In this paper, we propose to compare topic models on a custom-made parallel corpus in Finnish, Swedish, and French. By selecting those languages, we get a glimpse of how a selection of different languages is processed by TM algorithms. While concentrating on languages spoken in Europe and languages of interest to our collaborative network of linguists, historians and computer scientists, we are still able to examine two crucial variables: one of genetic and one of cultural relatedness. French and Swedish belong to Indo-European (the Romance and Germanic branches, respectively), while Finnish is a Finno-Ugrian language. Finnish and Swedish, on the other hand, share a long history of close language contact and cultural convergence. Because of this, Finnish contains a large number of Swedish loan words and, perceivably, similar conceptual systems. 3 Methodology To explore our hypothesis, we use a parallel corpus of born-digital textual data in Finnish, Swedish, and French. Once the corpus is constituted, it becomes possible to apply LDA (1) and HDP (9) -- LDA is parametrised by humans, whereas HDP attempts to automatically determine the best configuration possible. The resulting models for each language are stored, the corpora reduced in size, LDA is re-applied, the models are stored, the corpora further reduced, etc. Topic models are compared manually between languages at each stage, and programmatically between stages, using the Jaccard index (8), for all languages. The same workflow is then applied to the lemmatised versions of the above-mentioned corpora, and the results compared. Bibliography [1] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003) [2] Brauer, R., Fridlund, M.: Historicizing topic models, a distant reading of topic modeling texts within historical studies. In: International Conference on Cultural Research in the context of "Digital Humanities", St. Petersburg: Russian State Herzen University (2013) [3] Hengchen, S., O'Connor, A., Munnelly, G., Edmond, J.: Comparing topic model stability across language and size. In: Proceedings of the Japanese Association for Digital Humanities Conference 2016 (2016) [4] Jockers, M.L.: Macroanalysis: Digital methods and literary history. University of Illinois Press (2013) [5] Jockers, M.L., Mimno, D.: Significant themes in 19th-century literature. Poetics 41(6), 750–769 (2013) [6] Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2-3), 259–284 (1998) [7] Munnelly, G., O'Connor, A., Edmond, J., Lawless, S.: Finding meaning in the chaos (2015) [8] Real, R., Vargas, J.M.: The probabilistic basis of Jaccard's index of similarity. Systematic Biology 45(3), 380–385 (1996) [9] Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476), 1566–1581 (2006)
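As a crude sketch of the programmatic comparison step described in the methodology above, the following R code, using the topicmodels and tm packages, fits LDA twice on an invented mini-corpus and compares the topics' top-word sets with the Jaccard index; the corpus, seeds and sizes are placeholders only.

library(tm)
library(topicmodels)

docs <- c("kuningas ja kansa", "folk och kung", "roi et peuple",
          "kansa puhuu", "folket talar", "le peuple parle")
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Two runs standing in for models from different stages or languages.
m1 <- LDA(dtm, k = 2, control = list(seed = 1))
m2 <- LDA(dtm, k = 2, control = list(seed = 2))

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise Jaccard similarity of the topics' top-3 term sets.
t1 <- terms(m1, 3)
t2 <- terms(m2, 3)
outer(seq_len(ncol(t1)), seq_len(ncol(t2)),
      Vectorize(function(i, j) jaccard(t1[, i], t2[, j])))

Poster [abstract]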
ARKWORK: Archaeological practices and knowledge in the digital environment 1Uppsala University; 2University of Helsinki; 3University of Toronto; 4Vilnius University; 5University of Glasgow; 6University of Venice; 7Umeå University; 8University of Copenhagen; 9Independent researcher Archaeology and material cultural heritage have often enjoyed a particular status as a form of heritage that has captured the public imagination. As researchers from many backgrounds have discussed, it has become the locus for the expression and negotiation of European, local, regional, national and intra-national cultural identities, for public policy regarding the preservation and management of cultural resources, and for societal value in the context of education, tourism, leisure and well-being. The material presence of objects and structures in European cities and landscapes, the range of archaeological collections in museums around the world, the monumentality of the major archaeological sites, and the popular and non-professional interest in the material past are only a few of the reasons why archaeology has become a linchpin in the discussions of how emerging digital technologies and digitization can be leveraged for societal benefit. However, at a time when nations and the European community are making considerable investments in creating technologies, infrastructures and standards for the digitization, preservation and dissemination of archaeological knowledge, critical understanding of the means and practices of knowledge production in and about archaeology, from complementary disciplinary perspectives and across European countries, remains fragmentary and in urgent need of concertation. In contrast to the rapid development of digital infrastructures and tools for archaeological work, relatively little is known about how digital information, tools and infrastructures are used by archaeologists and other users and producers of archaeological information, such as archaeological and museum volunteers, avocational hobbyists, and others. Digital technologies (infrastructures, methods and resources) are reconfiguring aspects of archaeology across and beyond the lifecycle (i.e., also "in the wild"), from archaeological data capture in fieldwork to scholarly publication and community access/entanglement. Both archaeologists and researchers in other fields, from disciplines such as museum studies, ethnology, anthropology, information studies and science and technology studies, have conducted research on the topic, but so far their efforts have tended to be somewhat fragmented and anecdotal. This is surprising, as the need for a better understanding of archaeological practices and knowledge work has been identified for many years as a major impediment to realizing the potential of infrastructural and tool-related developments in archaeology. The shifts in archaeological practice, and in how digital technology is used for archaeological purposes, call for a radically transdisciplinary (if not interdisciplinary) approach that brings together perspectives from reflexive, theoretically and methodologically aware archaeology, information research, and sociological, anthropological and organizational studies of practice.
This poster presents the COST Action "Archaeological practices and knowledge work in the digital environment" (http://www.cost.eu/COST_Actions/ca/CA15201 - ARKWORK), an EU-funded network which brings together researchers, practitioners, and research projects studying archaeological practices, knowledge production and use, and the social impact and industrial potential of archaeological knowledge, in order to present and highlight the ongoing work on the topic around Europe. ARKWORK (https://www.arkwork.eu/) consists of four Working Groups (WGs), with the common objective of exploring how the understanding of archaeological knowledge production can be applied to tackle ongoing societal challenges and to develop appropriate management and leadership structures for archaeological heritage. The individual WGs have the following specific but complementary themes and objectives: WG1 - Archaeological fieldwork. Objectives: to bring together and develop the international transdisciplinary state of the art of current multidisciplinary research on archaeological fieldwork: how archaeologists conduct fieldwork and document their work and findings in different countries and contexts, and how this knowledge can be used to contribute to developing fieldwork practices and the use and usability of archaeological documentation by the different stakeholder groups in society. WG2 - Knowledge production and archaeological collections. Objectives: to integrate and push forward the current state of the art in understanding and facilitating the use and curation of (museum) collections and repositories of archaeological data for knowledge production in society. WG3 - Archaeological knowledge production and global communities. Objectives: to bring together and develop the current state of the art on global communities (including indigenous communities, amateurs, the neo-pagan movement, geographical and ideological identity networks, etc.) as producers and users in archaeological knowledge production, e.g. in terms of highlighting community needs, approaches to the communication of archaeological heritage, crowdsourcing and volunteer participation. WG4 - Archaeological scholarship. Objectives: to integrate and push forward the current state of the art in the study of archaeological scholarship, including academic, professional and citizen-science-based scientific and scholarly work. In our poster we outline each of the working groups and provide a clear overview of the purposes and aspirations of the COST Action network ARKWORK. Poster [publication ready]
Research and development efforts on the digitized historical newspaper and journal collection of The National Library of Finland University of Helsinki, Finland The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12 million pages, mainly in Finnish and Swedish. Of these, about 5.1 million pages are freely available on the website digi.kansalliskirjasto.fi (Digi). The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1920; the last ten years, 1911–1920, were opened in February 2017. The digitized collection of the NLF is part of a globally expanding network of library-produced historical data that offers researchers and laypersons insight into the past. In 2012 it was estimated that there were about 129 million pages and 24 000 titles of digitized newspapers in Europe [1]. A very conservative estimate of the worldwide number of titles is 45 000 [2]. The current amount of available data is probably already much bigger, as national libraries have been working steadily on digitization in Europe, North America and the rest of the world. This paper presents work that has been carried out at the NLF on the historical newspaper and journal collection. We offer an overall account of research and development related to the data. Poster [abstract]
Medieval Publishing from c. 1000 to 1500 University of Helsinki Medieval Publishing from c. 1000 to 1500 (MedPub) is a five-year project funded by the European Research Council, based at the University of Helsinki, and running from 2017 to 2022. The project seeks to define the medieval act of publishing, focusing on Latin authors active during the period from c. 1000 to 1500. Part of the project is to establish a database of networks of publishing. The proposed paper will discuss the main aspects of the projected database and the process of data-gathering. MedPub's research hypothesis is that publication strategies were not a constant but were liable to change, and that different social, literary, institutional, and technical milieux fostered different approaches to publishing. As we have already proved this proposition, the project is now advancing toward the next step, the ultimate aim of which is to complement the perception of societal and cultural changes that took place during the period from c. 1000 to 1500. For the purposes of that undertaking, we define 'publishing' as a social act, involving at least two parties, an author and an audience, not necessarily always brought together. The former prepares a literary work and then makes it available to the latter. Medieval publishing was, however, often a more complex process. It could engage more parties than these two, such as commentators, dedicatees, and commissioners. The social status of these networks ranged from modest to grand: they could consist of otherwise unknown monks, or they could include popes and emperors. We propose that the composition of such literary networks was broadly reactive to large-scale societal and cultural changes. If so, networks of publishing can serve as a vantage point for observing continuity and change in medieval societies. We shall collect and analyse an abundance of data on publishing networks in order to trace how their composition in various contexts may reflect the wider world. It is that last-mentioned aspect that is the subject of this proposal. It is a central fact for this undertaking that medieval works very often include information on dedication, commission, and commendation, and that, more often than not, this evidence is uncomplicated to collect, because the statements in question tend to be short and uniform and normally appear in the prefaces and dedicatory letters with which medieval authors often opened their works. What is more, such accounts manifestly indicate a bond between two or more parties. By virtue of these features, the evidence in question can be collected in the quantities needed for large-scale statistical analysis and processed electronically. The function and form of medieval references to dedication and commission, furthermore, remained largely constant: eleventh-century dedications resemble those from, say, the fourteenth century. By virtue of such uniformity, the data on dedications and commissions may well constitute a unique pool of evidence of social interaction in the Middle Ages, for it can be employed as statistical evidence in various regional, chronological, social, and institutional contexts, something that is very rare in medieval studies. The proposed paper will introduce the categories of information the database is to embrace and put forward for discussion the modus operandi of how the data on dedications and commissions will be harvested. Poster [abstract]
Making a bibliography using metadata National Library of Norway, Norway In this presentation we discuss how one might create a bibliography using metadata taken from libraries in conjunction with other sources. Since metadata such as topic keywords and Dewey Decimal Classification is digitally available, our focus is on metadata, although we also look at book contents where possible. Poster [abstract]
Network Analysis, Network Modeling, and Historical Big Data: The New Networks of Japanese Americans in World War II University of Helsinki Network analysis has become a promising methodology for studying a wide variety of systems, including historical populations. It brings new dimensions to the questions that social scientists and historians traditionally ask, and allows new questions that were previously impractical or impossible to answer using traditional methods. The increasing availability of digitized archival material and big data is making it ever more appealing. When coupled with custom algorithms and interactive visualization tools, network analysis can produce remarkable new insights. In my ongoing doctoral research, I employ network analysis and modeling to study the Japanese American incarceration in World War II (internment). Incarceration and the government-led dispersal of Japanese Americans disrupted the lives of some 110,000 people, including over 70,000 US citizens of Japanese ancestry, for the duration of the war and beyond. Many lost their former homes and enterprises and had to start their lives over after the war. Incarceration also had a very concrete impact on the communities: about 50% of those interned did not return to their old homes. This paper explores the changes that took place in the Japanese American community of the Heart Mountain Relocation Center in Wyoming. I will especially investigate the political networks and power relations of the incarceration community. My aim is twofold: on the one hand, to discuss the changes in networks caused by incarceration and dispersal, and on the other, to address some opportunities and challenges presented by the method for the study of history. Poster [abstract]
SuALT: Collaborative Research Infrastructure for Archaeological Finds and Public Engagement through Linked Open Data 1University of Helsinki, Department of Philosophy, History, Culture and Art Studies; 2Aalto University, Semantic Computing Research Group (SeCo); 3University of Helsinki, HELDIG – Helsinki Centre for Digital Humanities; 4National Board of Antiquities, Library, Archives and Archaeological Collections The Finnish Archaeological Finds Recording Linked Database (Suomen arkeologisten löytöjen linkitetty tietokanta – SuALT) is a concept for a digital web service catering for discoveries of archaeological material made by the public; especially, but not exclusively, metal detectorists. SuALT, a consortium project funded by the Academy of Finland and commenced in September 2017, has key outputs at every stage of its development. Ultimately it provides a sustainable output in the form of Linked Data, continuing to facilitate new public engagements with cultural heritage, and research opportunities, long after the project has ended. While prohibited in some countries, metal detecting is legal in Finland, provided certain rules are followed, such as prompt reporting of finds to the appropriate authorities and avoidance of legally protected sites. Despite misgivings by some about the value of researching metal-detected finds, others have demonstrated the potential of such research, for example in uncovering previously unknown artefact typologies. Engaging non-professionals with cultural heritage also contributes to the democratization of archaeology and empowers citizens. Metal detecting has grown rapidly in Finland in recent years. In 2011 the Archaeological Collections registered 31 stray finds, single or in assemblages. In 2014, over 2 700 objects were registered; in 2015, nearly 3 000; in 2016, over 2 500. When finds are reported correctly, their research value is significant. The Finnish Antiquities Act §16 obligates the finder of an object for which the owner is not known, and which can be expected to be at least 100 years old, to submit or report the object and associated information to the National Board of Antiquities (Museovirasto – NBA), the agency responsible for cultural heritage management in Finland. There is also a risk, as finders get older and eventually pass away, that their discoveries and collections will remain unrecorded and that all associated information will be lost permanently. In the current state of the art, while archaeologists increasingly use finds information and other data, utilization is still limited: data can be hard to find, and available open data remains fragmented. SuALT will speed up the process of recording finds data. Because much of this data will come from outside formal archaeological excavations, it may shed light on sites and features not usually picked up through 'traditional' fieldwork approaches, such as previously unknown conflict sites. The interdisciplinary approach and the inclusion of user research promote collaboration among the infrastructure's producers, processors and consumers. By linking in with European projects, SuALT enables not only national and regional studies but also contributes to international and transnational studies. This is significant for studies of different archaeological periods, for which the material culture usually transcends contemporary national boundaries.
Ethical aspects are challenging due to the debates around the engagement of cultural heritage professionals and researchers with metal detectorists and other artefact hunters, and we address head-on the wider questions around data sharing and knowledge ownership, and of working with human subjects. This includes the issues, as identified by colleagues working on similar projects elsewhere, around the concerns of metal detectorists and other finders about sharing findspot information. Finally, the usability of the datasets has to be addressed, considering for example controlled vocabularies to ease object-type categorization, interoperability with other datasets, and the mechanics of verification and publication processes. The project is unique in responding to the archaeological conditions in Finland, and in providing solutions to its users' needs within the context of Finnish society and cultural heritage legislation. While it focuses primarily on the metal detecting community, its results and the software tools developed are applicable more generally to other fields of citizen science in cultural heritage, and even beyond. For example, in many areas of collecting (e.g. coins, stamps, guns, or art), much cultural heritage knowledge as well as many collections are accumulated and maintained by skilful amateurs and private collectors. Fostering collaboration, and integrating and linking these resources with those in national memory organizations, would be beneficial to all parties involved and points to future applications of the model developed by SuALT. Furthermore, there is scope to integrate SuALT into wider digital humanities networks such as DARIAH (http://www.dariah.eu). Framing SuALT's development as a consortium enables us to ask important questions even at the development stages, with the benefit of expertise from diverse disciplines and research environments. The benefits of SuALT, aside from the huge potential for regional, national, and transnational research projects and international collaboration, are that it offers long-term savings on costs, shares expertise and provides greater sustainability than is currently possible. We will explore the feasibility of publishing the finds data through international aggregation portals, such as Europeana (http://www.europeana.eu) for cultural heritage content, as well as working closely with colleagues in countries that already have established national finds databases. The technical implementation also respects the enterprise architecture of the Finnish public administration. Existing open source solutions are further developed and integrated, for example the GIS platform Oskari.org (http://oskari.org) for geodata, developed by the National Land Survey, together with the Linked Data based Finnish Ontology Service of Historical Places and Maps (http://hipla.fi). SuALT's data is also disseminated through Finna (http://www.finna.fi), a leading service for searching cultural information in Finland. SuALT consists of three subprojects: subproject I, "User Needs and Public Cultural Heritage Interactions", hosted by the University of Helsinki; subproject II, "National Linked Open Data Service of Archaeological Finds in Finland", hosted by Aalto University; and subproject III, "Ensuring Sustainability of SuALT", hosted by the NBA.
The primary aim of SuALT is to produce an open Linked Data service which is used by data producers (namely the metal detectorists and other finders of archaeological material), by data researchers (such as archaeologists, museum curators and the wider public), and by cultural heritage managers (NBA). More specifically, the aims are: a. to discover and analyse the needs of potential users of the resource, and to factor these findings into its development; b. to develop metadata models and related ontologies for the data that take into account the specific needs of this particular infrastructure, informed by existing models; c. to develop the Linked Data model in a way that makes it semantically interoperable with existing cultural heritage databases within Finland; d. to develop the Linked Data model in a way that makes it semantically interoperable with comparable 'finds databases' elsewhere in Europe; and e. to test the data resulting from SuALT through exploratory research of the datasets for archaeological research purposes and for cultural heritage and collection management work. The project corresponds closely with the strategic plans of the NBA and responds to the growth of metal detecting in Finland. Internationally, it corresponds with the development of comparable schemes in other European countries and regions, such as Flanders (MetaaldEtectie en Archeologie – MEDEA, initiated in 2014), Denmark and the Netherlands (Digitale Metaldetektorfund or DIgital MEtal detector finds – DIME, and Portable Antiquities in the Netherlands – PAN, both initiated in 2016). It takes inspiration from the Portable Antiquities Scheme (PAS) Finds Database (https://finds.org.uk/database) in England and Wales. These all aspire to the ultimate goal of a pan-European research infrastructure, and will work together to seek a larger international collaborative research grant in the future. A contribution of our work in relation to the other European projects is to employ the Linked Data paradigm, which facilitates better interoperability with related datasets, additional data enrichment based on well-defined semantics and reasoning, and therefore better means of analysing and using the finds data in research and applications. The expected scientific impact is that the process of developing SuALT, including critically analysing comparable resources, user group research, and creating innovative solutions, will in itself produce a rich body of interdisciplinary academic output. This will be disseminated in peer-reviewed journals and at selected conferences across several disciplinary boundaries, including Computer Science, Archaeology, and Cultural Heritage Studies. It also links in, at a crucial moment in the development of digital heritage management, with parallel resources elsewhere in Europe. This means not only that a coordinated and international approach can be taken in development, but that it is extremely timely, taking advantage of the opportunity to benefit from the experiences and perspectives of colleagues pursuing similar resources. SuALT ensures that Finnish cultural heritage management is at the forefront of digital heritage. The project also carries out a small-scale 'test' project using the database as it forms, and in this way contributes to the field of artefact studies. The contribution to future knowledge sits at a number of levels.
There are technical challenges in creating the linked database in a way that complements and is interoperable with existing national and international infrastructures. Solving these challenges generates contributions to the understanding of digital data management and services. The process of consulting users represents an important case study in the formative evaluation of particular interest groups with regard to digital heritage and citizen science, as well as shedding further light on different perceptions and uses of cultural heritage. SuALT relates to the emerging trend of publishing open science data, facilitating the analysis and reuse of the data, exemplified by e.g. DataONE (http://www.dataone.org) and the Open Science Data Cloud (http://www.opensciencedatacloud.org). We hypothesise that SuALT will result in a sustainable digital data resource that responds to different user needs and supports high-quality archaeological research drawing on data from Finland. SuALT also enables integration with comparative data from abroad. Outputs throughout the development process represent important contributions to research into digital heritage applications and semantic computing, meeting the needs of the scientific community. The selected Linked Data methodology is suitable for archaeology and cultural heritage management due to the need to combine and connect heterogeneous data collections in the field (e.g. museum collections, finds databases abroad) and other datasets, such as vocabularies of places, persons, and time periods, benefiting cultural heritage professionals. Publishing the finds database as open data using standardised metadata formats facilitates the data's re-use, fostering new research by the scientific community but also the development of novel applications for professionals and citizens. Taking a strategic approach to the challenge of creating this resource, and treating it as a research project rather than developing an ad hoc resource, ensures that the project's legacy is a significant and long-term contribution to the digital curation of public-generated archaeological data. As its key societal impact, SuALT provides a vital interface for non-professionals to contribute to and benefit from Finland's archaeological record, and to integrate this with comparable datasets from abroad. The project enhances cooperation between non-professionals and cultural heritage managers. Careful user research ensures that SuALT offers means of engagement and access to data and other information that are usable and meaningful to a wide range of users, from metal detectorists and amateur historians through to professional curators, cultural heritage managers, and academic researchers, domestically and abroad. SuALT's results are not limited to metal detecting but have a wider impact: the same key challenges of engaging amateur collectors to collaborate with memory organization experts in citizen science are encountered in virtually all fields of collecting and maintaining tangible and intangible cultural heritage. The process of developing SuALT provides an unprecedented opportunity to research the use of digital platforms to engage the public with archaeological heritage in Finland. Inspired by successful initiatives such as PAS and MEDEA, the potential for individuals to self-record their finds also echoes the emerging use of crowdsourcing for public archaeology initiatives.
Thus, SuALT offers a significant opportunity to contribute to further understanding digital cultural heritage and its uses, including its role within society. It is likely that the coordination of SuALT with digital finds recording initiatives in other countries will lead to a transnational platform for finds recording, giving Finland an opportunity to be at the forefront of digital heritage-based citizen science research and development. Poster [abstract]
Identifying poetry based on library catalogue metadata University of Helsinki, Changes in printing reflect historical turning points: what has been printed, when, where and by whom all derive from contemporary events and situations. An intense need for war propaganda brings more pamphlets out of the printing presses; university towns produce dissertations, from which scientific development can be deduced; and strict oppression and censorship might allow only religious publications by government-approved publishers. The history of printing has been extensively studied and numerous monographs exist. However, most of the research has consisted of qualitative studies based on close reading, requiring a profound knowledge of the subject matter, yet still unable to verify the extent of new innovations. For example, close reading of library catalogues does not reveal, at least easily, the timeline of Luther's publications, or what portion of books actually were octavo-sized and when the increase in this format occurred. One source for these kinds of studies is national library metadata catalogues, which contain information about physical book size, page counts, publishers, publication places and so forth. These catalogues have been studied using quantitative analysis. The advantage of national library catalogues is that they are often more or less complete, having records of practically everything published in a certain country or linguistic area in a certain time period. The computational approach to them has enabled researchers to connect historical turning points to their effects on printing, and the impact of a new concept has been measured against the number of re-publications, or the spread, of a book introducing a new idea. What is more, linking library metadata to the full text of the books has made it possible to analyse changes in the usage of words in massive corpora, while still limiting analysis to relevant books. In all these cases, computational methods work better the more complete the corpus is. However, library catalogues often lack annotations for one reason or another: annotating resources might have been cut at a certain point in time, the annotation rules may have varied between different libraries in cases where catalogues have been amalgamated, or the rules could simply have changed. One area that is particularly important for subcorpus research is genre. The genre field, when annotated for each of the metadata records, could be used to restrict the corpus to contain exactly the books that are needed and nothing more. From this subset, timelines or graphs can be drawn based on bibliographic metadata, or, where full texts exist, the language or contents of a complete corpus can be analysed. Despite the significance of genre information, that particular annotation is often lacking. In the English Short Title Catalogue (ESTC), genre information exists for approximately one fourth of the records. This should be enough to train a machine learning model that infers the missing genre information, rather than relying solely on the annotations of librarians. The metadata field containing genre information in the ESTC can hold more than one value. In most cases this means having a category and its subcategories as different values, but not always. Because of the complex definition of genre in the ESTC, this paper focuses on one genre only: poetry.
Besides being a relatively common genre, poetry is also of interest to literary researchers. Having a nearly complete subset of English poetry would allow for large-scale quantitative poetry analysis. The downside of library metadata catalogues is that they contain merely the metadata, not the complete unabridged texts, which would be beneficial for machine learning modelling. I tackled this shortcoming by creating several feature sets, each grouping similar features. The main ingredient for these feature sets was a concatenation of the main title and the subtitle from the library metadata. From these concatenations I created one feature set containing easily calculable features known from the earliest stylometric research, such as word counts and sentence lengths. Another set I collected with a bag-of-words method, taking the frequencies of the most common words from a subset of poetry book titles. I also built one set for part-of-speech (POS) tags and another one for POS trigrams. Some feature sets were extracted from the other metadata fields. Physical book size, page count, topic, and the same author having published a poetry book proved useful in the classification. From these feature sets I handpicked the best-performing features into one superset. The resulting model performed very well: despite the compactness of the metadata, the poetry books could be identified with a precision over 90% and a recall over 86%. I then made another run with the superset to find the poetry books which did not have the genre field annotated in the catalogue. Combining the results from the run with close reading revealed over 14,000 unannotated poetry books. I sampled one hundred each of the predicted poetry and non-poetry books to manually estimate the correctness of the predictions, and found an annotation bias in the catalogue. The bias seems to stem from the fact that the genre information has been annotated more frequently for broadside poetry books than for other broadsides. Excluding broadsides from my samples, I got a recall of 94% and a precision of 98%. My research strongly suggests that semi-supervised learning can be applied to library catalogues to fill in missing annotations, but that this requires close attention to avoid possible pitfalls.
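As a rough, self-contained sketch of this kind of title-based genre classification (assuming scikit-learn; the features shown are a simplification, not the author's exact feature sets):

# Minimal illustration: classify records as poetry/non-poetry from title text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training data: title + subtitle concatenations for the roughly 25% of
# records that carry genre annotations (toy examples here).
titles = ["Poems on several occasions", "A sermon preached at St. Paul's"]
labels = [1, 0]  # 1 = poetry, 0 = other

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # bag-of-words over title text
    LogisticRegression(max_iter=1000),
)
model.fit(titles, labels)

# Predict genre for records whose genre field was never annotated.
print(model.predict(["The pleasures of hope, with other poems"]))

Poster [publication ready]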
Open Digital Humanities: International Relations in PARTHENOS University of Copenhagen, CLARIN ERIC Research infrastructures are one of the strongest instruments for the promotion of Open Science in the Digital Humanities. PARTHENOS is a European research infrastructure project, built primarily upon collaboration between the two large research infrastructures in the humanities, CLARIN and DARIAH, plus a number of other initiatives. PARTHENOS aims at strengthening the cohesion of research in the broad sector of Linguistic Studies, Humanities, Cultural Heritage, History, Archaeology and related fields. This is the context in which we should see the efforts related to international liaisons. This effort takes its point of departure in the existing international relations, so the first action was to collect information and to analyse it along different dimensions. Secondly, we want to analyse the purpose and aims of international collaboration. There are many ideas about how the international network may be strengthened and exploited, so that higher quality is obtained and more data, tools and services are shared. The main task of the next year will be first to agree on a strategy and then to implement it in collaboration with the rest of the project. By doing so, the PARTHENOS partners will be contributing even more to the European Open Science Policies. Poster [abstract]
The New Face of Ethnography: Utilizing Cyberspace as an Alternative Study Site University of California, Merced, American adoption has a familiar mission to find families for children but becomes strange when turned on its head and exposed as an institution that instead finds children for families who are willing to pay any price for a child. Its evolution, from orphan trains to open adoptions, has answered questions about biological associations but has conflated the interconnection of identity with conflicting narratives of community, kinship and self. How do the experiences of the adoption constellation reconceptualize the national image of adoption as a win-win solution to a social problem? My research explores the language utilized in multiple adoption narratives to determine the individual and universal feelings that adoptees, birth parents, and adoptive parents experience regarding the transfer of children in the United States, and the long-term emotional outcomes for these groups. My approach to ethnographic research is a hybrid digital and humanistic one, using online and offline interactions to gather data. As is the case with any methodology, online ethnography presents both benefits and problems. On the plus side, online communities break down the walls of networks, creating digitally mediated social spaces. The Internet provides a platform for social interactions where real and virtual worlds shift and conflate. Social interactions in cybernetic environments present another option for social researchers and offer significant advantages for data collection, collaboration, and the maintenance of research relationships. For some research subjects, such as members of the adoption constellation, locating target groups presents challenges for domestic adoption researchers. Online groups, such as Facebook pages dedicated to specific members of the adoption triad, offer a resolution to this challenge, acting as self-sorted focus groups with participants eager to provide their narratives and experiences. Ethnography involves understanding how people experience their lives through observation and non-directed interaction, with the goal of observing participants' behavior and reactions on their own terms; this can be achieved through the presumed anonymity of online interaction. Electronic ethnography provides valuable insights and data; however, on the negative side, the danger of groupthink in Facebook communities can both attract and generate homogeneous experiences regarding adoption issues. I argue that the benefits of online ethnography outweigh the problems, and that it can provide important, previously unexpressed views with which to better analyze topics such as the adoption experience: cybernetic environments remain a fluid yet stable alternative social space, offering significant advantages for data collection, collaboration, and the maintenance of research relationships. Late-Breaking Work
Elias Lönnrot Letters Online Finnish Literature Society The correspondence of Elias Lönnrot (1802–1884, doctor, philologist and creator of the national epic Kalevala) comprises 2,500 letters or drafts written by Lönnrot and 3,500 letters received. Elias Lönnrot Letters Online (http://lonnrot.finlit.fi/omeka/), first published in April 2017, is the conclusion of several decades of research, of transcribing and digitizing letters and of writing commentaries. The online edition is designed not only for those interested in the life and work of Lönnrot himself, but more generally for scholars and the general public interested in the work and mentality of the Finnish 19th-century nationalistic academic community, their language practices in both Swedish and Finnish, and the study of epistolary culture. The rich, versatile correspondence offers source material for research in biography, folklore studies and literary studies; for general history as well as medical history and the history of ideas; for the study of ego documents and networks; and for corpus linguistics and the history of language. As of January 2018, the edition contains about 2,000 letters and drafts of letters sent by Lönnrot, mostly private letters. The official letters, such as the medical reports submitted by Lönnrot in his office as a physician, will be added during 2018. The final stage will involve finding a suitable way of publishing the approximately 3,500 letters that Lönnrot received. The edition is built on the open-source publishing platform Omeka. Each letter and draft of a letter is published as facsimile images and an XML/TEI5 file, which contains the metadata and transcription. The letters are organised into collections according to recipient, with the exception of, for example, Lönnrot's family letters, which are published in a single collection. An open text search covers the metadata and transcriptions. This is a faceted search powered by Apache Solr, which allows limiting the initial search by collection, date, language, type of document and writing location. In addition, Omeka's own search can be used to find letters based on a handful of metadata fields. The solutions adopted for the Lönnrot edition differ in some respects from the established practices of digital publishing of manuscripts in the humanities. In particular, the TEI encoding of the transcriptions is lighter than in many other scholarly editions. Lönnrot's own markings – underlinings, additions, deletions – and unclear and indecipherable sections in the texts are encoded, but place and personal names are not. This is partially due to the extensive amount of work such detailed encoding would require, and partially because the open text search provides quick and easy access to the same information. The guiding principle of Elias Lönnrot Letters Online is openness of data. All the data contained in the edition is made openly available. Firstly, the XML/TEI5 files are available for download, and researchers and other users are free to modify them for their own purposes. Users can download the XML/TEI5 files of all the letters, or of a smaller section such as an individual collection. The feature is also integrated into the open text search, and can be used both for all the results produced by a search and for a smaller section of the results limited by one or more facets.
An individual researcher can thus download the XML files of the letters and study them, for example with the linguistic tools provided by the Language Bank of Finland. Similarly, the raw data is available for processing and modifying by those researchers who use and develop digital humanities tools and methods to solve research questions. Secondly, the letter transcriptions are made available for download as plain text. Data in this format is needed for qualitative analysis tools like Atlas. In addition, not all researchers in the humanities need XML files; many will benefit from the ability to store relevant data in an easily readable format. Thirdly, users of the edition can export the statistical data contained in the facet listing of each search result for processing and visualization with tools like Excel. Statistical data like this is significant in handling large masses of data, as it can reveal aspects that would remain hidden when examining individual documents. For example, it may be relevant to a researcher in what era and with whom Lönnrot primarily discussed a given theme. The statistical data of the facet search readily reveals such information, while compiling such statistics by manually going through thousands of letters would be an impossibly long process. The easy availability of data in Elias Lönnrot Letters Online will hopefully foster collaboration and enrich research in general. The SKS is already collaborating with FIN-CLARIN and the Language Bank, which have received the XML/TEI5 files. As Lönnrot's letters form an exceptionally large collection of manuscripts written by one hand, a section of the letters together with their transcriptions was given to the international READ project, which is working to develop machine recognition of old handwritten texts. A third collaborating partner is the project "STRATAS – Interfacing structured and unstructured data in sociolinguistic research on language change".
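Since the edition exposes its transcriptions as XML/TEI5, reducing a downloaded letter to plain text for further processing is straightforward. A minimal sketch using only the Python standard library (the file name is hypothetical, and the edition's exact TEI structure may differ in detail):

import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # standard TEI namespace

def tei_to_text(path):
    """Return the transcription text of a TEI letter, markup stripped."""
    root = ET.parse(path).getroot()
    body = root.find(f".//{TEI}text/{TEI}body")
    # itertext() walks all text content, dropping the encoding of
    # underlinings, additions and deletions described above.
    return " ".join("".join(body.itertext()).split())

print(tei_to_text("lonnrot_letter_001.xml"))  # hypothetical file name

Late-Breaking Work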
KuKa Digi project University of Helsinki This poster presents a sample of the Cultural Studies BA program's Digital Leap project, called KuKa Digi. The Digital Leap is a university-wide project that aims to support digitalization in both learning and teaching in the new degree programs at the University of Helsinki. For more information on the University of Helsinki's Digital Leap program, please refer to: http://blogs.helsinki.fi/digiloikka/ . The new Bachelor's Program in Cultural Studies was among the projects selected for the 2018-2019 round of the Digital Leap. The primary goal of the KuKa Digi project is to produce meaningful digital material for both teaching and learning purposes. The project aims to develop the program's courses, learning environments and materials in a more digital direction. Another goal is to produce an introductory MOOC course on Cultural Studies for university students, as well as for students studying for their A-levels who may be planning to apply for the Cultural Studies BA program. Finally, we will write a research article to assess the use of digital environments in teaching and learning processes within the Cultural Studies BA program. The KuKa Digi project encourages students and teachers to co-operatively plan digital learning environments that are also useful in building up students' academic portfolios and enhancing their working-life skills. The core idea of the project is to create a digital platform or database for teachers, researchers and students in the field of Cultural Studies. Academic networking sites do exist; however, they are not without issues. Many of them are either not accessible or not very useful for students who have not yet taken their academic careers very far. In addition, some of these sites are only partially free of charge. The digital platform will act as a place where students, teachers and researchers alike have the opportunity to network, advertise their expertise and specialization, and come into contact with the media, cultural agencies, companies and much more. The general vision for this platform is that it will be user-friendly and flexible, acting as an "academic LinkedIn". The database will be available in Finnish, Swedish and English. It will include the current students, teachers and experts who are associated with the program. Furthermore, the platform will include a feature called the digital portfolio. This will be especially useful for our students, as it is intended to be a digital tool with which they can develop their own expertise within the field of Cultural Studies. Finally, the portfolio will act as a digital business card for the students. The project poster presented at the conference illustrates the ideas and concepts for the platform in more detail. For more information on the project and its other goals, please refer to the project blog at: http://blogs.helsinki.fi/kuka-digi/ Late-Breaking Work
Topic modelling and qualitative textual analysis University of Helsinki, The pursuit of big data is transforming qualitative textual analysis—a laborious activity that has conventionally been executed manually by researchers. Access to data of unprecedented scale and scope has created a need both to analyse large data sets efficiently and to react to their emergence in a near-real-time manner (Mills, 2017). As a result, research practices are also changing. A growing number of scholars have experimented with using machine learning as the main or a complementary method for text analysis. Even if the most audacious assumptions ‘on the superior forms of intelligence and erudition’ of big data analysis are today critically challenged by qualitative and mixed-method researchers (Mills, 2017: 2), it is imperative for scholars using qualitative methods to consider the role of computational techniques in their research (Janasik, Honkela and Bruun, 2009). Social scientists are especially intrigued by the potential of topic modelling (TM), a machine learning method for big data analysis (Blei, 2012), as a tool for the analysis of textual data. This research contributes to a critical discussion in social science methodologies: how topic modelling can concretely be incorporated into existing processes of qualitative textual analysis and interpretation. Some recent studies have paid attention to the methodological dimensions of TM vis-à-vis textual analysis. However, these developments remain sporadic, exemplifying the need for a systematic account of the conditions under which TM can be useful for social scientists engaged in textual analysis. This paper builds upon the existing discussions and takes a step further by comparing the assumptions, analytical procedures and conventional usage of qualitative textual analysis methods and TM. Our findings show that for content and classification methods, embedding TM into the research design can partially and, arguably, in some cases fully automate the analysis. Discourse and representation methods can be augmented with TM in a sequential mixed-method research design. Summing up, we see avenues for TM in both embedded and sequential mixed-method research designs. This is in line with previous work on mixed-method research that has challenged the traditional assumption of a clear division between qualitative and quantitative methods. Scholarly capacity to craft a robust research design depends on researchers' familiarity with specific techniques, their epistemological assumptions, and good knowledge of the phenomena being investigated, to facilitate the substantial interpretation of the results. We expect this research to help identify and address the critical points, thereby assisting researchers in the development of novel mixed-method designs that unlock the potential of TM in qualitative textual analysis without compromising methodological robustness. Blei, D. M. (2012) ‘Probabilistic topic models’, Communications of the ACM, 55(4), p. 77. Janasik, N., Honkela, T. and Bruun, H. (2009) ‘Text Mining in Qualitative Research’, Organizational Research Methods, 12(3), pp. 436–460. Mills, K. A. (2017) ‘What are the threats and potentials of big data for qualitative research?’, Qualitative Research, p. 146879411774346.
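For readers unfamiliar with TM, a minimal sketch of what a topic model produces, using LDA (Blei, 2012) as implemented in scikit-learn; the corpus and parameters are illustrative only, not those of our study:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["text of document one ...", "text of document two ..."]  # placeholder corpus

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(X)

# Each topic is a distribution over words; print its top words for the
# qualitative interpretation step discussed above.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[-8:][::-1]
    print(k, [vocab[i] for i in top])

The output (topics as ranked word lists, plus per-document topic proportions via lda.transform) is the raw material that a qualitative researcher then interprets, which is where the mixed-method designs discussed above come in.

Late-Breaking Work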
Local Letters to Newspapers - Digital History Project University of Tampere, The Centre of Excellence in the History of Experiences (HEX) Local Letters to Newspapers is a digital history project of the Academy of Finland Centre of Excellence in the History of Experiences HEX (2018–2025), hosted by the University of Tampere. The objective is to make available a new kind of digital research material from 19th- and early 20th-century Finnish society: a database of readers' letters submitted to the Finnish press that can be studied both qualitatively and quantitatively. The database will allow analyzing 19th- and 20th-century global reality through a case study of Finnish society. It will enable a wide range of research topics and open a path to various research approaches, especially the study of human experiences. Late-Breaking Work
Lessons Learned from Historical Pandemics. Using crowdsourcing 2.0 and citizen science to map the Spanish Flu's spatial and social network. Aarhus City Archives By Søren K. Poder, MA in History & Astrid Lykke Birkving, MA in Intellectual History, Aarhus City Archives | Redia a/s In 1918 the world was struck by the most devastating disease in recorded history - today known as the Spanish Flu. In less than one year nearly two-thirds of the world's population came down with influenza, of whom between forty and one hundred million people died. The Spanish Flu of 1918 did not originate in Spain, but most likely on the North American east coast in February 1918. By the middle of March, the influenza had spread to most of the overcrowded American army camps, from where it was soon carried to the trenches in France and the rest of the world. This part of the story is well known. In contrast, the diffusion of the 1918 pandemic - and of the seasonal epidemics, for that matter - on the regional and local level is still largely obscure. For instance, explanations of why epidemics evidently tend to follow significantly different paths in different urban areas that otherwise seem to share a common social, commercial and cultural profile tend to be theoretical rather than based on evidence, for one sole reason: the lack of adequate data. As part of the incessant scientific interest in historical epidemics, the purpose of this research project is to identify the social, economic and cultural preconditions that most likely determine a given type of locality's ability to spread or halt an epidemic's hierarchical diffusion. Crowdsourcing 2.0 To meet these ends, large amounts of data from a variety of different historical sources have to be collected and linked together. To do this we use traditional crowdsourcing techniques, where volunteers participate in transcribing different historical documents: death certificates, censuses, patient charts etc. But just as importantly, the collected transcriptions form the basis for a text recognition ML module that in time will be able to recognize specific entities in a document - persons, places, diagnoses, dates etc. Late-Breaking Work
Analysing Swedish Parliamentary Voting Data University of Gothenburg, We used publicly available data from voting sessions in the Swedish Parliament to represent each member of parliament (MP) as a vector in a space defined by their voting record between the years 2014 and 2017. We then applied matrix factorization techniques that enabled us to find insightful projections of this data. Namely, they allowed us to assess the degree to which MPs cluster according to their party line, while at the same time identifying MPs whose voting record is closer to other parties'. They also provided a data-driven, multi-dimensional political compass with which to ascertain similarities and differences between MPs and political parties. Currently, the axes of the compass are unlabeled and therefore lack a clear interpretation, but we plan to apply language technology to the parliamentary discussions associated with the voting sessions in order to identify the topics associated with these axes.
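A minimal sketch of this kind of analysis, assuming numpy and scikit-learn and an illustrative vote encoding (yes = 1, no = -1, abstain/absent = 0); PCA stands in here for the matrix factorization step, and the random matrix is a placeholder for the real voting records:

import numpy as np
from sklearn.decomposition import PCA

# One row per MP (the Riksdag has 349 seats), one column per voting session.
rng = np.random.default_rng(0)
votes = rng.choice([-1, 0, 1], size=(349, 500))  # placeholder data

coords = PCA(n_components=2).fit_transform(votes)
# coords[i] places MP i on a two-dimensional "compass"; MPs with similar
# voting records land close together, so party clustering and outliers
# can be read off the projection.

Late-Breaking Work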
Automated Cognate Discovery in the Context of Low-Resource Sami Languages University of Helsinki 1 Introduction The goal of our project is to automatically find candidates for etymologically related words, known as cognates, for different Sami languages. At first, we will focus on North Sami, South Sami and Skolt Sami nouns by comparing their inflectional forms with each other. The reason why we look at the inflections is that, in Uralic languages, it is common that there are changes in the word stem when the word is inflected in different cases. When finding cognates, the non-nominative stems might reveal more about a cognate relationship in some cases. For example, the South Sami word for arm, gïete, is closer to the partitive form of the Finnish word, kättä, than to its nominative form käsi. The fact that a great deal of previous work already exists related to etymologies of words in different Sami languages [2, 4, 8] provides us with an interesting test bed for developing our automatic methods. The results can easily be validated against databases such as Álgu [1], which incorporates the results of different studies in Sami etymology in a machine-readable database. With the help of a gold corpus, such as Álgu, we can perfect our method to function well in the case of the three aforementioned Sami languages. Later, we can expand the set of languages used to other Uralic languages such as Erzya and Moksha. This is achievable as we are basing our method on the data and tools developed in the Giellatekno infrastructure [11] for Uralic languages. Giellatekno has a harmonized set of tools and dictionaries for around 20 different Uralic languages, allowing us to bootstrap more languages into our method. 2 Related Work In historical linguistics, cognate sets have traditionally been identified using the comparative method, the manual identification of systematic sound correspondences across words in pairs of languages. Along with the rapid increase in digitally available language data, computational approaches to automate this process have become increasingly attractive. Computationally, automatic cognate identification can be considered a problem of clustering similar strings together, according to pairwise similarity scores given by some distance metric. Another approach to the problem is pairwise classification of word pairs as cognates or non-cognates. Examples of common distance metrics for string comparison include edit distance, longest common subsequence, and the Dice coefficient. The string edit distance is often used as a baseline for word comparison, measuring word similarity simply as the number of character or phoneme insertions, deletions, and substitutions required to make one word equivalent to the other. However, in language change, certain sound correspondences are more likely than others. Several methods rely on such linguistic knowledge by converting sounds into sound classes according to phonetic similarity [?]. For example, [15] consider a pair of words to be cognates when they match in their first two consonant classes. In addition to such heuristics, a common approach to automatic cognate identification is to use edit distance metrics with weightings based on previously identified regular sound correspondences. Such correspondences can also be learned automatically by aligning the characters of a set of initial cognate pairs [3, 7].
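The edit distance baseline mentioned above is simple enough to state in a few lines of Python (standard library only; real systems weight the operations by sound correspondences rather than counting them uniformly):

def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# The South Sami example above: gïete is closer to the Finnish partitive
# kättä (distance 4) than to the nominative käsi (distance 5).
print(edit_distance("gïete", "kättä"), edit_distance("gïete", "käsi"))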
In addition to sound correspondences, [14] and [6] also utilise semantic information about word pairs, as cognates tend to have similar, though not necessarily equivalent, meanings. Another method heavily reliant on prior linguistic knowledge is the LexStat method [9], requiring a sound correspondence matrix and semantic alignment. However, in the context of low-resource languages, prior linguistic knowledge such as initial cognate sets, semantic information, or phonetic transcriptions is rarely available. Therefore, cognate identification for low-resource languages calls for unsupervised approaches. For example, [10] address this issue by investigating edit distance metrics based on embedding characters into a vector space, where character similarity depends on the set of characters they co-occur with. In addition, [12] investigate several unsupervised approaches such as hidden Markov models and pointwise mutual information, while also combining these with heuristic methods for improved performance. 3 Corpus The initial plan is to base our method on the nominal XML dictionaries for the three Sami languages available in the Giellatekno infrastructure. Apart from translations, these dictionaries also contain additional lexical information to a varying degree. The additional information which might benefit our research goals includes cognate relationships, semantic tags, morphological information, derivation and example sentences. For each noun in the noun dictionaries, we produce a list of all its inflections in different grammatical numbers and cases. This is done by using a Python library called UralicNLP [5], specialized in NLP for Uralic languages. UralicNLP uses FSTs (finite-state transducers) from the Giellatekno infrastructure to produce the different morphological forms. We are also considering the possibility of including larger text corpora in these languages as part of our method for finding cognates. However, these languages have notoriously small corpora available, which might render them insufficient for our purposes. 4 Future Work Our research is currently at its early stages. The immediate future task is to start implementing different methods based on the previous research to solve the problem. We will first start with edit distance approaches to see what kind of information those can reveal, and move towards a more complex solution from there. A longer-term future plan is to include more languages in the research. We are also interested in collaboration with linguists who could take a more qualitative look at the cognates found by our method. This will nourish interdisciplinary collaboration and the exchange of ideas between scholars of different backgrounds. We are also committed to releasing the results produced by our method for a wider audience to use and profit from. This will be done by including the results as part of the XML dictionaries in the Giellatekno infrastructure and also by releasing them in an open-access MediaWiki-based dictionary for Uralic languages [13] developed at the University of Helsinki. References 1. Álgu-tietokanta. Saamelaiskielten etymologinen tietokanta (Nov 2006), http://kaino.kotus.fi/algu/ 2. Aikio, A.: The Saami loanwords in Finnish and Karelian. Ph.D. thesis, University of Oulu, Faculty of Humanities (2009) 3. Ciobanu, A.M., Dinu, L.P.: Automatic detection of cognates using orthographic alignment.
In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). vol. 2, pp. 99–105 (2014) 4. Häkkinen, K.: Suomen kirjakielen saamelaiset lainat. Teoksessa Sámit, sánit, sátnehámit. Riepmočála Pekka Sammallahtii miessemánu 21, 161–182 (2007) 5. Hämäläinen, M.: UralicNLP (Jan 2018), https://doi.org/10.5281/zenodo.1143638, doi: 10.5281/zenodo.1143638 6. Hauer, B., Kondrak, G.: Clustering semantically equivalent words into cognate sets in multilingual lists. In: Proceedings of the 5th International Joint Conference on Natural Language Processing. pp. 865–873 (2011) 7. Kondrak, G.: Identification of cognates and recurrent sound correspondences in word lists. TAL 50(2), 201–235 (2009) 8. Koponen, E.: Lappische Lehnwörter im Finnischen und Karelischen. Lapponica et Uralica. 100 Jahre finnisch-ugrischer Unterricht an der Universität Uppsala. Vorträge am Jubiläumssymposium 20.–23. April 1994, pp. 83–98 (1996) 9. List, J.M., Greenhill, S.J., Gray, R.D.: The potential of automatic word comparison for historical linguistics. PloS one 12(1), e0170046 (2017) 10. McCoy, R.T., Frank, R.: Phonologically informed edit distance algorithms for word alignment with low-resource languages. Proceedings of 11. Moshagen, S.N., Pirinen, T.A., Trosterud, T.: Building an open-source development infrastructure for language technology projects. In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013); May 22-24; 2013; Oslo University; Norway. NEALT Proceedings Series 16. pp. 343–352. No. 85, Linköping University Electronic Press; Linköpings universitet (2013) 12. Rama, T., Wahle, J., Sofroniev, P., Jäger, G.: Fast and unsupervised methods for multilingual cognate clustering. arXiv preprint arXiv:1702.04938 (2017) 13. Rueter, J., Hämäläinen, M.: Synchronized MediaWiki based analyzer dictionary development. In: Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages. pp. 1–7 (2017) 14. St Arnaud, A., Beck, D., Kondrak, G.: Identifying cognate sets across dictionaries of related languages. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 2519–2528 (2017) 15. Turchin, P., Peiros, I., Murray, G.M.: Analyzing genetic connections between languages by matching consonant classes. Vestnik RGGU. Seriya "Filologiya. Voprosy yazykovogo rodstva", (5 (48)) (2010) Late-Breaking Work
Dissertations from Uppsala University 1602-1855 on the internet Uppsala University, Uppsala University Library At Uppsala University Library, a long-term project is under way which aims at making the dissertations, that is theses, submitted at Uppsala University in 1602-1855 easy to find and read on the Internet. The work includes metadata production, scanning and OCR processing, as well as publication of images of the dissertations in full-text searchable pdf files. So far, approximately 3,000 dissertations have been digitized and made accessible on the Internet via the DiVA portal, Uppsala University's repository for research publications. All in all, there are about 12,000 dissertations, of about 20 pages each on average, to be scanned. This work is done by hand, due to the age of the material. The project aims to be completed in 2020. Why did we prioritize dissertations? Even before the project started, dissertations were valued research material, and the physical dissertations were frequently on loan. Their popularity was primarily due to the fact that, generally, studying university dissertations is a great way to study developments and changes in society. In the same way as doctoral theses do today, the older dissertations reflect what was going on in the country, at the University, and in the intellectual Western world on the whole at a certain period of time. Their great mass makes them especially suitable for comparative and longitudinal studies, and provides excellent chances for scholars to find material little used or not used at all in previous research. Older Swedish dissertations, specifically including those of today's Finland, are also comparatively easy to find. In contrast to the situation at many other European libraries with an even longer history, collectors published bibliographies of Swedish dissertations as far back as 250 years ago. Our dissertations are also organized, bound and physically easily accessible. Last year the cataloguing of the Uppsala dissertations according to modern standards was completed in LIBRIS. That made them searchable by subject and word in title, which was not possible before. All this made the digitization process smoother than that of many other kinds of cultural heritage material. The digital publication of the dissertations naturally made access to them even easier for University staff and students as well as lifelong learners in Sweden and abroad. How are the dissertations used today? In actual research today, we see that the material is frequently consulted in all fields of history. Dissertations provide scholars in the fields of history of ideas and history of science with insight into the status of a certain subject matter in Sweden in various periods of time, often in relation to the contemporary discussion on the European continent. The same goes for studies in the history of literature and the history of religion. Many of the dissertations examine subjects that remain part of the public debate today, and are therefore of interest for scholars in the political and social sciences. The languages of the dissertations are studied by scholars of Semitic, Classical and Scandinavian languages, and the dissertations often contain the very first editions and translations of certain ancient manuscripts in Arabic and Runic script. There is also a social dimension of the dissertations worthy of attention, as dedications and gratulatory poems in the dissertations mirror social networks in the educated stratum of Sweden in various periods of time.
Illustrations in the dissertations were often made by local artists or the students themselves, and the great mass of gratulatory poems mirrors the less well-known side of poetry in early modern Sweden. Our users The users of the physical items are primarily university scholars, mostly from our own University, but there is also quite a great deal of interest from abroad, not least from our neighboring country Finland and from the Baltic States, which were for some time within the Swedish realm. Many projects are going on right now which include our dissertations as research material or which have them as their primary source material, Swedish projects as well as international ones. As Sweden, as a part of learned Europe, more or less shared the values, objects and methods of the Western academic world as a whole, to study Swedish science and scholarship is to study an important part of Western science and scholarship. As for who uses our digital dissertations, we in fact do not know. The great majority of the dissertations are written in Latin, since in all countries of Europe and North America, Latin was the vehicle for academic discussion in the early modern age. In the first half of the 19th century, Swedish became more common in the Uppsala dissertations. Among the ones digitized and published so far, a great many are in Swedish. As for the Latin ones, they too are clearly much used. Although knowledge of Latin is quite unusual in Sweden, foreign scholars in the various fields of history often had Latin as part of their curriculum. Obviously, our users know at least enough Latin to recognize whether a passage treats the topic of their interest. They can also identify which documents are important to them and extract the most important information from them. If the document is central, it is possible to hire a translator. But we believe that we also reach out to lifelong learners, or so-called "ordinary people". The older dissertations examine every conceivable subject, and they offer pleasant reading even for non-specialists, or people who use the Internet for genealogical research. The full-text publication makes a dissertation show up, perhaps unexpectedly, when a person is looking for a certain topic or a certain word. Whoever the users are, the digital publication of the dissertations has been well received, far beyond expectations, even though we don't – or perhaps because we don't – either offer or demand advanced technologies for the use of these dissertations. The first three test years, with approximately 2,500 digitized dissertations published, resulted in close to one million visits and over 170,000 downloads, i.e. over 4,700 per month. The digital publication and the new possibilities for research The database in which the dissertations are stored and presented is the same database in which researchers, scholars and students of Uppsala University, and other Swedish universities too, currently register their publications, with the option to publish them digitally. This clears a path for new possibilities for researchers to become aware of and study the texts. Most importantly, it enables users to find documents in their field spanning a period of 400 years in one search session. A great deal of the medical terms for diseases and body parts, chemical designations and, of course, juridical and botanical terms are Latin and the same as were used 400 years ago, and can thus be used for localizing text passages on these topics. But the form of the text can be studied, too.
Linguists would find it useful to make quantitative studies of the use of certain words or expressions, or just to find words of interest for further study. The usefulness of full-text databases is well known to us all. But as a user, one often gets either a well-working search system or a great mass of important texts, and seldom both. This problem is solved here by the interconnection between the publication database DiVA and the Swedish national research library system LIBRIS. The combination makes it possible to use an advanced search system with high functionality, thus reducing the Internet problem of too many irrelevant hits. It gives direct access to the digital full text in DiVA, and the option to order the physical book if the scholar needs to see the original at our library. Not least important, there is qualified staff appointed to care for the system's long-term maintenance and updates as part of their everyday tasks at the University Library. Also, the library is open for discussion with users. The practical work within the project and related issues As part of the digitization project, the images of the text pages are OCR-processed in order to create searchable full-text pdf files. The OCR process gives varying results depending on the age and the language of the text. The OCR processing of dissertations in Swedish and Latin from ca. 1800 onwards results in OCR texts with a high degree of accuracy, that is, between 80 and 90 per cent, whereas older dissertations in Latin and in languages written in other alphabets will contain more inaccuracies. On this point we are not satisfied. Almost perfect results for the OCR-read text, or proof-reading, are a basic requirement for the full use and potential of this material. However, in this respect we are dependent upon the technology available on the market, as this provides the best and safest product. These products were not developed for handling printing types of various sorts and sizes from the 17th and 18th centuries, and the development of these techniques, except when it comes to "Fraktur", is slow or non-existent. If you want to pursue further studies of the documents, you can download them for free to your own computer. There are free programs on the Internet that help you merge several documents of your choice into one document, in order to be able to search through a certain mass of text. If you are searching for something very particular, you could of course also perform a word search in Google. One of our wishes for the future is to make it possible for our users to search in several documents of their specific choice at one time, without having to download the documents to their computer. So, most important for us today within the dissertation project: 1) Better OCR for older texts 2) Easier ways to search in a large text mass of your own choice. Future use and collaboration with scholars and researchers The development of digital techniques for the further use of these texts is a future desideratum. We therefore aim to increase our collaboration with researchers who want to explore new methods to make more out of the texts. However, we always have to take into account the special demands society places on the work we, as an institute of the state, are conducting – in contrast to the work conducted by e.g. Google Books or research projects with temporary funding.
We are expected to produce both images and metadata of a reasonably high quality – a product that the University can ‘stand for’. What we produce should have a lasting value – and ideally be possible to use for centuries to come. What we produce should be compatible with other existing retrieval systems and library systems. Important, in my opinion, are reliability and citability. A great problem with research on born-digital material is, in my opinion, that it constantly changes, with respect to both its contents and where to find it. This puts the fundamental principle of modern science, the possibility to verify results, out of the running. This is a challenge for Digital Humanities which, at the current pace of development, will surely be solved in the near future. Late-Breaking Work
Normalizing Early English Letters for Neologism Retrieval University of Helsinki Introduction Our project studies social aspects of innovative vocabulary use in early English letters. In this abstract we describe the current state of our method for detecting neologisms. The problem we are facing at the moment is the fact that our corpus consists of non-normalized text. Therefore, spelling normalization is the first step we need to solve before we can apply automatic methods to the whole corpus. Corpus We use the CEEC (Corpora of Early English Correspondence) [9] as the corpus for our research. The corpus consists of letters ranging from the 15th century to the 19th century, and it represents a wide social spectrum, richly documented in the metadata associated with the corpus, including information on e.g. socioeconomic status, gender, age, domicile and the relationship between the writer and recipient. Finding Neologisms In order to find neologisms, we use the information on the earliest attestation of words recorded in the Oxford English Dictionary (OED) [10]. Each lemma in the OED has information about its attestations, but also variant spelling forms and inflections. We proceed in automatically finding neologism candidates as follows. We get a list of all the individual words in the corpus, and we retrieve their earliest attestations from the OED. If we find a letter where a word has been used before the earliest attestation recorded in the OED, we are dealing with a possible neologism, such as the word "monotonous" in (1), which antedates the first attestation date given in the OED by two years (1774 vs. 1776); a schematic sketch of this test is given at the end of this abstract. (1) How I shall accent & express, after having been so long cramped with the monotonous impotence of a harpsichord! (Thomas Twining to Charles Burney, 1774; TWINING_017) The problem, however, is that our corpus consists of texts written in different time periods, which means that there is a wide range of alternative spellings for words. Therefore, a great part of the corpus cannot be directly mapped to the OED. Normalizing with the Existing Methods Part of the CEEC (from the 16th century onwards) has been normalized with VARD2 [3] in a semi-automated manner; however, the automatic normalization is only applied to sufficiently frequent words, whereas neologisms are often rare words. We take these normalizations and extrapolate them over the whole corpus. We also used MorphAdorner [5] to produce normalizations for the words in the corpus. After this, we compared the newly normalized forms with those in the OED, taking into account the variant forms listed in the OED. NLTK's [4] lemmatizer was used to produce lemmas from the normalized inflected forms to map them to the OED. In doing so, we were able to map 65,848 word forms of the corpus to the OED. However, around 85,362 word forms still remain without a mapping to the OED. Different Approaches For the remaining non-normalized words, we have tried a number of different approaches: hand-written rules, SMT, NMT, and a combination of edit distance, semantics and pronunciation. The simplest of these is running the hand-written VARD2 normalization rules over the whole corpus. These are simple replacement rules that replace a sequence of characters with another one either at the beginning, end or middle of a word. An example of such a rule is replacing "yes" with "ies" at the end of a word. We have also trained a statistical machine translation model (with Moses [7]) and a neural machine translation model (with OpenNMT [6]).
SMT has previously been used for the normalization task, for example in [11]. Both of the models are character-based, treating the known non-normalized to normalized word pairs as two languages for the translation model. The language model used for the SMT model is the British National Corpus (BNC) [1]. One more approach we have tried is to compare the non-normalized words to the ones in the BNC by Levenshtein edit distance [8]. This results in long lists of normalization candidates, which we filter further by semantic similarity: we compare the lists of words appearing immediately before and after the non-normalized word and each normalization candidate, and pick out the candidates with the largest number of shared contextual words. Finally, we filter this list by the edit distance of Soundex pronunciations. A similar method [2], relying on semantics and edit distance, has been used for normalization in the past. The Open Question The methods described above produce results of varying degrees of success. However, none of them is reliable enough to be trusted above the rest. We are now in a situation in which at least one of the approaches finds the correct normalization most of the time. The next unsolved question is how to pick the correct normalization from the list of alternatives in an accurate way. Once normalization has been solved, we face another problem, which is mapping words to the OED correctly. For example, currently the verb "to moon" is mapped to the noun "mooning" recorded in the OED, because it appeared in the present participle form in the corpus. This means that in the future we have to come up with ways to tackle not only the problem of homonyms, but also the problem of polysemy. A word might have acquired a new meaning in one of our letters, but we cannot detect this word as a neologism candidate, because the word has existed in the language in a different meaning before. References 1. The British National Corpus, version 3 (BNC XML Edition). Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium (2007), http://www.natcorp.ox.ac.uk/ 2. Amoia, M., Martinez, J.M.: Using comparable collections of historical texts for building a diachronic dictionary for spelling normalization. In: Proceedings of the 7th workshop on language technology for cultural heritage, social sciences, and humanities. pp. 84–89 (2013) 3. Baron, A., Rayson, P.: VARD2: a tool for dealing with spelling variation in historical corpora (2008) 4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly Media (2009) 5. Burns, P.R.: MorphAdorner v2: A Java library for the morphological adornment of English language texts. Northwestern University, Evanston, IL (2013) 6. Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.M.: OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints 7. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. pp. 177–180. Association for Computational Linguistics (2007) 8. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady. vol. 10, pp. 707–710 (1966) 9.
Nevalainen, T., Raumolin-Brunberg, H., Keränen, J., Nevala, M., Nurmi, A., Palander-Collin, M.: CEEC, Corpus of Early English Correspondence. Department of Modern Languages, University of Helsinki, http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/ 10. OED: OED Online. Oxford University Press, http://www.oed.com/ 11. Pettersson, E., Megyesi, B., Tiedemann, J.: An SMT approach to automatic annotation of historical text. In: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013; May 22-24; 2013; Oslo; Norway. NEALT Proceedings Series 18. pp. 54–69. No. 087, Linköping University Electronic Press (2013)
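As promised above, a minimal sketch of the neologism-candidate test; the attestation lookup is a hypothetical stand-in for the OED-derived data, and the real pipeline first normalizes and lemmatizes the corpus words:

# Hypothetical earliest-attestation lookup derived from the OED.
earliest_attestation = {"monotonous": 1776}

def neologism_candidates(letter_words, letter_year):
    """Yield (word, OED year) pairs where the letter antedates the OED."""
    for word in letter_words:
        first = earliest_attestation.get(word)
        if first is not None and letter_year < first:
            yield word, first

# Twining's 1774 letter antedates the OED's 1776 attestation of 'monotonous'.
print(list(neologism_candidates(["monotonous", "harpsichord"], 1774)))

Late-Breaking Work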
Triadic closure amplifies homophily in social networks 1Aalto University, Finland; 2Next Games, Finland Much of the structure in social networks can be explained by two seemingly separate network evolution mechanisms: triadic closure and homophily. While it is typical to analyse these mechanisms separately, empirical studies suggest that their dynamic interplay can be responsible for the striking homophily patterns seen in real social networks. By defining a network model with a tunable amount of homophily and triadic closure, we find that their interplay produces a myriad of effects, such as the amplification of latent homophily and memory in social networks (hysteresis). We use empirical network datasets to estimate how much of the observed homophily could actually be an amplification induced by triadic closure, and whether the networks have reached a stable state in terms of their homophily. Beyond their role in characterizing the origins of homophily, our results may be useful in determining the processes by which structural constraints and personal preferences determine the shape and evolution of society. |
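A toy simulation (an illustration of the mechanism, not the authors' actual model) of how triadic closure can amplify a homophilous tie-formation bias; assumes networkx, and all parameters are arbitrary:

import random
import networkx as nx

def evolve(n=200, steps=5000, p_triadic=0.8, s=0.7, seed=0):
    """Grow ties by triadic closure or random meeting, with homophily bias s."""
    rng = random.Random(seed)
    g = nx.gnm_random_graph(n, 2 * n, seed=seed)
    attr = {v: rng.randint(0, 1) for v in g}  # binary node attribute
    for _ in range(steps):
        u = rng.randrange(n)
        nbrs = list(g[u])
        if nbrs and rng.random() < p_triadic:
            w = rng.choice(list(g[rng.choice(nbrs)]))  # friend of a friend
        else:
            w = rng.randrange(n)  # random encounter
        # Homophily: same-attribute ties are accepted with probability s.
        accept = s if attr[u] == attr[w] else 1 - s
        if w != u and rng.random() < accept:
            g.add_edge(u, w)
    same = sum(attr[a] == attr[b] for a, b in g.edges())
    return same / g.number_of_edges()  # fraction of same-attribute ties

# With p_triadic high, the same-attribute fraction typically exceeds the level
# produced by the bias s alone: closure feeds on already-homophilous ties.
print(evolve())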
2:30pm - 3:45pm | Plenary 4: Frans Mäyrä Session Chair: Eetu Mäkelä Game Culture Studies as Multidisciplinary (Digital) Cultural Studies. Watchable also remotely from PII, PIV and P674. |
Think Corner | |
4:00pm - 5:30pm | F-TC-2: Games as Culture Session Chair: Frans Mäyrä |
Think Corner | |
|
4:00pm - 4:15pm
Short Paper (10+5min) [abstract] The Science of Sub-creation: Transmedial World Building in Fantasy-Based MMORPGs University of Waterloo, The Games Institute, First Person Scholar My paper examines how virtual communities are created by fandoms in massively multi-player online role-playing games and it explores what kinds of self-construction emerge in these digital locales and how such self-construction reciprocally affects the living culture of the game. I assert that the universe of a fantasy-based MMORPG necessitates participatory culture: experiencing the story means participating in the culture of the story’s world; these experiences reciprocally affect the living culture of the game’s universe. The participation and investment of readers, viewers, and players in this world constitute what Carolyn Marvin calls a textual community or a group that “organize[s] around a presumptively shared, but distinctly practiced, epistemology of texts and interpretive procedures” (12). In other words, the textual community produces a shared discourse, one that informs and interrogates what it means to be a fan in both analogue and digital environments. My paper uses J.R.R. Tolkien’s Middle-earth as a case study to explore the creation and continuation of a fantastic universe, in this case Middle-earth, across mediums: a transmedial creation informed by its textual community. Building on the work of Mark J.P. Wolf, Colin B. Harvey, Celia Pearce, Matthew P. Miller, and Edward Castronova, my work reveals that the “worldness” of a transmedia universe, or the degree to which it exists as a complete and consistent cosmos, plays a core role in the production, acceptance, and continuation of its ontology among and across the fan communities respective to the mediums in which it operates. My paper argues that Tolkien’s literary texts and these associated adaptations are multi-participant sites in which participants negotiate their sense of self within a larger textual community. These multi-participant sites form the basis from which to investigate the larger social implications of selfhood and fan participation. My theoretical framework provides the means by which to situate the critical aesthetics relative to how this fictional universe draws participants in. Engaging with Gordon Calleja’s discussions on immersion and Luis O. Arata’s thoughts on interactivity, I demonstrate how the transmedial storyworld of Middle-earth not only constructs a sense of space but that it is precisely this sense of space that engages the reader, viewer or gamer. To situate the sense of self incurred between and because of narrative and storyworld environment, I draw from Andreas Gregersen’s work on embodiment and interface, as well as from Shawn P. Wilbur’s work on identity in virtual communities. Anne Balsamo and Rebecca Borgstrom each offer a theorization of the role-playing specific to the multiplayer environments of game-based adaptations, while William H. Huber’s work contextualizes the production of space in epic fantasy narratives. Together, my theoretical framework highlights how the spread of a transmedial fantastic narrative impacts the connection patterns across the textual community of a particular storyworld, as well as foregrounds how the narrative environment shapes the degree of participant engagement in and with the space of that storyworld. This proposal is for a long paper presentation; however, I'm able to condense if necessary to fit a short paper presentation. 4:15pm - 4:30pm
Distinguished Short Paper (10+5min) [abstract] Layers of History in Digital Games University of Helsinki. The past five years have seen a huge increase in historical game studies. Quite a few texts have tried to approach how history is presented and used in games, considering everything from philosophical points to more practical views related to historical culture and the many manifestations of heritage politics. The popularity of recent games like Assassin’s Creed, The Witcher and Elder Scrolls also demonstrates the current importance of deconstructing the messages and choices these games present. Their impact on the modern understanding of history, and on the general idea of time and change, is yet to be seen in its full effect. The paper at hand attempts to structure the many layers, or horizons, of historicity in digital games into a single taxonomic system for researchers. The suggested system considers the various consciousnesses of time and the narrative models that modern games work with. Several distinct horizons of time, both of design and of the related real life, are interwoven to form the end product. The field of historical game studies could find this tool quite useful in its urgent need to systematize how digital culture is reshaping our minds and pasts. The model considers aspects such as memory culture, uses of period art and apocalyptic events, narrative structures, in-game events and real-world discourses as parts of how a perception of time and history is created or adapted. The suggested “layering of time” is applicable to a wide range of digital games.
4:30pm - 4:45pm
Short Paper (10+5min) [abstract] Critical Play, Hybrid Design and the Performance of Cultural Heritage Game/Stories University of Skövde. In my talk, I propose to discuss the critical relationship between games designed and developed for cultural heritage and emergent Digital Humanities (DH) initiatives that focus on (re-)inscribing and reflecting on the shifting boundaries of human agency and its attendant relations. In particular, I will highlight theoretical and practical humanistic models (for development and as objects of scholarly research) that are conceived in tension with more computational emphases and influences. I examine how digital heritage games move us from an understanding of digital humanities as a “tool”- or “text”-oriented discipline to one where we identify critical practices that actively engage and promote convergent, hybrid and ontologically complex techno-human subjects to enrich our field of inquiry as DH scholars. Drawing on principles such as embodiment, affect, and performativity, and analyzing transmedial storytelling and mixed reality games designed for heritage settings (and developed in my university research group), I argue for these games as an exemplary medium for enriching interdisciplinary digital humanities practices using methods currently called upon by recent DH scholarship. In these fully hybrid contexts, where human/technology boundaries are richly intermingled, we recognize the importance of theoretical approaches to interpretation that are performative, not mechanistic (Drucker, in Gold, 2012): that is, we look at emergent experiences driven by human intervention, not affirmed by technological development and technical interface affordances. Such hybridity, driven by human/humanities approaches, is explored more fully, for example, in Digital_Humanities by Burdick et al. (2012) and by N. Katherine Hayles in How We Think: Digital Media and Contemporary Technogenesis (2012). Collectively these scholars reveal how transformative and emerging disciplines can work together to rethink the role of the organic-technical beings at the center (and found at the margins and in-between subjectivities) within new forward-thinking DH studies. Currently, Hayles and others, like Matthew Gold (2012), offer frameworks for more interdisciplinary Digital Humanities methods (including Comparative Media and Culture Studies approaches) that are richly informed by investigations into the changing role and function of the user of technologies and media and the human/social contexts for use. Hayles, for example, explicitly claims that in Digital Humanities humans “think through, with, and alongside media” (1). In essence, our thinking and being, our digitization and our human-ness, are mutually productive and intertwined. Furthermore, we are multisensory in our access to knowing, and we develop an understanding of the physical world in new ways that reorient our agencies and affects, redistributing them for other encounters with cultural and digital/material objects that are now ubiquitous and normalized. The museum studies scholar Ross Parry supports a similar model for inquiry and future advancement, based on the premise that digital tool use is now fully implemented and accepted in museum contexts, so we must now deepen and develop our inquiries and practice (Parry, 2013). He claims that digital technologies have become normative in museums and that we now find ourselves in the age of the postdigital.
Here critical scrutiny is key and necessary to mark this advanced state of change. For Parry this is an opportune yet delicate juncture that requires a radical deepening of our understanding of the museum’s relationship to digital tools: “Postdigitality in the museum necessitates a rethinking of upon what museological and digital heritage research is predicated and on how its inquiry progresses. Plainly put, we have a space now (a duty even) to reframe our intellectual inquiry of digital in the museum to accommodate the postdigital condition.” [Parry, 36] For Parry, as with current DH calls for development, we must now focus on the contextualized practices in which these technologies will inevitably engage designers and users, and promote robust theoretical and practical applications. I argue that games, and in particular digital games designed for heritage experiences, are unique training grounds for such postdigital future development. They provide rich contexts for DH scholars working to deepen their understanding of performative and active interventions and intra-actions beyond texts and tools. As digital games have been adopted and ubiquitously assimilated in museums and heritage sites, we have opportunities to study the experiences of users as they performatively engage postdigital museum sites through rich forms of hybrid play. In such games, nuanced forms of interdisciplinary communication and storytelling happen in deeply integrated and embedded user/technology relationships. In heritage settings, interpretation is key to understanding histories from multiple user-driven perspectives, and it happens in acts of dynamic emergence, not as the result of mechanistic affordance. As such, DH designers and developers have much to learn from a rich body of games and heritage research, particularly work focused on critical and rhetorical design for play, Mixed Reality (MR) approaches, and users’ bodies as integral to narrative design (Anderson et al., 2010; Bogost, 2010; Flanagan, 2013; Mortara et al., 2014; Rouse et al., 2015; Sicart, 2011). MR provides a uniquely layered approach working across physical and digital artifacts and spaces, encouraging polysemic experiences that can support curators’ and historians’ desires to tell ever more complex and connected stories for museum and heritage site visitors, even involving visitors’ own voices in new ways. In combination, critical game design approaches and MR technologies, within the museum context, help re-center historical experience on the visitor’s body, voice, and agency, shifting emphasis away from material objects, often treated as static texts or sites of one-way, broadcast information. Re-centering the design on users’ embodied experience, with critical play in mind and in MR settings, offers rich scholarship for DH studies and provides a variety of heritage, museum, entertainment, and participatory design examples to enrich the field of study for open, future-oriented and forward thinking. Drawing on examples from heritage games developed within my university research group and in the heritage design network I co-founded, and implemented in museum and heritage sites, I will work to expose these connections.
From transmedial children’s books focused on Nordic folktales, to playful AR experiences that expose the history of architectural achievements, as well as meta-reflections on the telling of those achievements in archival documentation (such as the development of the Brooklyn Bridge in the 19th century), I will provide an overview of how digital heritage games, in combination with new hybrid DH initiatives, can be used for future development and research. This includes research around new digital literacies, collaborative and co-design approaches (with users), and experimental storytelling and narrative approaches for locative engagement in open-world settings, dependent on input from user/visitors.
References
Anderson, E. F., McLoughlin, L., Liarokapis, F., Peters, C., Petridis, P., de Freitas, S. Developing Serious Games for Cultural Heritage: A State-of-the-Art Review. In: Virtual Reality 14 (4). (2010)
Bogost, I. Persuasive Games: The Expressive Power of Videogames. MIT Press, Cambridge, MA (2010)
Burdick, A., Drucker, J., Lunenfeld, P., Presner, T., Schnapp, J. Digital_Humanities. MIT Press, Cambridge, MA (2012)
Flanagan, M. Critical Play: Radical Game Design. MIT Press, Cambridge, MA (2013)
Gold, M. K. Debates in the Digital Humanities. University of Minnesota Press, Minneapolis, MN (2012)
Hayles, N. K. How We Think: Digital Media and Contemporary Technogenesis. University of Chicago Press, Chicago, IL (2012)
Mortara, M., Catalano, C. E., Bellotti, F., Fiucci, G., Houry-Panchetti, M., Petridis, P. Learning Cultural Heritage by Serious Games. In: Journal of Cultural Heritage, vol. 15, no. 3, pp. 318-325. (2014)
Parry, R. The End of the Beginning: Normativity in the Postdigital Museum. In: Museum Worlds: Advances in Research, vol. 1, pp. 24-39. Berghahn Books (2013)
Rouse, R., Engberg, M., JafariNaimi, N., Bolter, J. D. (Guest Eds.) Special Section: Understanding Mixed Reality. In: Digital Creativity, vol. 26, issue 3-4, pp. 175-227. (2015)
Sicart, M. The Ethics of Computer Games. MIT Press, Cambridge, MA (2011)
4:45pm - 5:00pm
Short Paper (10+5min) [publication ready] Researching Let’s Play gaming videos as gamevironments University of Helsinki. Let’s Plays, as a specific form of gaming videos, are a rather new phenomenon, so it is not surprising that they are still relatively under-researched; so far, only a few publications focus on the theme. The specifics of Let’s Play gaming videos make them an unparalleled object of research in the vicinity of games – in the so-called gamevironments. The theoretical and methodological approach of the same name, literally merging the terms “games/gaming” and “environments”, is first discussed by Radde-Antweiler, Waltemathe and Zeiler (2014), who argue for broadening the study of video games, gaming and culture beyond media-centred approaches in order to better highlight recipient perspectives and actor-centred research. Gamevironments thus puts the spotlight on actors in their mediatized – and specifically gametized – lives.
5:00pm - 5:15pm
Short Paper (10+5min) [abstract] The plague transformed: City of Hunger as mutation of narrative and form Ocean County College, United States of America. This short paper proposes and argues the hypothesis that Minna Sundberg’s interactive game in development, City of Hunger, an offshoot or spin-off of her well-respected digital comic Stand Still Stay Silent, can be understood, within the ecology of the comic, as a mutation of it; as such, her appropriation of a classic game genre and her storyline’s emphasis on the mechanical over the natural suggest promising avenues for understanding the uses of interactivity in the interpretation of narrative. In the game, the plague-illness of the comic’s ecology may or may not be gone, but conflict (vs. cooperation) becomes the primary mode of interaction for characters and reader-players alike. In order to produce the narrative, the reader-player will have to do battle as the characters do. Sundberg herself signals that her new genre is indivisible from the different ecology of the game world’s narrative: “City of Hunger will be a 2d narrative rpg with a turn-based battle system, mechanically inspired by your older final fantasy games, the Tales of-series and similar classical rpg's.” There will be a world of “rogue humans, mechanoids and mysterious alien beings to fight” (2017). While it remains to be seen how the game develops, its emphasis on machine-beings and aliens in a classic game environment (a “shadow of the past”) strongly suggests that the use of interactivity within each narrative has an interpretive and not merely performative dimension.
5:15pm - 5:30pm
Short Paper (10+5min) [abstract] Names as a Part of Game Design University of Helsinki. Video games often consist of several separate spaces of play. Depending on the speaker and the type of the game, these are called, for example, levels, maps, tracks or worlds; in this paper, the term level is used. As there are usually many levels in a game, they need some kind of identifying element. In some games, levels only have ordinal numbers (Level 1, Level 2, etc.), but in others, they (also) have names. Names are an important part of game design for at least three reasons. Firstly, giving names to places makes the imaginary world feel richer and deeper (Schell 2014: 351), improving the gameplay experience. Secondly, a name gives the player a first impression of the level (Rogers 2014: 220), helping him/her to perceive the level’s structure. And thirdly, level names are needed for discussing the levels: members of a gaming community often want to share their experiences and emotions of the gameplay, and when doing so, it is important to contextualize the events: in which level did X happen? Even though some game design scholars seem to recognize the importance of names, there are very few studies of them. This presentation aims to fill this gap. I have analyzed level names in Playforia Minigolf, an online minigolf game designed in Finland in 2002. The data include the names of all 2,072 levels in the game. The analysis focuses especially on the principles of naming, or in other words, on what kind of connection there is between the name and the level’s characteristics. The presentation also examines the change of naming practices during the game’s 15-year history. The oldest names mostly describe the levels in a simple, neutral manner, while the newest names are far more ambiguous and rarely have anything to do with the level’s characteristics. This change is probably caused by the change of level designers: the first levels of the game were designed by its developers, game design professionals, but over time, the responsibility for designing levels has passed to the most passionate hobbyists of the game. This result might be interesting for game studies and especially for research on modding and modifications (see e.g. Unger 2012).
REFERENCES
Playforia (2002). Minigolf. Finland: Apaja Creative Solutions Oy.
Rogers, Scott (2014). Level Up! The Guide to Great Video Game Design. Chichester: Wiley.
Schell, Jesse (2014). The Art of Game Design: A Book of Lenses. CRC Press.
Unger, Alexander (2012). Modding as a Part of Gaming Culture. In: Fromme, Johannes & Alexander Unger (eds.): Computer Games and New Media Cultures: A Handbook of Digital Games Studies, 509–523. |