Digital Humanities in the Nordic Countries 3rd Conference

Poster [abstract]

Sharing letters and art as digital cultural heritage, co-operation and basic research

Maria Elisabeth Stubb

Svenska litteratursällskapet i Finland

Albert Edelfelts brev (edelfelt.fi) is a web publication developed at the Society of Swedish Literature in Finland. In co-operation with the Finnish National Gallery, we publish letters of the Finnish artist Albert Edelfelt (1854–1905) combined with pictures of his artworks. In 2016, Albert Edelfelts brev received the State Award for dissemination of information. The co-operation between institutions and basic research on the material have enabled a unique reconstruction of Edelfelt’s artistry and his time, in the service of researchers and other users. I will present how we have done this and how we plan to develop the website further.

The website Albert Edelfelts brev was launched in September 2014 with a sample of Edelfelt’s letters and paintings. Our intention is to publish all the letters Albert Edelfelt wrote to his mother Alexandra (1833–1901). The collection consists of 1,310 letters that span 30 years and cover most of Edelfelt’s adult life. The letters are in the care of the Society of Swedish Literature in Finland. We also have at our disposal close to 7,000 pictures of Edelfelt’s paintings and sketches in the care of the Finnish National Gallery.

In the context of digital humanities, the volume of the material at hand is manageable. However, for researchers who think the material might be useful to them but are unsure exactly where or what to look for, going through all the letters and pictures can be labour-intensive. We have combined professional expertise and basic research on the material with digital solutions to make it as easy as possible to access what the content can offer.

As editor of the web publication, I spend a considerable part of my working time on basic research: identifying people and pinpointing the paintings and places that Edelfelt mentions in his letters. Because the content of a letter is linked to artworks, persons, places and subjects (reference words), users can easily navigate the material. Each letter, artwork and person has a page of its own. Places and subjects are also searchable and listed.

The letters are available as facsimile pictures of the handwritten pages. Each letter has a permanent web resource identifier (URN:NBN). In order to make it easier for users to decide if a letter is of interest, we have tagged subjects using reference words from ALLÄRS (common thesaurus in Swedish). We have also written abstracts of the content, divided them into separate “events” and tagged mentioned artworks, people and places to these events.

Each artwork of Edelfelt has a page of its own. Here, users find a picture of the artwork (if available) and earlier sketches of the artwork (if available). By looking at the pictures, they can see how the working process of the painting has developed. Users can also follow the process through what Edelfelt writes in his letters. All the events from the letter abstracts that are tagged to the specific artwork are listed in chronological order on the artwork-page.

Persons tagged in the letter abstracts also have pages of their own. On a person-page, users find basic facts and links to other webpages with information about the person. Any events from the letter abstracts mentioning the person are listed as well. In other words, with a single click users get an overview of everything Edelfelt’s letters have to say about a specific person. Tagging persons to events has also made it possible to build graphs of a person’s social network, based on how many times other persons are tagged to the same events as the specific person. There is a link to these graphs on every person-page.
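The co-occurrence counting behind such a network graph can be sketched as follows. This is a minimal illustration and not the website's actual implementation: the event records, the field names and the placeholder person names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical events: each event from a letter abstract lists the persons tagged to it.
events = [
    {"letter": "letter-1", "persons": ["Person A", "Person B", "Person C"]},
    {"letter": "letter-2", "persons": ["Person A", "Person B"]},
    {"letter": "letter-3", "persons": ["Person B", "Person C"]},
    {"letter": "letter-4", "persons": ["Person A"]},
]


def co_occurrence_counts(events):
    """Count how often each pair of persons is tagged to the same event."""
    counts = Counter()
    for event in events:
        persons = sorted(set(event["persons"]))  # ignore duplicate tags within one event
        for a, b in combinations(persons, 2):
            counts[(a, b)] += 1
    return counts


def ego_network(counts, person):
    """Return (neighbour, weight) pairs for one person, strongest ties first."""
    neighbours = Counter()
    for (a, b), n in counts.items():
        if a == person:
            neighbours[b] += n
        elif b == person:
            neighbours[a] += n
    return neighbours.most_common()


if __name__ == "__main__":
    print(ego_network(co_occurrence_counts(events), "Person A"))
```

The edge weights returned by `ego_network` could then be passed to any graph-drawing library to produce the kind of graph linked from a person-page.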

Apart from researchers who have a direct interest in the material, we have also wanted to open up the cultural heritage to a broader public and group of users. Each month the editorial staff writes a blog-post on SLS-bloggen (http://www.sls.fi/sv/blogg). Albert Edelfelts brev also has a profile on Facebook (https://www.facebook.com/albertedelfeltsbrev/) where we post excerpts of letters on the same date as Edelfelt wrote the original letter. By doing so we hope to give the public insight into the life of Edelfelt and the material, and to involve them in the progress of the project.

The web publication has open access. The mix of different sources and the co-operation with other heritage institutions have led to a mix of licenses governing how users can copy and redistribute the published material. The Finnish National Gallery (FNG) owns the copyright on its pictures in the publication, and users have to get permission from FNG to copy and redistribute that material. The artwork-pages contain descriptions of the paintings written by the art historian Bertel Hintze, who published a catalogue of Edelfelt’s art in 1942. These texts are licensed under Creative Commons Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0). Edelfelt’s letters as well as the texts and metadata produced by the editorial staff at the Society of Swedish Literature in Finland are released under the Creative Commons CC0 1.0 Universal license. Data under a Creative Commons license is also freely available as open data through a REST API (http://edelfelt.sls.fi/apiinfo/).

In the future, we would like to find a common practice for the user rights, if possible so that all the material would have the same license. We intend to invite other institutions holding artworks by Edelfelt to co-operate, offering the same kind of partnership as the web publication has with the Finnish National Gallery. In this way, we are striving for as complete a site as possible for the artworks of Edelfelt.

Albert Edelfelt is of national interest and his letters, which he mostly wrote during his stays abroad, contain information of international interest. Therefore, we plan to offer the metadata and at least some of the source material in Finnish and English translations. So far, the letters are only available as facsimiles. The development of transcription programs for handwritten texts has made it likely that we will be able to include transcriptions of the letters in the web publication in the future. Linguists especially have an interest in searchable letter transcriptions for their research, and the transcriptions would also help users who have problems reading the handwritten text.

Poster [abstract]

Metadata Analysis and Text Reuse Detection: Reassessing public discourse in Finland through newspapers and journals 1771–1917

Filip Ginter¹, Antti Kanner², Leo Lahti¹, Jani Marjanen², Eetu Mäkelä², Asko Nivala¹, Heli Rantala¹, Hannu Salmi¹, Reetta Sippola¹, Mikko Tolonen², Ville Vaara², Aleksi Vesanto²

¹University of Turku; ²University of Helsinki

During the period 1771–1917 newspapers developed as a mass medium in the Grand Duchy of Finland. This happened within two different imperial configurations (Sweden until 1809 and Russia 1809–1917) and in two main languages – Swedish and Finnish. The Computational History and the Transformation of Public Discourse in Finland, 1640–1910 (COMHIS) project studies the transformation of public discourse in Europe and in Finland via an innovative combination of original data, state-of-the-art quantitative methods that have not been previously applied in this context, and an open source collaboration model.

In this study the project combines the statistical analysis of newspaper metadata and the analysis of text reuse within the papers to trace the expansion of and exchange in Finnish newspapers published in the long nineteenth century. The analysis is based on the metadata and content of digitized Finnish newspapers published by the National Library of Finland. The dataset includes the full text of all newspapers and most periodicals published in Finland between 1771 and 1920. The analysis of metadata builds on data harmonization and enrichment by extracting information on columns, type sets, publication frequencies and circulation records from the full-text files or outside sources. Our analysis of text reuse is based on a modified version of the Basic Local Alignment Search Tool (BLAST) algorithm, which can detect similar sequences and was initially developed for fast alignment of biomolecular sequences, such as DNA chains. We have further modified the algorithm in order to identify text reuse patterns. BLAST is robust to deviations in the text content, and as such is able to effectively circumvent errors or differences arising from optical character recognition (OCR).
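The seed-and-extend idea that BLAST-style tools rely on can be illustrated with a toy sketch. It is not the project's modified BLAST pipeline: the function names, the seed length `k`, the mismatch budget and the ungapped extension are all simplifying assumptions chosen for the example.

```python
def find_seeds(query, target, k=8):
    """Return (query_pos, target_pos) pairs where a k-character seed matches exactly."""
    index = {}
    for i in range(len(target) - k + 1):
        index.setdefault(target[i:i + k], []).append(i)
    return [(q, t)
            for q in range(len(query) - k + 1)
            for t in index.get(query[q:q + k], [])]


def extend_right(query, target, q, t, max_mismatches):
    """Length of an ungapped extension from query[q] / target[t], tolerating a few mismatches."""
    length = best = mismatches = 0
    while q + length < len(query) and t + length < len(target):
        if query[q + length] != target[t + length]:
            mismatches += 1
            if mismatches > max_mismatches:
                break
        else:
            best = length + 1  # an extension may only end on a matching character
        length += 1
    return best


def reused_passages(query, target, k=8, max_mismatches=2, min_length=20):
    """Return sorted (query_start, target_start, length) triples for likely reused passages."""
    hits = set()
    for q, t in find_seeds(query, target, k):
        right = extend_right(query, target, q, t, max_mismatches)
        # Extend to the left by running the same routine on the reversed prefixes.
        left = extend_right(query[:q][::-1], target[:t][::-1], 0, 0, max_mismatches)
        if left + right >= min_length:
            hits.add((q - left, t - left, left + right))
    return sorted(hits)


if __name__ == "__main__":
    passage = " the quick brown fox jumps over the lazy dog "
    noisy = passage.replace("brown", "brcwn")  # simulated OCR error
    print(reused_passages("AAAA" + passage + "BBBB", "CCCC" + noisy + "DDDD"))
```

Because the extension tolerates a small number of mismatched characters, a single OCR substitution inside a reprinted passage does not split the match. Real BLAST additionally scores alignments and handles gaps, which this sketch omits.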

By relating metadata on publication places, language, number of issues, number of words, size of papers, and publishers and comparing that to the existing scholarship on newspaper history and censorship, the study provides a more accurate bird’s-eye view of newspaper publishing in Finland after 1771. By pinpointing key moments in the development of journalism the study suggests that while the discussions in the public sphere were inherently bilingual, the technological and journalistic developments advanced at different speeds in Swedish and Finnish language forums. It further assesses the development of the press in comparison with book production and periodicals, pointing towards a specialization of newspapers as a medium in the period after 1860. Of special interest is that the growth and specialization of the newspaper medium was much indebted to newspapers being established all over the country and thus becoming forums for local debates.

The existence of a medium encompassing the whole country was crucial to the birth of a national imaginary. Yet, the national public sphere was not without regional intellectual asymmetries. This study traces these asymmetries by analysing text reuse in the whole newspaper corpus. It shows which papers and which cities functioned as “senders” and “receivers” in the public discourse in this period. It is furthermore essential to note that newspapers and periodicals had several functions throughout the period, and the role of the public sphere cannot be taken for granted. The analysis of text reuse further paints a picture of virality in newspaper publishing that was indicative of modern journalistic practices but also reveals the rapidly expanding capacity of the press. These findings can be further contrasted with other items commonly associated with the birth of modern journalism, such as publication frequency, page sizes and typesetting of the papers.
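One simple way to derive “senders” and “receivers” from text reuse clusters is sketched below. The abstract does not state how the roles are assigned, so the rule used here (the earliest printing in a cluster is the sender, every later printing is a receiver) is an assumption, and the cluster records are invented toy data rather than project results.

```python
from collections import Counter
from datetime import date

# Invented toy clusters: each cluster groups printings of the same reused passage.
clusters = [
    [
        {"city": "City A", "date": date(1880, 1, 3)},
        {"city": "City B", "date": date(1880, 1, 10)},
        {"city": "City C", "date": date(1880, 1, 12)},
    ],
    [
        {"city": "City B", "date": date(1881, 5, 2)},
        {"city": "City A", "date": date(1881, 5, 1)},
    ],
]


def sender_receiver_counts(clusters):
    """Count how often each city prints a passage first (sender) or later (receiver)."""
    senders, receivers = Counter(), Counter()
    for cluster in clusters:
        ordered = sorted(cluster, key=lambda hit: hit["date"])
        senders[ordered[0]["city"]] += 1      # assumption: earliest printing is the source
        for hit in ordered[1:]:
            receivers[hit["city"]] += 1
    return senders, receivers


if __name__ == "__main__":
    senders, receivers = sender_receiver_counts(clusters)
    print("senders:", dict(senders))
    print("receivers:", dict(receivers))
```

Comparing the two counters per city or per paper gives a rough measure of the asymmetries discussed above; a fuller analysis would also have to account for shared external sources and missing issues.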

All algorithms, software, and the text reuse database will be made openly available online, and can be located through the project’s repositories (https://comhis.github.io/ and https://github.com/avjves/textreuse-blast). The results of the text reuse detection carried out in BLAST are stored in a database and will also be made available for other researchers to explore.

Poster [abstract]

Oceanic Exchanges: Tracing Global Information Networks In Historical Newspaper Repositories, 1840–1914

Hannu Salmi, Mila Oiva, Asko Nivala, Otto Latva

University of Turku

Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840–1914 (OcEx) is an international and interdisciplinary project funded by the Digging into Data – Transatlantic Platform. It studies the global spread of news in nineteenth-century newspapers. The project combines digitized newspapers from Europe, the US, Mexico, Australia, New Zealand, and the British and Dutch colonies of that time all over the world.

The project examines patterns of information flow, spread of text reuse, and global conceptual changes across national, cultural and linguistic boundaries in nineteenth-century newspapers. The project links the different newspaper corpora, which are scattered across different national libraries and collections, use various kinds of metadata and are printed in several languages, into one whole.

The project proposes to present a poster in the Nordic Digital Humanities Conference 2018. The project started in June 2017, and the aim of the poster is to present the current status of the project.

The research group members come from Finland, the US, the Netherlands, Germany, Mexico, and the UK. OcEx’s participating institutions are Loughborough University, Northeastern University, North Carolina State University, Universität Stuttgart, Universidad Nacional Autónoma de México, University College London, University of Nebraska-Lincoln, University of Turku, and Utrecht University. The project’s 90 million newspaper pages come from Australia’s Trove Newspapers, the British Newspaper Archive, Chronicling America (US), Europeana Newspapers, Hemeroteca Nacional Digital de México, National Library of Finland, National Library of the Netherlands (KB), the National Library of Wales, New Zealand’s PapersPast, and a strategic collaboration with Cengage Publishing, one of the leading commercial custodians of digitized newspapers.

Objectives

Our team will hone computational tools, some developed in prior research by project partners and some novel, into a suite of openly available tools, data, and analyses that trace a broad range of language-related phenomena (including text reuse, translational shifts, and discursive changes). Analysing such parameters enables us to characterize “reception cultures,” “dissemination cultures,” and “reference cultures” in terms of asymmetrical flow patterns, or to analyse the relationships between reporting targeted at immigrant communities and their surrounding host countries.

OcEx will leverage existing relationships and agreements between its teams and data providers to connect disparate digital newspaper collections, opening new questions about historical globalism and modeling consortial approaches to transnational newspaper research. OcEx will take up challenging questions of historical information flow, including:

1. Which stories spread between nations and how quickly?

2. Which texts were translated and resonated across languages?

3. How did textual copying (reprinting) operate internationally compared to conceptual copying (idea spread)?

4. How did the migration of texts facilitate the circulation of knowledge, ideas, and concepts, and how were these ideas transformed as they moved from one Atlantic context to another?

5. How did geopolitical realities (e.g. economic integration, technology, migration, geopolitical power) influence the directionality of these transnational exchanges?

6. How does reporting in immigrant and ethnic communities differ from reporting in surrounding host countries?

7. Does the national organization of digitized newspaper archives artificially foreclose globally-oriented research questions and outcomes?

Methodology

OcEx will develop a semantic interoperable knowledge structure, or ontology, for expressing thematic and textual connections among historical newspaper archives. Even with standards in place, digitization projects pursue differing approaches that pose challenges to integration or particular levels of analysis. In most, for instance, generic identification of items within newspapers has not been pursued. In order to build an ontology, this project will build on knowledge acquired by participating academic partners, such as the project TimeCapsule at Utrecht University, as well as analytical software that has been tested and used by team members, such as viral text analysis. OcEx does not aim to create a totalizing research infrastructure but rather to expose the conditions by which researchers can work across collections, helping guide similar projects in future seeking to bridge national collections. This ontology will be established through comparative investigations of phenomena illustrating textual links: reprinting and topic dissemination. We have divided the tasks into six work packages:

WP1: Management

➢ create an international network of researchers to discuss issues of using and accessing newspaper repository data and combine expertise toward better development and management of such data;

➢ assemble a project advisory board, consisting of representatives of public and private data custodians and other critical stakeholders.

WP2: Assessment of Data and Metadata

➢ investigate and develop classifier models of the visual features of newspaper content and genres;

➢ create a corpus of annotations on clusters/passages that records relationships among textual versions.

WP3: Creating a Networked Ontology for Research

➢ create an ontology of genres, forms, and elements of texts to support that annotation;

➢ select and develop best practices based on available technology (semantic web standard RDF, linked data, SKOS, XML markup standards such as TEI).

WP4: Textual Migration and Viral Texts

➢ analyze text reuse across archives using statistical language models to detect clusters of reprinted passages;

➢ perform analyses of aggregate information flows within and across countries, regions, and publications;

➢ develop adaptive visualization methods for results.

WP5: Conceptual Migration and Translation Shifts

➢ perform scalable multilingual topic model inference across corpora to discern translations, shared topics, topic shifts, and concept drift within and across languages, using distributional analysis and (hierarchical) polylingual topic models;

➢ analyze migration and translation of ideas over regional and linguistic borders;

➢ develop adaptive visualization methods for the results.

WP6: Tools of Delivery/Dissemination

➢ validate test results in scholarly contexts/test sessions at academic institutions;

➢ conduct analysis of the sensitivity of results to the availability of corpora in different languages and levels of access;

➢ share findings (data structures/availability/compatibility, user experiences) with institutional partners;

➢ package code, annotated data (where possible), and ontology for public release.

Poster [abstract]

ArchiMob: A multidialectal corpus of Swiss German oral history interviews

Yves Scherrer¹, Tanja Samardžić²

¹University of Helsinki, Department of Digital Humanities; ²University of Zurich, CorpusLab, URPP Language and Space

Although dialect usage is prevalent in the German-speaking part of Switzerland, digital resources for dialectological and computational linguistic research are difficult to obtain. In this paper, we present a freely available corpus of spontaneous speech in various Swiss German dialects. It consists of transcriptions of video interviews with contemporary witnesses of the Second World War period in Switzerland. These recordings were produced by an association of Swiss historians called Archimob about 20 years ago. More than 500 informants stemming from all linguistic regions of Switzerland (German, French and Italian) and representing both genders, different social backgrounds, and different political views were interviewed. Each interview is 1 to 2 hours long. In collaboration with the University of Zurich, we have selected, processed and analyzed a subset of 43 interviews in different Swiss German dialects.

The goal of this contribution is twofold. First, we describe how the documents were transcribed, segmented and aligned with the audio source and how we make the data available on specifically adapted corpus query engines. We also provide an additional normalization layer in order to reduce the different types of variation (dialectal, speaker-specific and transcriber-specific) present in the transcriptions. We formalize normalization as a machine translation task, obtaining up to 90% accuracy (Scherrer & Ljubešić 2016).

Second, we show through some examples how the ArchiMob resource can shed new light on research questions from digital humanities in general and dialectology and history in particular:

• Thanks to the normalization layer, dialect differences can be identified and compared with existing dialectological knowledge.

• Using language modelling, another technique borrowed from language technology, we can compute distances between texts. These distance measures allow us to identify the dialect of unknown utterances (Zampieri et al. 2017), localize transcriber effects and obtain a generic picture of the Swiss German dialect landscape.

• Departing from the purely formal analysis of the transcriptions for dialectological purposes, we can apply methods such as collocation analysis to investigate the content of the interviews. By identifying the key concepts and events referred to in the interviews, we can assess how the different informants perceive and describe the same time period.
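The language-modelling idea in the second point above can be sketched with a small character n-gram model. This is an illustration only and not the method of the cited work: the trigram order, the add-one smoothing, the function names and the placeholder training strings are assumptions, and real use would train on the ArchiMob transcriptions instead.

```python
import math
from collections import Counter


def train_char_lm(text, n=3):
    """Count character n-grams and their (n-1)-character contexts in a training text."""
    padded = " " * (n - 1) + text
    ngrams = Counter(padded[i:i + n] for i in range(len(text)))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(text)))
    return ngrams, contexts, set(text)


def cross_entropy(model, text, n=3):
    """Average bits per character of `text` under an add-one smoothed n-gram model."""
    ngrams, contexts, vocab = model
    padded = " " * (n - 1) + text
    vocab_size = len(vocab) + 1  # one extra slot for characters unseen in training
    total = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        probability = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
        total -= math.log2(probability)
    return total / len(text)


def identify(models, utterance):
    """Return the name of the model under which the utterance is least surprising."""
    return min(models, key=lambda name: cross_entropy(models[name], utterance))


if __name__ == "__main__":
    # Placeholder strings standing in for transcriptions from two dialect regions.
    models = {
        "dialect_1": train_char_lm("abababababab"),
        "dialect_2": train_char_lm("cdcdcdcdcdcd"),
    }
    print(identify(models, "ababab"))
```

The cross-entropy of one text under a model trained on another acts as an asymmetric distance between them. Computing it for every pair of interviews would give a distance matrix from which transcriber effects or dialect groupings could be explored.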

Poster [abstract]

Serious gaming to support stakeholder participation and analysis in Nordic climate adaptation research

Tina-Simone Neset¹, Sirkku Juhola², Therese Asplund¹, Janina Käyhkö², Carlo Navarra¹

¹Linköping University; ²Helsinki University

Introduction

While climate change adaptation research in the Nordic context has advanced significantly in recent years, we still lack a thorough discussion on maladaptation, i.e. the unintended negative outcomes of implemented adaptation measures. In order to identify and assess examples of maladaptation for the agricultural sector, we developed a novel methodology, integrating visualization, participatory methods and serious gaming. This enables research and policy analysis of trade-offs between mitigation and adaptation options, as well as between alternative adaptation options, with stakeholders in the agricultural sector. Stakeholders from the agricultural sector in Sweden and Finland have explored potential maladaptive outcomes of climate adaptation measures by means of a serious game on maladaptation in Nordic agriculture, and have discussed their relevance and the related trade-offs.

The Game

The Maladaptation Game is designed as a single player game. It is web-based and allows a moderator to collect the settings and results for each player involved in a session, store these for analysis, and display these results on a ‘moderator screen’. The game is designed for agricultural stakeholders in the Nordic countries, and requires some prior understanding of the challenges that climate change can impose on Nordic agriculture as well as the scope and function of adaptation measures to address these challenges.

The gameplay consists of four challenges, each involving multiple steps. At the start of the game, the player is equipped with a limited number of coins, which decreases with each measure that is selected. As such, the player has to consider the implications in terms of risk and potential negative effects of a selected measure as well as the cost of each measure. The player faces four climate-related challenges – increased precipitation, drought, increased occurrence of pests and weeds, and a prolonged growing season – that are all relevant to Nordic agriculture. The player selects one challenge at a time. Each challenge has to be addressed, and once a challenge has been concluded, the player cannot return and revise the selection.

When the player enters a challenge (e.g. precipitation), possible adaptation measures that can be taken to address this challenge in an agricultural context are displayed as illustrated cards on the game interface. Each card can be turned over to reveal more information, i.e. a descriptive text and the related costs. The player can explore all cards before selecting one. The selected adaptation measure then leads to a potential maladaptive outcome, which is again displayed as an illustrated card with an explanatory text on the back. The player has to decide whether to reject or accept this potential negative outcome. If the maladaptive outcome is rejected, the player returns to the previous view, where all adaptation measures for the current challenge are displayed, and can select another measure and again decide whether to accept or reject the potential negative outcome presented for it. In order to complete a challenge, one adaptation measure with the related negative outcome has to be accepted.

After completing a challenge, the player returns to the entry page, where, in addition to the overview of all challenges, a small scoreboard summarizes the selection made and displays the updated number of coins as well as a score of maladaptation points. These points represent the negative maladaptation score for the selected measures and are not known to the player before the decision is made.

The game continues until selections have been made for all four challenges. At the end of the game, the player has an updated scoreboard with three main elements: the summary of the selections made for each challenge, the remaining number of coins, and the total sum of the negative maladaptation score. The scoreboards of all players involved in a session now appear on the moderator screen. This setup allows the individual player to compare his or her pathways and results with those of other players. The key feature of the game is hence the stimulation of discussions and reflections concerning adaptation measures and their potential negative outcomes, both with regard to adding knowledge about adaptation measures and their impact and with regard to the threshold at which an outcome is considered maladaptive, i.e. what trade-offs are made within agricultural climate adaptation.
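The rules described above can be summarized in a small state model. This is a sketch under stated assumptions, not the code of the Maladaptation Game: the class and field names, the measure names, the costs and the point values are all hypothetical, and the check that a player cannot spend more coins than remain is an assumption the description does not confirm.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Measure:
    name: str
    cost: int                  # coins deducted when the measure is accepted
    maladaptation_points: int  # hidden from the player until the challenge is concluded


@dataclass
class Game:
    coins: int
    challenges: dict           # challenge name -> list of Measure cards
    selections: dict = field(default_factory=dict)
    maladaptation_score: int = 0

    def accept(self, challenge, measure_name):
        """Accept a measure together with its negative outcome, concluding the challenge.

        Rejecting an outcome changes no state: the player simply returns to the cards.
        """
        if challenge in self.selections:
            raise ValueError("challenge already concluded; selections cannot be revised")
        measure = next(m for m in self.challenges[challenge] if m.name == measure_name)
        if measure.cost > self.coins:  # assumption: coins may not go below zero
            raise ValueError("not enough coins for this measure")
        self.coins -= measure.cost
        self.maladaptation_score += measure.maladaptation_points
        self.selections[challenge] = measure.name

    def finished(self):
        """The game ends once every challenge has an accepted measure."""
        return set(self.selections) == set(self.challenges)

    def scoreboard(self):
        """The three elements shown to the player and, at the end, to the moderator."""
        return {
            "selections": dict(self.selections),
            "coins": self.coins,
            "maladaptation": self.maladaptation_score,
        }


if __name__ == "__main__":
    game = Game(coins=10, challenges={
        "drought": [Measure("measure 1", 3, 2), Measure("measure 2", 5, 0)],
        "pests and weeds": [Measure("measure 3", 4, 1)],
    })
    game.accept("drought", "measure 1")
    game.accept("pests and weeds", "measure 3")
    print(game.finished(), game.scoreboard())
```

A moderator view would then amount to collecting the `scoreboard()` dictionaries of all players in a session.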

Preliminary conclusions from the visualization supported gaming workshops

During autumn 2016, eight gaming workshops were held in Sweden and Finland. These workshops were designed as visualization supported focus groups, allowing for some general reflections, but also individual interaction with the web-based game. Stakeholders included farmers, agricultural extension officers, and representatives of branch organizations as well as agricultural authorities on the national and regional level. Focus group discussions were recorded and transcribed in order to analyze the empirical results with focus on agricultural adaptation and potential maladaptive outcomes.

Preliminary conclusions from these workshops point towards several issues that relate both to content and functionality of the game. While, as a general conclusion, the stakeholders were able to quickly get acquainted with the game and interact with it without major difficulties, a few individual participants were negative towards the general idea of engaging with a game to discuss these issues. The level of interactivity that the game allows, where players can test and explore before making a decision, enabled reflections and discussions also during the gameplay. Stakeholders frequently tested and returned to some of the possible choices before deciding on their final setting. Since the game demands the acceptance of a potential negative outcome, several stakeholders described their impression of the game as a ‘pest or cholera’ situation. In terms of empirical results, the workshops generated a large number of issues regarding the definition of maladaptive outcomes and their thresholds, in relation to contextual aspects, such as temporal and spatial scales, as well as reflections regarding the relevance and applicability of the proposed adaptation measures and negative outcomes.

Poster [abstract]

Challenges in textual criticism and editorial transparency

Elisa Johanna Veit, Pieter Claes, Per Stam

Svenska litteratursällskapet i Finland

Henry Parlands Skrifter (HPS) is a digital critical edition of the works and correspondence of the modernist author Henry Parland (1908–1930). The poster presents chosen strategies for communicating the results of the process of textual criticism in a digital environment. How can we make the foundations for editorial decisions transparent and easily accessible to a reader?

Textual criticism is, by one of several definitions, “the scientific study of a text with the intention of producing a reliable edition” (Nationalencyklopedin, “textkritik”; our translation). When possible, the texts of the HPS edition are based on original prints whose publication was initiated by the author during his lifetime. However, rendering a reliable text largely requires a return to original manuscripts, as only a fraction of Parland’s works were published before the author’s death at the age of 22 in 1930. Posthumous publications often lack reliability due to the editorial practices of later editors and their sometimes primarily aesthetic solutions to text problems.

The main structure of the Parland digital edition is related to Zacharias Topelius Skrifter (topelius.sls.fi) and similar editions (e.g. grundtvigsværker.dk). However, the Parland edition has foregone the system of a – theoretically – unlimited number of columns in favour of only two fields for text: a field for the reading text, which holds a central position on the webpage, and a smaller, optional field containing, in different tabs, editorial commentary, facsimiles and transcriptions of manuscripts and original prints. The benefit of this approach is easier navigation. If a reader wishes to view several fields at once, they may do so by using several browser windows, which is explained in the user’s guide.

The texts of the edition are transcribed in XML and encoded following TEI (Text Encoding Initiative) Guidelines P5. Manuscripts, or original prints, and edited reading texts are rendered in different files (see further below). All manuscripts and original prints used in the edition are presented as high-resolution facsimiles. The reader thus has access to the different versions of the text in full, as a complement to the editorial commentary.

Parland’s manuscripts often contain several layers of changes (additions, deletions, substitutions): those made by the author himself during the initial process of writing or during a later revision, and those made by posthumous editors selecting and preparing manuscripts for publication. The editor is thus required to analyse the manuscripts in order to include only changes made by the author in the text of the edition. The posthumous changes are included in the transcriptions of the manuscripts and encoded using the same TEI elements as the author’s changes with an addition of attributes indicating the other hand and pen (@hand and @medium). In the digital edition these changes, as well as other posthumous markings and notes, are displayed in a separate colour. A tooltip displays the identity of the other hand.

One of the benefits of this solution is transparency towards the reader through visualization of the editor’s interpretation of all sections of the manuscript. The use of standard TEI elements and attributes facilitates possible use of the XML-documents for purposes outside of the edition. For the Parland project, there were also practical benefits concerning technical solutions and workflow in using mark-up that had already, though to a somewhat smaller extent, been used by the Zacharias Topelius edition.

The downside to using the same elements for both authorial and posthumous changes is that the XML-file will not very easily lend itself to a visualization of the author’s version. Although this surely would not be impossible with an appropriately designed stylesheet, we have deemed it more practical to keep manuscripts and edited reading texts in separate files. All posthumous intervention and associated mark-up are removed from the edited text, which has the added practical benefit of making the XML-document more easily readable to a human editor. However, the information value of the separate files is more limited than that of a single file would be.
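To make the trade-off concrete, the following sketch shows how an author’s version could in principle be derived from a single transcription file in which authorial and posthumous changes share the same TEI elements. It is an illustration only and not part of the edition’s workflow: the sample fragment, the hand identifiers `#HP` and `#editor`, and the rule that changes without a hand attribute are authorial are all assumptions.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment: unmarked changes are the author's, hand="#editor" is posthumous.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">A <del>first</del><add>second</add> draft, '
    '<del hand="#editor">quite</del><add hand="#editor">very</add> short.</p>'
)


def author_text(element, author_hand="#HP"):
    """Reconstruct the author's wording from a TEI fragment.

    Additions by another hand are dropped and deletions by another hand are
    restored; the author's own additions are kept and deletions are dropped.
    """
    parts = [element.text or ""]
    for child in element:
        posthumous = child.get("hand", author_hand) != author_hand
        if child.tag == TEI + "add":
            keep = not posthumous
        elif child.tag == TEI + "del":
            keep = posthumous
        else:
            keep = True
        if keep:
            parts.append(author_text(child, author_hand))
        parts.append(child.tail or "")  # text following the element always survives
    return "".join(parts)


if __name__ == "__main__":
    print(author_text(ET.fromstring(SAMPLE)))
```

Even in this small case the logic has to interpret every change by its hand, which illustrates why keeping a separate file for the edited text can be the more practical choice.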

The file with the edited text still contains the complete author’s version, according to the critical analysis of the editor. Editorial changes to the author’s text are grouped together with the original wording in the TEI-element choice and the changes are visualized in the digital edition. The changed section is highlighted and the original wording displayed in a tooltip. Thus, the combination of facsimile, transcription and edited text in the digital edition visualizes the editor’s source(s), interpretation and changes to the text.

Sources

Nationalencyklopedin, “textkritik”. http://www.ne.se/uppslagsverk/encyklopedi/lång/textkritik (accessed 2017-10-19).

Poster [publication ready]

Digitizing the Icelandic-Danish Blöndal Dictionary

Steinþór Steingrímsson

The Árni Magnússon Institute for Icelandic Studies, Iceland,

The Icelandic-Danish dictionary compiled by Sigfús Blöndal in the early 20th century is being digitized. It is the largest dictionary ever published in Icelandic, containing in total more than 150,000 entries. The digitization work started with a pilot project in 2016, which resulted in a comprehensive plan for carrying out the task. The paper describes the ongoing work, the methods and tools applied, and the aim and rationale of the project. We opted for OCR rather than the double-keying that has become common for similar projects. First results suggest the outcome is satisfactory, as the final version will be proofread. The entries are annotated with XML entities, using a workbench built for the project. We apply automatic annotation for the most consistent entities, but other annotation is carried out manually. The data is then exported into a relational database, proofread and finally published. Publication date is set for spring 2020.

Poster [abstract]

Network visualization for historical corpus linguistics: externally-defined variables as node attributes

Timo Korkiakangas

University of Oslo,

In my poster presentation, I will explore whether and how network visualization can benefit philological and historical-linguistic research. This will be implemented by examining the usability of network visualization for the study of early medieval Latin scribes' language competences. Thus, the scope is mainly methodological, but the proposed methodological choices will be illustrated by applying them to a real data set. Four linguistic variables extracted corpus-linguistically from a treebank will be examined: spelling correctness, classical Latin prepositions, genitive plural form, and <ae> diphthong. All four are continuous, which is typical of linguistic variables. The variables represent different domains of language competence of the scribes, who by that time learnt written Latin practically as a second language. Even more linguistic features will be included in the analysis if my ongoing project proceeds as planned.

Thus, the primary objective of the study is to find out whether the network visualization approach has demonstrable advantages over ordinary cross-tabulations in supporting philological and historical-linguistic argumentation. The main means of visualization will be the gradient colour palette in Gephi, a widely used open-source network analysis and visualization software package. As an inevitable part of the described enterprise, it is necessary to clarify the scientific premises for using a network environment to display externally-defined values of linguistic variables. It is obvious that in order to be utilized for research purposes, network visualization must be as objective and replicable as possible.

By way of definition, I emphasize that the proposed study will not deal with linguistic networks proper, i.e. networks which are directly induced or synthesized from a linguistic data set and represent abstract relations between linguistic units. Consequently, no network metric will be calculated, even though that might be interesting as such. What will be visualized are the distributions of linguistic variables that do not arise from the network itself, but are derived externally from a medium-sized treebank by exploiting its lemmatic, morphological, and, hopefully, also syntactic annotation layers. These linguistic variables will be visualized as attributes of the nodes in the trimodal "social" network which consists of the documents, persons, and places that underlie the treebank. These documents, persons, and places are encoded as the metadata in the treebank. The nodes are connected to each other by unweighted edges. The number of document nodes is 1,040, scribe nodes 220, and writing place nodes 84. In most cases, the definition of the 220 writer nodes is straightforward, given that the scribes scrupulously signed what they wrote, with the exception of eight documents. The place nodes are more challenging. Although 78% of the documents were written in the city of Lucca, the disambiguation and re-grouping of small localities of which little is known was time-consuming and the results were not always fully satisfying. The nodes will be set on the map background by utilizing Gephi's Geo Layout and Force Atlas 2 algorithms.

The linguistic features that will be visualized reflect the language change that took place in late Latin and early medieval Latin, roughly the 3rd to 9th centuries AD. The features are operationalized as variables which quantify the variation of those features in the treebank. This quantification is based on the numerical output of a plethora of corpus-linguistic queries which extract from the treebank all constructions or forms that meet the relevant criteria. The variables indicate the relative frequency of the examined features in each document, scribe, and writing place. For the scribes and writing places, the percentages are calculated by counting the occurrences within all the documents written by that scribe or in that place, respectively.
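The pooling described above can be sketched in a few lines of Python. The sketch assumes that each document record carries a count of matching forms ("hits") and a count of all forms that could have matched ("slots"); these field names, the helper relative_frequency and the three sample records are hypothetical and do not come from the treebank.

```python
from collections import defaultdict


def relative_frequency(records, key):
    """Pool the counts of all documents that share the same value of `key`."""
    hits = defaultdict(int)
    slots = defaultdict(int)
    for record in records:
        hits[record[key]] += record["hits"]
        slots[record[key]] += record["slots"]
    # Groups without any relevant context are left out rather than set to 0.
    return {k: hits[k] / slots[k] for k in hits if slots[k] > 0}


# Invented counts for three documents.
records = [
    {"document": "d1", "scribe": "A", "place": "Lucca", "hits": 8, "slots": 10},
    {"document": "d2", "scribe": "A", "place": "Lucca", "hits": 2, "slots": 10},
    {"document": "d3", "scribe": "B", "place": "Pisa", "hits": 3, "slots": 4},
]

per_document = relative_frequency(records, "document")
per_scribe = relative_frequency(records, "scribe")
per_place = relative_frequency(records, "place")
```

Because the counts are pooled before dividing, a scribe's value is weighted by the length of each document rather than being the plain average of the document percentages.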

The resulting linguistic variables are continuous, hence the practicality of the gradient colouring. In order to ground colouring in the statistical dispersion of the variable values and to conserve maximal visual effect, I customize the Gephi default red-yellow-blue palette so that the maximal yellow, which stands for the middle of the colour scale, marks the mean of the distribution of each variable. Likewise, the thresholds of the maximal red and maximal blue are set equally far from the mean. I chose that distance to be two standard deviations away from the mean. In this way, only around 2.5% of the nodes at each end of the distribution, those with the lowest and the highest values, are maximally saturated with red or blue, while the rest of the nodes, around 95%, feature a gradient colour, including the maximal yellow in between. Following this rule, I will illustrate the variables both separately and as a sum variable. The images will be available in the poster. The sum variable will be calculated by aggregating the standardized simple variables.
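The colouring rule and the sum variable can be written out as a small Python sketch. It is not Gephi code: it only computes, for each value, a position between 0 and 1 on the colour scale, with 0.5 at the mean and the two ends reached at two standard deviations from the mean. Which end is red and which is blue, the use of the population standard deviation, and the reading of "aggregating" as a plain sum of z-scores are assumptions made for the example.

```python
from statistics import mean, pstdev


def colour_position(value, mu, sigma, k=2.0):
    """Position on the colour scale: 0.0 and 1.0 are the saturated ends,
    0.5 (maximal yellow) is the mean; saturation is reached at mu +/- k*sigma."""
    if sigma == 0:
        return 0.5
    z = (value - mu) / sigma
    return min(1.0, max(0.0, 0.5 + z / (2 * k)))


def standardize(values):
    """Turn raw values into z-scores (population standard deviation assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def sum_variable(*variables):
    """Add up the z-scores of several variables node by node."""
    return [sum(zs) for zs in zip(*(standardize(v) for v in variables))]


# Invented values of one variable for five nodes.
values = [0.10, 0.35, 0.50, 0.65, 0.90]
mu, sigma = mean(values), pstdev(values)
positions = [colour_position(v, mu, sigma) for v in values]
```

Clamping at the two thresholds is what makes all values beyond two standard deviations share the fully saturated colour, so that a few extreme nodes do not compress the gradient of the remaining ones.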

The preliminary conclusions include the observation that network visualization, as such, is not a sufficient basis for philological or historical-linguistic argumentation, but if used along with a statistical approach, it can support argumentation by drawing attention to unexpected patterns and – on the other hand – to irregularities. However, it is the geographical layout of the graphs that adds the most value compared with traditional approaches: it helps in perceiving patterns that would otherwise have gone unnoticed.

The treebank on which the analyses are based is the Late Latin Charter Treebank (version 2, LLCT2), which consists of 1,040 early medieval Latin documentary texts (c. 480,000 words). The documents were written in historical Tuscia (Tuscany), Italy, between AD 714 and 897, and are mainly sale or purchase contracts or donations, accompanied by a few judgements as well as lists and memoranda. LLCT2 is still under construction, and so far only the first half of it has been provided with a syntactic annotation layer, which makes that half a treebank proper (i.e. LLCT, version 1). The lemmatization and morphological annotation style are based on the Ancient Greek and Latin Dependency Treebank (AGLDT) style, which can be deduced from the Guidelines for the Syntactic Annotation of Latin Treebanks. Korkiakangas & Passarotti (2011) define a number of additions and modifications to these general guidelines, which are designed for Classical Latin. For a more detailed description of the LLCT2 and the underlying text editions, see Korkiakangas (in press). Documents are privileged material for examining the spoken/written interface of early medieval Latin, in which the distance between the spoken and written codes had grown considerable by Late Antiquity. The LLCT2 documents have precise dating and location metadata and they survive as originals.

Bibliography

Adams J.N. Social variation and the Latin language. Cambridge University Press (Cambridge), 2013.

Araújo T. and Banisch S. Multidimensional Analysis of Linguistic Networks. Mehler A., Lücking A., Banisch S., Blanchard P. and Job, B. (eds) Towards a Theoretical Framework for Analyzing Complex Linguistic Networks. Springer (Berlin, Heidelberg), 2016, 107-131.

Bamman D., Passarotti M., Crane G. and Raynaud S. Guidelines for the Syntactic Annotation of Latin Treebanks (v. 1.3), 2007 http://nlp.perseus.tufts.edu/syntax/treebank/ldt/1.5/docs/guidelines.pdf.

Barzel B. and Barabási A.-L. Universality in network dynamics. Nature Physics. 2013;9:673-681.

Bergs A. Social Networks and Historical Sociolinguistics: Studies in Morphosyntactic Variation in the Paston Letters. Walter de Gruyter (Berlin), 2005.

Ferrer i Cancho R. Network theory. Hogan P.C. (ed.) The Cambridge Encyclopedia of the Language Sciences. Cambridge University Press (Cambridge), 2010, 555–557.

Korkiakangas T. (in press) Spelling Variation in Historical Text Corpora: The Case of Early Medieval Documentary Latin. Digital Scholarship in the Humanities.

Korkiakangas T. and Lassila M. Abbreviations, fragmentary words, formulaic language: treebanking medieval charter material. Mambrini F., Sporleder C. and Passarotti M. (eds) Proceedings of the Third Workshop on Annotation of Corpora for Research in the Humanities (ACRH-3), Sofia, December 13, 2013. Bulgarian Academy of Sciences (Sofia), 2013, 61-72.

Korkiakangas T. and Passarotti M. Challenges in Annotating Medieval Latin Charters. Journal of Language Technology and Computational Linguistics. 2011;26,2:103-114.

Poster [abstract]

Approaching a digital scholarly edition through metadata

Katarina Pihlflyckt

Svenska litteratursällskapet i Finland r.f.

This poster presents a flowchart with an overview of the database structure in the digital critical edition of Zacharias Topelius Skrifter (ZTS). It shows how the entity relations open a possibility for the user to approach the edition from other angles than the texts, using informative metadata through indexing systems. Through this data, a historian can easily capture for example events, meetings between people or editions of books, as they are presented in Zacharias Topelius’ (1818–1898) texts. Presented here are both already available features and features in progress.

ZTS comprises eight digital volumes hitherto, the first published in 2010. This includes the equivalent of about 8 500 pages of text by Topelius, 600 pages of introduction by editors and 13 000 annotations. The published volumes cover poetry, short stories, correspondences, children’s textbooks, historical-geographical works and university lectures on history and geography. It is freely accessible at topelius.sls.fi. Genres still to be published include children’s books, novels, journalism, academica, diaries and religious texts.

DATABASE STRUCTURE

The ZTS database structure consists of six connected databases: people, places, bibliography, manuscripts, letters and a chronology. So far, the people database consists of about 10 000 unique persons, with the possibility of linking them to a family or group level (250 records). It has separate sections for mythological persons (500 records) and fictive characters (250 records). The geographic database has 6 000 registered places. The bibliographic database has 6 000 editions divided among 3 500 different works, and the manuscript database has 1 400 texts on 350 physical manuscripts. The letter database has 4 000 registered letters to and from Topelius, divided among 2 000 correspondences. The chronology of Topelius’ life has 7 000 marked events. The indexing of objects started in 2005, using the FileMaker system. New records are continuously added, and the work of finding more ways to use, link and present the data is ongoing. The users can freely access the information in database records that link to the published volumes.

The bibliographic database is the most complex database. The structure follows the Functional Requirements for Bibliographic Records (FRBR) model, which means that we distinguish between the abstract work and the published manifestations (editions) of that work. The FRBR focuses on the content relationship and continuum between the levels; anything regarded as a separate work starts as a new abstract record, from which its own editions are created. Within ZTS, the abstract level has a practical significance in cases when it is impossible to determine to which exact edition Topelius is referring. Also taken into consideration is that, for example, articles and short stories can have their own independent editions as well as being included in other editions (e.g. a magazine, an anthology). This requires two different manifestation levels subordinate to the abstract level: the regular editions and the texts included in other editions. The records of the latter type must always link to records of the former.
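The levels described above can be pictured as a small data model. The Python sketch below is only an illustration of the relationships, not the actual ZTS schema, which is kept in FileMaker; the class names, field names and the sample titles are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Work:
    """Abstract level: a work independent of any particular edition."""
    title: str
    editions: List["Edition"] = field(default_factory=list)
    included_texts: List["IncludedText"] = field(default_factory=list)


@dataclass(eq=False)
class Edition:
    """Manifestation level 1: a regular, independent edition of a work."""
    work: Work
    title: str
    year: int


@dataclass(eq=False)
class IncludedText:
    """Manifestation level 2: a text printed inside another edition."""
    work: Work
    host: Edition  # must always point to a regular edition


def add_edition(work, title, year):
    edition = Edition(work, title, year)
    work.editions.append(edition)
    return edition


def add_included_text(work, host):
    if not isinstance(host, Edition):
        raise TypeError("an included text must link to a regular edition")
    text = IncludedText(work, host)
    work.included_texts.append(text)
    return text


# Invented example: a short story printed in a magazine issue.
magazine = Work("Example magazine")
issue = add_edition(magazine, "Example magazine, issue 1", 1850)
story = Work("Example short story")
printed = add_included_text(story, issue)
```

A reference that cannot be tied to a particular edition would simply point to the Work record, which is the practical role of the abstract level mentioned above.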

The manuscript database has a content relationship to the bibliographic database through the abstract entity of a work. A manuscript text can be regarded as an independent edition of a work in this context (a manuscript that was never published can easily have a future edition added in the bibliographic database). The manuscript text itself might share physical paper with another manuscript text. Therefore, the description of the physical manuscript is created on a separate level in the manuscript database, to which the manuscript text is connected.

The letter database follows the FRBR model; an upper level presents the whole correspondence between Topelius and another person, and a subordinated level describes each physical letter within the correspondence. It is possible to attach additional corresponding persons to occasional letters.

The people database connects to the letter database and the bibliographic database, creating a one-to-many relationship. Any writer or author has to be in the people database in order to have their information inserted into these two databases. Within the people database there is also a family or group level, where family members can be grouped, but in contrast to the letter database, this is not a superordinate level.

The geographic database follows a one-level structure. Places in letters and manuscripts can be linked from the geographic database.

The chronology database contains manually added key events from Topelius’ life, as well as short diary entries made by him in various calendars during his life. It also has automatically gathered records from other databases, based on marked dates when Topelius’ works were published or when he wrote a letter or a manuscript. The dates of birth and/or death of family members and close friends can be linked from the people database.

POSSIBILITIES FOR THE USER

Approaching a digital scholarly edition with over 8 500 pages can be a heavy task, and many will likely use the edition more as an object to study than as texts to read. For a user not familiar with the content of the different volumes, but still looking for specific information, advanced searches and indexing systems offer a faster path into the relevant text passages. The information in the ZTS database records provides a picture of Finland in the 19th century as it appears in Topelius’ works and life. A future feature for users is access to this data through an API (Application Programming Interface). This will create opportunities for users to take advantage of the data in any way they wish: to create a 19th century bookshelf, an app for the most popular 19th century names or a map of popular student hangouts in 1830s Helsinki.

Through the indexes formed by the linked data from the texts, the user can find all the occurrences of a person, a place or a book in the whole edition. One record can build a set of ontological relations, and the user can follow a theme, while moving between texts. A search for a person will provide the user with information about where Topelius mentions this person, whether it is in a letter, in his diaries or in a textbook for schoolchildren, or if he possibly meets or interacts with the person. Furthermore, the user can see if this person was the author, publisher or perhaps translator of a book mentioned by Topelius in his texts, or if the editors of ZTS have used the book as a source for editorial comments. The user will also be able to get a list of letters the person wrote to or received from Topelius. The geographic index can help the user create a geographic ontology with an overview of Topelius’ whereabouts through the annotated mentions of places in Topelius’ diaries, letters and manuscripts.

The chronology creates a base for a timeline that will not only give the user key events from Topelius’ life, but also links to the other database records. Encoded dates in the XML files (letters, diaries, lectures, manuscripts etc.) can lead the user directly to the relevant text passages.

The relation between the bibliographic database and the manuscript database creates a complete bibliography of everything Topelius wrote, including all known manuscripts and editions that relate to a specific work. So far, there are 900 registered independent works by Topelius in the bibliographic database; these works are implemented in 300 published editions (manifestations) and 2 900 text versions included in those manifestations or in other independent manifestations. The manuscript database consists of 1 400 manuscript texts. The FRBR model offers different ways of structuring the layout of a bibliography according to the user’s needs, either through the titles of the abstract works with subordinate manifestations, or directly through the separate manifestations. The bibliography can be limited to show only editions published during Topelius’ lifetime, or to include later editions as well. Furthermore, the bibliography points the user to the published texts and manuscripts of a specific work in the ZTS edition and to text passages where the author himself discusses the work in question.

The level of detail is high in the records. For example, we register different name forms and spellings (Warschau vs Warszawa). Such information is included in the index search function and thereby eliminates problems for the end user trying to find information. Topelius often uses many different forms and abbreviations, and performing an advanced search in the texts would seldom give a comprehensive result in these cases. The letter database includes reference words describing the contents of the correspondences. Thus, the possibilities for searching in the material are expanded beyond the wordings of the original texts.
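How registered name forms widen a search can be shown with a toy index in Python. The record identifier "place:1" and the helper names are hypothetical; only the pair Warschau and Warszawa comes from the text above.

```python
def build_index(records):
    """Map every registered name form (case-insensitively) to its record ids."""
    index = {}
    for record_id, forms in records.items():
        for form in forms:
            index.setdefault(form.casefold(), set()).add(record_id)
    return index


def lookup(index, query):
    """Return the ids of all records registered under the queried name form."""
    return index.get(query.casefold(), set())


# One place record with two registered name forms.
places = {"place:1": ["Warschau", "Warszawa"]}
index = build_index(places)

print(lookup(index, "Warszawa") == lookup(index, "warschau"))
```

Both spellings lead to the same record, so every passage linked to that record is found whichever form the user types, whereas a plain full-text search would only match the spelling used in a given text.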

Poster [publication ready]

A Tool for Exploring Large Amounts of Found Audio Data

Per Fallgren, Zofia Malisz, Jens Edlund

KTH Royal Institute of Technology,

We demonstrate a method and a set of open source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration will contain first versions of a varied set of functionalities and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently.

Poster [publication ready]

The PARTHENOS Infrastructure

Sheena Dawn Bassett

PIN SCrl,

PARTHENOS is built around two ERICs from the Humanities and Arts sector, DARIAH and CLARIN, along with ARIADNE, EHRI, CENDARI, CHARISMA and IPERION-CH, and will deliver guidelines, standards, methods, pooled services and tools to be used by its partners and the whole research community. Four broad research communities are addressed: History; Linguistic Studies; Archaeology, Heritage and Applied Disciplines; and the Social Sciences. By identifying common needs, PARTHENOS will support cross-disciplinary research and provide innovative solutions.

By applying the FAIR data principles to structure the work on common policies and standards, the project has produced tools to assist researchers to find and apply the appropriate ones for their areas of interest. A virtual research environment will enable the discovery and use of data and tools and further support is provided with a set of online training modules.

Poster [abstract]

Using rolling.classify on the Sagas of Icelanders: Collaborative Authorship in Bjarnar saga Hítdælakappa

Daria Glebova

Russian State Academy of Science, Institute of Slavonic Studies

This poster will present the results of applying the rolling.classify function in Stylo (R) to a source of unknown authorship and with an extremely poor textual history – Bjarnar saga Hítdælakappa, one of the medieval Sagas of Icelanders. This case study sets aside the authorship attribution goal usual for Stylo and concentrates on the composition of the main witness of Bjarnar saga, ms. AM 551 d α, 4to (17th c.), which was the source for most of the existing copies of Bjarnar saga. It aims not only to find and visualise new arguments for the working hypothesis about the composition of AM 551 d α, 4to but also to touch upon the main questions that arise for a student of philology daring to use Stylo on Old Icelandic saga ground, i.e. what Stylo tells us, what it does not, and how one can use it while exploring the history of a text that exists in only one source.

It has been noticed that Bjarnar saga shows signs of a stylistic change between the first 10 chapters and the rest of the saga – the characters suddenly change their behaviour (Sigurður Nordal 1938, lxxix; Andersson 1967, 137-140), the narrative becomes less coherent and, as it seems, acquires a new logic of construction (Finlay 1990-1993, 165-171). A more detailed narrative analysis of the saga showed that the first and the second parts differ in their use of some narrative techniques, for example in the narrator’s work with point of view and in the amount of narratorial intervention in the saga text (Glebova 2017, 45-57). Thus, the question is – what is the relationship between the first 10 chapters and the rest of Bjarnar saga? Is the change entirely compositional and motivated by the narrative strategy of the medieval compiler, or is it actually the result of compiling two texts by two different authors?

As often happens with sagas, the problem is aggravated by the poor preservation of Bjarnar saga. There is not much to compare and work with; most of the saga witnesses are copies of one 17th c. manuscript, AM 551 d α, 4to (Boer 1893, xii-xiv; Sigurður Nordal 1938, xcv-xcvii; Simon 1966 (I), 19-149). This manuscript also has its flaws, as it has two lacunae, one at the very beginning of the saga (ch. 1-5,5 in ÍF III) and another in the middle (between ch. 14-15 in ÍF III). The second lacuna cannot be reconstructed, while the first is usually filled with a fragment from the saga’s short redaction, which was preserved in copies of a 15th c. kings’ saga compilation, the Separate Saga of St. Olaf in Bœjarbók (Finlay 2000, xlvi), and which actually ends right at the 10th chapter of the longer version. It seems that the text of the shorter version is a variant of the longer one (Glebova 2017, 13-17), and it contains a remark that there was more to the story but that it was shortened; the precise relationship between the short and long redactions, however, is impossible to reconstruct due to the lacuna in AM 551 d α, 4to. The existence of the short version with this particular length and content is indeed very important to the study of the composition of Bjarnar saga in AM 551 d α, 4to, as it raises the possibility that the first 10 chapters of AM 551 d α, 4to existed separately at some point in the textual history of Bjarnar saga, or at least that these chapters were seen by the medieval compilers as something solid and complete. This would be the last word of traditional philology concerning this case – the state of the sources does not allow us to say more. Thus, is there anything else that could shed some light on the question of whether these chapters existed separately or were written by the same hand?

In this study it was decided to try the sequential stylometric analysis available in the Stylo package for R (Eder, Kestemont, Rybicki 2013) as the function rolling.classify (Eder 2015). As we are interested in the different parts of the same text, rolling stylometry seems preferable to cluster analysis, which takes the whole text as an entity and compares it to the reference corpus; in rolling stylometry, by contrast, the text is divided into smaller segments, which allows a deeper investigation of the stylistic variation within the text itself (Rybicki, Eder, Hoover 2016, 126). For the analysis, a corpus was compiled from the two parts of Bjarnar saga and several other Old Icelandic sagas; the whole corpus was taken from sagadb.org in Modern Icelandic normalised orthography. Several tests were conducted, first with one of the parts as the test set and then with the other, with sample sizes ranging from 5000 down to 2000 words. The preliminary results show that there is a stylistic division in the saga, as the style of the first part is not present in the second one and vice versa.
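The principle of rolling classification can be illustrated with a short Python sketch. It is not the Stylo implementation and does not reproduce its classifiers or parameters: here each window of the test text is described by the relative frequencies of the most frequent words of the reference texts and is assigned to the reference text whose profile is nearest in Manhattan distance. The function names, the toy token lists and the window settings are assumptions made for the example.

```python
from collections import Counter


def profile(tokens, vocab):
    """Relative frequency of each vocabulary word in a list of tokens."""
    counts = Counter(tokens)
    n = len(tokens) or 1  # avoid division by zero for an empty segment
    return [counts[w] / n for w in vocab]


def rolling_classify(test_tokens, references, window, step, n_mfw=100):
    """Label successive windows of the test text with the nearest reference."""
    # Vocabulary: the most frequent words of the pooled reference texts.
    pooled = Counter(t for toks in references.values() for t in toks)
    vocab = [w for w, _ in pooled.most_common(n_mfw)]
    ref_profiles = {name: profile(toks, vocab) for name, toks in references.items()}

    labels = []
    last_start = max(len(test_tokens) - window, 0)
    for start in range(0, last_start + 1, step):
        segment = profile(test_tokens[start:start + window], vocab)
        nearest = min(
            ref_profiles,
            key=lambda name: sum(
                abs(a - b) for a, b in zip(segment, ref_profiles[name])
            ),
        )
        labels.append((start, nearest))
    return labels


# Toy data: two "styles" that differ only in how often they use og and að.
references = {
    "part_1": ["og"] * 8 + ["að"] * 2,
    "part_2": ["og"] * 2 + ["að"] * 8,
}
test_text = ["og"] * 10 + ["að"] * 10

result = rolling_classify(test_text, references, window=10, step=10)
```

With the toy data, the first window is assigned to "part_1" and the second to "part_2". In a real test the reference texts would be the other saga part and the comparison sagas, and the windows would usually overlap, so that a change of label along the text shows where the style shifts.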

This would be an additional argument for the idea that the first 10 chapters existed separately and were added by the Bjarnar saga compiler during the construction of the saga. One could argue that the division is not authorial but generic, as the first part is set in Norway and deals a lot with St. Olaf; the change of genre could result in a change of style. However, Stylo counts the most frequent words, which are not particularly genre-specific (like og, að, etc.); thus, collaborative authorship could still have taken place. This would be an important result in the context of the overall composition of the longer version of Bjarnar saga, as its structure shows traces of very careful planning and also of mirror composition (Glebova 2017, 18-33): could it be that the structure of one of the parts (maybe the first one) influenced the other? Whatever the case, while sewing together the existing material, the medieval compiler made an effort to create a solid text, and this effort is worth studying with more attention.

Bibliography:

Andersson, Theodor M. (1967). The Icelandic Family Saga: An Analytic Reading. Cambridge, MA.

Boer, Richard C. (1893). Bjarnar saga Hítdælakappa, Halle.

Eder, M. (2015). “Rolling Stylometry.” Digital Scholarship in the Humanities, Vol. 31-3: 457–469.

Eder, M., Kestemont, M., Rybicki, J. (2013). “Stylometry with R: A Suite of Tools.” Digital Humanities 2013: Conference Abstracts. University of Nebraska–Lincoln: 487–489.

Finlay, A. “Nið, Adultery and Feud in Bjarnar saga Hítdælakappa.” Saga-Book of the Viking Society 23 (1990-1993): 158-178.

Finlay, A. The Saga of Bjorn, Champion of the Men of Hitardale, Enfield Lock, 2000.

Glebova D. A Case of An Odd Saga. Structure in Bjarnar saga Hítdælakappa. MA thesis, University of Iceland. Reykjavík, 2017 (http://hdl.handle.net/1946/27130).

Rybicki, J., Eder, M., Hoover, David L. (2016). “Computational Stylistics and Text Analysis.” In Doing Digital Humanities: Practice, Training, Research, edited by Constance Crompton, Richard J. Lane, Ray Siemens. London, New York: 123-144.

Sigurður Nordal, and Guðni Jónsson (eds.) “Bjarnar saga Hítdælakappa.” In Borgfirðinga sögur, Íslenzk fornrit 3, 111-211. Reykjavík, 1938.

Simon, John LeC. A Critical Edition of Bjarnar saga Hítdælakappa. Vol. 1-2. Unpublished PhD thesis, University of London, 1966.

Poster [abstract]

The Bank of Finnish Terminology in Arts and Sciences – a new form of academic collaboration and publishing

Johanna Enqvist, Tiina Onikki-Rantajääskö

University of Helsinki,

This presentation concerns the multidisciplinary research infrastructure project “Bank of Finnish Terminology in Arts and Sciences (BFT)” as an innovative form of academic collaboration and publishing. The BFT, which was launched in 2012, aims to build a permanent and continuously updated terminological database for all fields of research in Finland. Content for the BFT is created by niche-sourcing, where the participation is limited to a particular group of experts in the participating subject fields. The project maintains a wiki-based website which offers an open and collaborative platform for terminological work and a discussion forum available to all registered users.

The BFT thus opens not only the results but the whole academic procedure where the knowledge is constantly produced, evaluated, discussed and updated in an ongoing process. The BFT also provides an inclusive arena for all the interested people – students, journalists, translators and enthusiasts – to participate in the discussions relating to concepts and terms in Finnish research. Based on the knowledge and experiences accumulated during the BFT project we will reflect on the benefits, challenges, and future prospects of this innovative and globally unique approach. Furthermore, we will consider the possibilities and opportunities opening up especially in terms of digital humanities.

Poster [publication ready]

The Swedish Language Bank 2018: Research Resources for Text, Speech, & Society

Lars Borin¹, Markus Forsberg¹, Jens Edlund², Rickard Domeij³

¹University of Gothenburg; ²KTH Royal Institute of Technology; ³The Institute for Language and Folklore

We present an expanded version of the Swedish research resource the Swedish Language Bank. The Language Bank, which has supported national and international research for over four decades, will now add two branches, one focusing on speech and one on societal aspects of language, to its existing organization, which targets text.

Poster [abstract]

Handwritten Text Recognition and 19th Century Court Records

Maria Kallio

National Archives Finland,

This paper will demonstrate how the READ project is developing new technologies that will allow computers to automatically process and search handwritten historical documents. These technologies are brought together in the Transkribus platform, which can be downloaded free of charge at https://transkribus.eu/Transkribus/. Transkribus enables scholars with no in-depth technological knowledge to freely access and exploit algorithms which can automatically process handwritten text. Although there is already a rather sound workflow in place, the platform needs human input in order to ensure the quality of the recognition. The technology must be trained by being shown examples of images of documents and their accurate transcriptions. This helps it to understand the patterns which make up characters and words. This training data is used to create a Handwritten Text Recognition model which is specific to a particular collection of documents. The more training data there is, the more accurate the Handwritten Text Recognition can become.

Once a Handwritten Text Recognition model has been created, it can be applied to other pages from the same collection of documents. The machine analyses the image of the handwriting and then produces textual information about the words and their position on the page, providing best guesses and alternative suggestions for each word, with measures of confidence. This process allows Transkribus to provide the automatic transcription and full-text search of a document collection at high levels of accuracy.

For the quality of the text recognition, the amount of training material is paramount. Current tests suggest that models for a specific style of handwriting can reach a Character Error Rate of less than 5%. Transcripts with a Character Error Rate of 10% or below can generally be understood by humans and used for adequate keyword searches. A low Character Error Rate also makes it relatively quick and easy for human transcribers to correct the output of the Handwritten Text Recognition engine. These corrections can then be fed back into the model in order to make it more accurate. These levels also compare favorably with Optical Character Recognition, where 95-98% accuracy for early prints is possible.
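For readers unfamiliar with the measure, the Character Error Rate is commonly defined as the edit distance between the automatic transcript and a correct reference transcript, divided by the number of characters in the reference. The Python sketch below follows that common definition; it is not taken from Transkribus, and the function names and the sample word are chosen for the example.

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions and substitutions
    needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance relative to the length of the reference transcript."""
    if not reference:
        raise ValueError("the reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character in a ten-character reference gives a CER of 10%.
print(character_error_rate("abcdefghij", "abcdefghiX"))
```

Under this definition, a Character Error Rate of 5% corresponds to roughly one character in twenty that a human transcriber has to correct.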

Of even more interest is the fact that a well-trained model is able to tolerate a certain amount of variation in handwriting. Therefore, it can be expected that, with a large amount of training material, it will be possible to recognize the writing of an entire epoch (e.g. eighteenth-century English writing), in addition to that of specific writers.

The case study of this paper is the Finnish court records from the 19th century. The notification records, which contain cases concerning guardianships, titles and marriage settlements, form an enormous collection of over 600 000 pages. Although the material is in digital form, its usability is still poor due to the lack of indices or finding aids. With the help of Handwritten Text Recognition, the National Archives have the chance to provide the material in computer-readable form, which allows users to search and use the records in a whole new way.

Poster [publication ready]

An approach to unsupervised ontology term tagging of dependency-parsed text using a Self-Organizing Map (SOM)

Seppo Nyrkkö

University of Helsinki

Tagging ontology-based terms on existing text content is a task that often requires human effort. Each ontology may have its own structure and schema for describing terms, making automation non-trivial. I suggest a machine learning estimation technique for term tagging which can learn semantic tagging from a set of sample ontologies with given textual examples, and extend its use to the analysis of a large text corpus by comparing the syntactic features found in the text. The tagging technique is based on dependency-parsed text input and an unsupervised machine learning model, the Self-Organizing Map (SOM).

Poster [abstract]

Comparing Topic Model Stability Between Finnish, Swedish and French

Simon Hengchen, Antti Kanner, Eetu Mäkelä, Jani Marjanen

University of Helsinki

1 Abstract

In recent years, topic modelling has gained increasing attention in the humanities.

Unfortunately, little has been done to determine whether the output produced by this range of probabilistic algorithms reveals signal or merely produces noise, or how well it performs on languages other than English.

In this paper, we set out to compare topic models of parallel corpora in Finnish, Swedish, and French, and propose a method to determine how well the topic modelling algorithms perform on those languages.

2 Context

Topic modelling (TM) is a well-known (following the work of (4; 5)) yet poorly understood range of algorithms within the humanities.

While a variety of studies within the humanities make use of topic models to answer historical questions (see (2) for a thorough survey), there is no tried and true method that ascertains that the probabilistic algorithm reveals signal and is not merely responding to noise.

The rule of thumb is generally that if the results are interesting and reveal a prior intuition by a domain expert, they are considered correct -- in the sense that they are a valid entry point into a humongous dataset, and that the proper work of historical research is to be then manually carried out on a subset selected by the algorithm.

As pointed out in previous work (7; 3), this, combined with the fact that many humanistic corpora are on the small side, means that "the threshold for the utility of topic modelling across DH projects is as yet highly unclear."

Similarly, topic instability "may lead to research being based on incorrect foundational assumptions regarding the presence or clustering of conceptual fields on a body of work or source material" (3).

Whilst topic modelling techniques are considered language-independent, i.e. "use[] no manually constructed dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphologies, or the like" (6), they encode key assumptions about the statistical properties of language.

These assumptions are often developed with English in mind and generalised to other languages without much consideration.

We maintain that these algorithms are not language-independent, but language-agnostic at best, and that accounting for discrepancies in how different languages are processed by the same algorithms is necessary basic research for more applied, context-oriented research -- especially for the historical development of public discourses in multilingual societies or phenomena where structures of discourse flow over language borders.

Indeed, some languages heavily rely on compounding -- the creation of a word through the combination of two or more stems -- in word formation, while others use determiners to combine simple words.

If one considers a white space as the delimitation between words (as is usually done with languages making use of the Latin alphabet), the first tendency results in a richer vocabulary than the second, hence influencing TM algorithms that follow the bag-of-words approach.
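The effect of whitespace tokenisation can be illustrated with a toy sketch. It combines a handful of placeholder stems either into single compound tokens or into multi-word phrases joined by a linking word, and counts distinct word types in each case. The stems, the linking word and the function names are invented for this illustration and do not come from our corpus; the sketch only shows the combinatorial argument, not the behaviour of any real language.

```python
from itertools import product

# Placeholder stems, not real words: four modifiers and four heads.
MODIFIERS = ["ma", "mb", "mc", "md"]
HEADS = ["ha", "hb", "hc", "hd"]
LINKER = "of"  # stands in for whatever linking word a language uses


def type_token_counts(documents):
    """Return (number of distinct types, number of tokens) after
    splitting every document on white space."""
    tokens = [token for doc in documents for token in doc.split()]
    return len(set(tokens)), len(tokens)


# Compounding style: every modifier-head pair becomes one new token.
compounding_docs = [m + h for m, h in product(MODIFIERS, HEADS)]

# Phrase style: the same pairs are expressed with separate, reused tokens.
phrase_docs = [f"{h} {LINKER} {m}" for m, h in product(MODIFIERS, HEADS)]

if __name__ == "__main__":
    for label, docs in (("compounding", compounding_docs),
                        ("phrases", phrase_docs)):
        n_types, n_tokens = type_token_counts(docs)
        print(f"{label}: {n_types} types over {n_tokens} tokens")
```

With n modifiers and n heads, the compounding style yields n × n distinct types that each occur once, whereas the phrase style reuses 2n + 1 types many times. A bag-of-words model therefore sees a much sparser vocabulary in the first case.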

Similarly, differences in grammar -- for example, French adjectives must agree in gender and number with the noun they modify, something that does not exist in other languages like English -- reinforce those discrepancies.

Nonetheless, most of this happens in the fuzzy and non-standard preprocessing stage of topic modelling, and the argument could be made that the language neutrality of TM algorithms rests more on it being underspecified with regard to how to pre-process the language.

In this paper, we propose to compare topic models on a custom-made parallel corpus in Finnish, Swedish, and French.

By selecting those languages, we have a glimpse of how a selection of different languages are processed by TM algorithms.

While concentrating on languages spoken in Europe and languages of interest to our collaborative network of linguists, historians and computer scientists, we are still able to examine two crucial variables: one of genetic and one of cultural relatedness.

French and Swedish belong to the Indo-European family (Romance and Germanic branches, respectively) and Finnish is a Finno-Ugric language.

Finnish and Swedish on the other hand share a long history of close language contact and cultural convergence.

Because of this, Finnish contains a large number of Swedish loan words, and, presumably, similar conceptual systems.

3 Methodology

To explore our hypothesis, we use a parallel corpus of born-digital textual data in Finnish, Swedish, and French.

Once the corpus is constituted, it becomes possible to apply LDA (1) and HDP (9) -- LDA is parametrised by humans, whereas HDP will attempt to automatically determine the best configuration possible.

The resulting models for each language are stored, the corpora are reduced in size, LDA is re-applied, the new models are stored, the corpora are reduced again, and so on.

Topic models are compared manually between languages at each stage, and programmatically between stages, using the Jaccard Index (8), for all languages.

The same workflow is then applied to the lemmatised version of the above-mentioned corpora, and results compared.
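As an illustration of the programmatic comparison between stages, the sketch below computes the Jaccard Index between topics, each represented by its list of top words. It then reports, for every topic in the earlier model, its best match in the later model. The best-match strategy, the function names and the placeholder words are assumptions made for this example, not a description of our implementation.

```python
def jaccard_index(words_a, words_b) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty topics are treated as identical
    return len(set_a & set_b) / len(union)


def best_match_stability(model_before, model_after):
    """For each topic in `model_before`, the highest Jaccard Index it
    reaches against any topic in `model_after` (0.0 if there is none)."""
    return [
        max((jaccard_index(topic, other) for other in model_after), default=0.0)
        for topic in model_before
    ]


if __name__ == "__main__":
    # Placeholder top-word lists for two stages of corpus reduction.
    full_corpus_model = [["w1", "w2", "w3", "w4"], ["w5", "w6", "w7", "w8"]]
    reduced_corpus_model = [["w1", "w2", "w3", "w9"], ["w5", "w6", "w10", "w11"]]
    for score in best_match_stability(full_corpus_model, reduced_corpus_model):
        print(f"{score:.2f}")
```

A score close to 1 means that a topic survives the reduction almost unchanged, while a score close to 0 means that no comparable topic is found at the next stage.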

Bibliography

[1] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003)

[2] Brauer, R., Fridlund, M.: Historicizing topic models, a distant reading of topic modeling texts within historical studies. In: International Conference on Cultural Research in the context of "Digital Humanities", St. Petersburg: Russian State Herzen University (2013)

[3] Hengchen, S., O'Connor, A., Munnelly, G., Edmond, J.: Comparing topic model stability across language and size. In: Proceedings of the Japanese Association for Digital Humanities Conference 2016 (2016)

[4] Jockers, M.L.: Macroanalysis: Digital methods and literary history. University of Illinois Press (2013)

[5] Jockers, M.L., Mimno, D.: Significant themes in 19th-century literature. Poetics 41(6), 750–769 (2013)

[6] Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2-3), 259–284 (1998)

[7] Munnelly, G., O'Connor, A., Edmond, J., Lawless, S.: Finding meaning in the chaos (2015)

[8] Real, R., Vargas, J.M.: The probabilistic basis of Jaccard's index of similarity. Systematic Biology 45(3), 380–385 (1996)

[9] Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476), 1566–1581 (2006)

Poster [abstract]

ARKWORK: Archaeological practices and knowledge in the digital environment

Suzie Thomas², Isto Huvila¹, Costis Dallas³, Rimvydas Laužikas⁴, Antonia Davidovic⁹, Arianna Traviglia⁶, Gísli Pálsson⁷, Eleftheria Paliou⁸, Jeremy Huggett⁵, Henriette Roued⁶

¹Uppsala University; ²University of Helsinki; ³University of Toronto; ⁴Vilnius University; ⁵University of Glasgow; ⁶University of Venice; ⁷Umeå University; ⁸University of Copenhagen; ⁹Independent researcher

Archaeology and material cultural heritage have often enjoyed a particular status as a form of heritage that has captured the public imagination. As researchers from many backgrounds have discussed, it has become the locus for the expression and negotiation of European, local, regional, national and intra-national cultural identities, for public policy regarding the preservation and management of cultural resources, and for societal value in the context of education, tourism, leisure and well-being. The material presence of objects and structures in European cities and landscapes, the range of archaeological collections in museums around the world, the monumentality of the major archaeological sites, and the popular and non-professional interest in the material past are only a few of the reasons why archaeology has become a linchpin in the discussions on how emerging digital technologies and digitization can be leveraged for societal benefit. However, at the time when nations and the European community are making considerable investments in creating technologies, infrastructures and standards for digitization, preservation and dissemination of archaeological knowledge, critical understanding of the means and practices of knowledge production in and about archaeology from complementary disciplinary perspectives and across European countries remains fragmentary, and in urgent need of concertation.

In contrast to the rapid development of digital infrastructures and tools for archaeological work, relatively little is known about how digital information, tools and infrastructures are used by archaeologists and other users and producers of archaeological information such as archaeological and museum volunteers, avocational hobbyists, and others. Digital technologies (infrastructures, methods and resources) are reconfiguring aspects of archaeology across and beyond the lifecycle (i.e., also "in the wild"), from archaeological data capture in fieldwork to scholarly publication and community access/entanglement. Both archaeologists and researchers in other fields, from disciplines such as museum studies, ethnology, anthropology, information studies and science and technology studies, have conducted research on the topic, but so far their efforts have tended to be somewhat fragmented and anecdotal. This is surprising, as the lack of a better understanding of archaeological practices and knowledge work has been identified for many years as a major impediment to realizing the potential of infrastructural and tools-related developments in archaeology. The shifts in archaeological practice, and in how digital technology is used for archaeological purposes, call for a radically transdisciplinary (if not interdisciplinary) approach that brings together perspectives from reflexive, theoretically and methodologically-aware archaeology, information research, and sociological, anthropological and organizational studies of practice.

This poster presents the COST Action “Archaeological practices and knowledge work in the digital environment” (http://www.cost.eu/COST_Actions/ca/CA15201 - ARKWORK), an EU-funded network which brings together researchers, practitioners, and research projects studying archaeological practices, knowledge production and use, social impact and industrial potential of archaeological knowledge to present and highlight the on-going work on the topic around Europe.

ARKWORK (https://www.arkwork.eu/) consists of four Working Groups (WGs), with a common objective to discuss and practice the possibilities for applying the understanding of archaeological knowledge production to tackle on-going societal challenges and the development of appropriate management/leadership structures for archaeological heritage. The individual WGs have the following specific but complementary themes and objectives:

WG1 - Archaeological fieldwork

Objectives: To bring together and develop the international transdisciplinary state-of-the-art of the current multidisciplinary research on archaeological fieldwork. This covers how archaeologists conduct fieldwork and document their work and findings in different countries and contexts, and how this knowledge can be used to develop fieldwork practices and the use and usability of archaeological documentation by different stakeholder groups in society.

WG2 - Knowledge production and archaeological collections

Objectives: To integrate and push forward the current state-of-the-art in understanding and facilitating the use and curation of (museum) collections and repositories of archaeological data for knowledge production in the society.

WG3 - Archaeological knowledge production and global communities

Objectives: To bring together and develop the current state-of-the-art on global communities (including indigenous communities, amateurs, neo-pagan movements, geographical and ideological identity networks, etc.) as producers and users in archaeological knowledge production, e.g. in terms of highlighting community needs, approaches to communication of archaeological heritage, crowdsourcing and volunteer participation.

WG4 - Archaeological scholarship

Objectives: To integrate and push forward the current state-of-the-art in the study of archaeological scholarship, including academic, professional and citizen science based scientific and scholarly work.

In our poster we outline each of the working groups and provide a clear overview of the purposes and aspirations of the COST Action Network ARKWORK.

Poster [publication ready]

Research and development efforts on the digitized historical newspaper and journal collection of The National Library of Finland

Kimmo Kettunen, Mika Koistinen, Teemu Ruokolainen

University of Helsinki, Finland,

The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12 million pages mainly in Finnish and Swedish. Out of these about 5.1 million pages are freely available on the web site digi.kansalliskirjasto.fi (Digi). The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1920. The last ten years, 1911–1920, were opened in February 2017.

The digitized collection of NLF is part of a globally expanding network of library-produced historical data that offers researchers and lay persons insight into the past. In 2012 it was estimated that there were about 129 million pages and 24 000 titles of digitized newspapers in Europe [1]. A very conservative estimate of the worldwide number of titles is 45 000 [2]. The current amount of available data is probably already much larger, as the national libraries have been working steadily on digitization in Europe, North America and the rest of the world.

This paper presents work that has been carried out in the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data.

Poster [abstract]

Medieval Publishing from c. 1000 to 1500

Samu Kristian Niskanen, Lauri Iisakki Leinonen

Helsinki University

Medieval Publishing from c. 1000 to 1500 (MedPub) is a five-year project funded by the European Research Council, based at Helsinki University, and running from 2017 to 2022. The project seeks to define the medieval act of publishing, focusing on Latin authors active during the period from c. 1000 to 1500. A part of the project is to establish a database of networks of publishing. The proposed paper will discuss the main aspects of the projected database and the process of data-gathering.

MedPub’s research hypothesis is that publication strategies were not a constant but were liable to change, and that different social, literary, institutional, and technical milieux fostered different approaches to publishing. As we have already proved this proposition, the project is now advancing toward the next step, the ultimate aim of which is to complement the perception of societal and cultural changes that took place during the period from c. 1000 to 1500.

For the purposes of that undertaking, we define ‘publishing’ as a social act, involving at least two parties, an author and an audience, not necessarily always brought together. The former prepares a literary work and then makes it available to the latter. Medieval publishing was probably often a more complex process. It could engage more parties than the two, such as commentators, dedicatees, and commissioners. The social status of these networks ranged from mediocre to grand. They could consist of otherwise unknown monks; or they could include popes and emperors.

We propose that the composition of such literary networks was broadly reactive to large-scale societal and cultural changes. If so, networks of publishing can serve as a vantage point for the observation of continuity and change in medieval societies. We shall collect and analyse an abundance of data on publishing networks in order to trace how their composition in various contexts may reflect the wider world. It is that last-mentioned aspect that is the subject of this proposal.

It is a central fact for this undertaking that medieval works very often include information on dedication, commission, and commendation; and that, more often than not, this evidence is uncomplicated to collect because the statements in question tend to be short and uniform and they normally appear in the prefaces and dedicatory letters with which medieval authors often opened their works. What is more, such accounts manifestly indicate a bond between two or more parties. By virtue of these features, the evidence in question can be collected in the quantities needed for large-scale statistical analysis and processed electronically. The function and form of medieval references to dedication and commission, furthermore, remained largely a constant. Eleventh-century dedications resemble those from, say, the fourteenth century. By virtue of such uniformity the data of dedications and commissions may well constitute a unique pool of evidence of social interaction in the Middle Ages. For the data of dedications and commissions can be employed as statistical evidence in various regional, chronological, social, and institutional contexts, something that is very rare in medieval studies.

The proposed paper will introduce the categories of information the database is to embrace and put forward for discussion the modus operandi of how the data of dedications and commissions will be harvested.

Poster [abstract]

Making a bibliography using metadata

Lars Bagøien Johnsen, Arthur Tennøe

National Library of Norway, Norway,

In this presentation we will discuss how one might create a bibliography using metadata taken from libraries in conjunction with other sources. As metadata, such as topic keywords and Dewey decimal classification, is digitally available, our focus is on metadata, although we also look at book contents where possible.

Poster [abstract]

Network Analysis, Network Modeling, and Historical Big Data: The New Networks of Japanese Americans in World War II

Saara Kekki

University of Helsinki

Network analysis has become a promising methodology for studying a wide variety of systems, including historical populations. It brings new dimensions into the study of questions that social scientists and historians might traditionally ask, and allows for new questions that were previously impractical or impossible to answer using traditional methods. The increasing availability of digitized archival material and big data is making it even more appealing. When coupled with custom algorithms and interactive visualization tools, network analysis can produce remarkable new insights.

In my ongoing doctoral research, I am employing network analysis and modeling to study the Japanese American incarceration in World War II (internment). Incarceration and the government-led dispersal of Japanese Americans disrupted the lives of some 110,000 people, including over 70,000 US citizens of Japanese ancestry, for the duration of the war and beyond. Many lost their former homes and enterprises and had to start their lives over after the war. Incarceration also had a very concrete impact on the communities: about 50% of those interned did not return to their old homes.

This paper explores the changes that took place in the Japanese American community of Heart Mountain Relocation Center in Wyoming. I will especially investigate the political networks and power relations of the incarceration community. My aim is twofold: on the one hand, to discuss the changes in networks caused by incarceration and dispersal, and on the other, to address some opportunities and challenges presented by the method for the study of history.

Poster [abstract]

SuALT: Collaborative Research Infrastructure for Archaeological Finds and Public Engagement through Linked Open Data

Suzie Thomas¹, Anna Wessman¹, Jouni Tuominen²,³, Mikko Koho², Esko Ikkala², Eero Hyvönen²,³, Ville Rohiola⁴, Ulla Salmela⁴

¹University of Helsinki, Department of Philosophy, History, Culture and Art Studies; ²Aalto University, Semantic Computing Research Group (SeCo); ³University of Helsinki, HELDIG – Helsinki Centre for Digital Humanities; ⁴National Board of Antiquities, Library, Archives and Archaeological Collections

The Finnish Archaeological Finds Recording Linked Database (Suomen arkeologisten löytöjen linkitetty tietokanta – SuALT) is a concept for a digital web service catering for discoveries of archaeological material made by the public; especially, but not exclusively, metal detectorists. SuALT, a consortium project funded by the Academy of Finland and commenced in September 2017, has key outputs at every stage of its development. Ultimately it provides a sustainable output in the form of Linked Data, continuing to facilitate new public engagements with cultural heritage, and research opportunities, long after the project has ended.

While prohibited in some countries, metal detecting is legal in Finland, provided certain rules are followed, such as prompt reporting of finds to the appropriate authorities and avoidance of legally-protected sites. Despite misgivings by some about the value of researching metal-detected finds, others have demonstrated the potential of researching such finds, for example uncovering previously unknown artefact typologies. Engaging non-professionals with cultural heritage also contributes to the democratization of archaeology, and empowers citizens. In Finland metal detecting has grown rapidly in recent years. In 2011 the Archaeological Collections registered 31 single finds or assemblages of stray finds. In 2014, over 2700 objects were registered, and in 2015 nearly 3000. In 2016 over 2500 finds were registered. When the finds are reported correctly, their research value is significant. The Finnish Antiquities Act §16 obligates the finder of an object for which the owner is not known, and which can be expected to be at least 100 years old, to submit or report the object and associated information to the National Board of Antiquities (Museovirasto – NBA), the agency responsible for cultural heritage management in Finland. There is also a risk, as finders get older and even pass away, that their discoveries and collections will remain unrecorded and that all associated information is lost permanently.

In the current state of the art, while archaeologists increasingly use finds information and other data, utilization is still limited. Data can be hard to find, and available open data remains fragmented. SuALT will speed up the process of recording finds data. Because much of this data will be from outside of formal archaeological excavations, it may shed light on sites and features not usually picked up through ‘traditional’ fieldwork approaches, such as previously unknown conflict sites. The interdisciplinary approach and inclusion of user research promotes collaboration among the infrastructure’s producers, processors and consumers. By linking in with European projects, SuALT not only enables national and regional studies, but also contributes to international and transnational studies. This is significant for studies of different archaeological periods, for which the material culture usually transcends contemporary national boundaries. The ethical aspects are challenging because of the debates around engagement with metal detectorists and other artefact hunters by cultural heritage professionals and researchers, and we address head-on the wider questions around data sharing and knowledge ownership, and of working with human subjects. This includes the issues, as identified by colleagues working on similar projects elsewhere, around the concerns of metal detectorists and other finders about sharing findspot information. Finally, the usability of datasets has to be addressed, considering for example controlled vocabulary to ease object type categorization, interoperability with other datasets, and the mechanics of verification and publication processes.

The project is unique in responding to the archaeological conditions in Finland, and in providing solutions to its users’ needs within the context of Finnish society and cultural heritage legislation. While it focuses primarily on the metal detecting community, its results and the software tools developed are applicable more generally to other fields of citizen science in cultural heritage, and even beyond. For example, in many areas of collecting (e.g. coins, stamps, guns, or art), much cultural heritage knowledge as well as collections are accumulated and maintained by skilful amateurs and private collectors. Fostering collaboration, and integrating and linking these resources with those in national memory organizations would be beneficial to all parties involved, and points to future applications of the model developed by SuALT. Furthermore, there is scope to integrate SuALT into wider digital humanities networks such as DARIAH (http://www.dariah.eu).

Framing SuALT’s development as a consortium enables us to ask important questions even at development stages, with the benefit of expertise from diverse disciplines and research environments. The benefits of SuALT, aside from the huge potential for regional, national, and transnational research projects and international collaboration, are that it offers long term savings on costs, shares expertise and provides greater sustainability than is currently possible. We will explore the feasibility of publishing the finds data through international aggregation portals, such as Europeana (http://www.europeana.eu) for cultural heritage content, as well as working closely with colleagues in countries that already have established national finds databases. The technical implementation also respects the enterprise architecture of Finnish public government. Existing Open Source solutions are further developed and integrated, for example the GIS platform Oskari.org (http://oskari.org) for geodata developed by the National Land Survey with the Linked Data based Finnish Ontology Service of Historical Places and Maps (http://hipla.fi). SuALT’s data is also disseminated through Finna (http://www.finna.fi), a leading service for searching cultural information in Finland.

SuALT consists of three subprojects: subproject I “User Needs and Public Cultural Heritage Interactions” hosted by University of Helsinki; subproject II “National Linked Open Data Service of Archaeological Finds in Finland” hosted by Aalto University, and subproject III “Ensuring Sustainability of SuALT” hosted by the NBA.

The primary aim of SuALT is to produce an open Linked Data service which is used by data producers (namely the metal detectorists and other finders of archaeological material), by data researchers (such as archaeologists, museum curators and the wider public), and by cultural heritage managers (NBA). More specifically, the aims are:

a. To discover and analyse the needs of potential users of the resource, and to factor these findings into its development;

b. To develop metadata models and related ontologies for the data that take into account the specific needs of this particular infrastructure, informed by existing models;

c. To develop the Linked Data model in a way that makes it semantically interoperable with existing cultural heritage databases within Finland;

d. To develop the Linked Data model in a way that makes it semantically interoperable with comparable ‘finds databases’ elsewhere in Europe, and

e. To test the data resulting from SuALT through exploratory research of the datasets for archaeological research purposes and for cultural heritage and collection management work.

The project corresponds closely with the strategic plans of the NBA and responds to the growth of metal detecting in Finland. Internationally, it corresponds with the development of comparable schemes in other European countries and regions, such as Flanders (MetaaldEtectie en Archeologie – MEDEA initiated in 2014), and Denmark and the Netherlands (Digitale Metaldetektorfund or DIgital MEtal detector finds – DIME, and Portable Antiquities in the Netherlands – PAN, both initiated in 2016). It takes inspiration from the Portable Antiquities Scheme (PAS) Finds Database (https://finds.org.uk/database) in England and Wales. These all aspire to an ultimate goal of a pan-European research infrastructure, and will work together to seek a larger international collaborative research grant in the future. A contribution of our work in relation to the other European projects is to employ the Linked Data paradigm, which facilitates better interoperability with related datasets, additional data enrichment based on well-defined semantics and reasoning, and therefore better means for analysing and using the finds data in research and applications.

The expected scientific impacts are that the process of developing SuALT, including critically analysing comparable resources, user group research, and creating innovative solutions, will in themselves produce a rich body of interdisciplinary academic output. This will be disseminated in peer reviewed journals and at selected conferences across several disciplinary boundaries including Computer Science, Archaeology, and Cultural Heritage Studies. It also links in, at a crucial moment in the development of digital heritage management, with parallel resources elsewhere in Europe. This means that not only can a coordinated and international approach be taken in development, but that it is extremely timely, taking advantage of the opportunity to benefit from the experiences and perspectives of colleagues pursuing similar resources. SuALT ensures that Finnish cultural heritage management is at the forefront of digital heritage. The project also carries out a small-scale ‘test’ project using the database as it forms, and in this way contributes to the field of artefact studies. The contribution to future knowledge sits at a number of levels. There are technical challenges to create the linked database in a way that complements and is interoperable with existing national and international infrastructures. Solving these challenges generates contributions to understanding digital data management and service. The process of consulting users represents an important case study in formative evaluation of particular interest groups with regard to digital heritage and citizen science, as well as shedding further light on different perceptions and uses of cultural heritage. SuALT relates to the emerging trend of publishing open science data, facilitating the analysis and reuse of the data, exemplified by e.g. DataONE (http://www.dataone.org) and Open Science Data Cloud (http://www.opensciencedatacloud.org).

We hypothesise that SuALT will result in a sustainable digital data resource that responds to the different user needs, and which supports high-quality archaeological research drawing on data from Finland. SuALT also enables integration with comparative data from abroad. Outputs throughout the development process represent important contributions to research into digital heritage applications and semantic computing, serving the needs of the scientific community. The selected Linked Data methodology is suitable for archaeology and cultural heritage management due to the need to combine and connect heterogeneous data collections in the field (e.g. museum collections, finds databases abroad) and other datasets, such as vocabularies of places, persons, and time periods, benefiting cultural heritage professionals. Publishing the finds database as open data using standardised metadata formats facilitates the data’s re-use, fostering new research by the scientific community but also the development of novel applications for professionals and citizens. Taking a strategic approach to the challenge of creating this resource, and treating it as a research project, rather than developing an ad hoc resource, ensures that the project’s legacy is a significant and long-term contribution to digital curation of public-generated archaeological data.

As its key societal impact, SuALT provides a vital interface for non-professionals to contribute to and benefit from Finland’s archaeological record, and to integrate this with comparable datasets from abroad. The project enhances cooperation between non-professionals and cultural heritage managers. Careful user research ensures that SuALT offers means of engagement and access to data and other information that is usable and meaningful to a wide range of users, from metal detectorists and amateur historians, through to professional curators, cultural heritage managers, and academic researchers, domestically and abroad. SuALT’s results are not limited to metal detection but have a wider impact: the same key challenges of engaging amateur collectors to collaborate with memory organization experts in citizen science are encountered in virtually all fields of collecting and maintaining tangible and intangible cultural heritage.

The process of developing SuALT provides an unprecedented opportunity to research the use of digital platforms to engage the public with archaeological heritage in Finland. Inspired by successful initiatives such as PAS and MEDEA, the potential for individuals to self-record their finds also echoes the emerging use of crowdsourcing for public archaeology initiatives. Thus, SuALT offers a significant opportunity to contribute to further understanding digital cultural heritage and its uses, including its role within society. It is likely that the coordination of SuALT with digital finds recording initiatives in other countries will lead to a transnational platform for finds recording, giving Finland an opportunity to be at the forefront of digital heritage-based citizen science research and development.

Poster [abstract]

Identifying poetry based on library catalogue metadata

Hege Roivainen

University of Helsinki,

Changes in printing reflect historical turning points: what has been printed, when, where and by whom all derive from contemporary events and situations. A heightened need for war propaganda brings more pamphlets off the printing presses; university towns produce dissertations, from which scientific development can be deduced; and strict oppression and censorship might allow only religious publications by government-approved publishers. The history of printing has been extensively studied and numerous monographs exist. However, most of the research has consisted of qualitative studies based on close reading, which requires a profound knowledge of the subject matter yet cannot verify the extent of new innovations. For example, close reading of library catalogues does not reveal, at least not easily, the timeline of Luther’s publications, or what portion of books actually were octavo-sized and when the increase in this format occurred.

One of the sources for these kinds of studies is national library metadata catalogs, which contain information about physical book size, page counts, publishers, publication places and so forth. These catalogs have been studied using quantitative analysis. The advantage of national library catalogs is that they are often more or less complete, having records of practically everything published in a certain country or linguistic area in a certain time period. The computational approach to them has enabled researchers to connect historical turning points to their effect on printing, and the impact of a new concept has been measured against the number of re-publications, or the spread, of a book introducing a new idea. What is more, linking library metadata to the full text of the books has made it possible to analyze the change in the usage of words in massive corpora, while still limiting analysis to relevant books.

In all these cases, computational methods work better the more complete the corpus is. However, library catalogues often lack annotations for one reason or another: annotating resources might have been cut at a certain point in time, or the annotation rules may have varied between different libraries in cases where catalogues have been amalgamated, or the rules could have just changed.

One area that is particularly important for subcorpora research is genre. The genre field, when annotated for each of the metadata records, could be used to restrict the corpus to contain every one of the books that are needed and nothing more. From this subset there is a possibility of drawing timelines or graphs based on bibliographic metadata, or in the case of full texts existing, the language or contents of a complete corpus could be analysed. Despite the significance of the genre information, that particular annotation bit is often lacking.

In the English Short Title Catalogue (ESTC), genre information exists for approximately one fourth of the records. This should be enough to train a machine learning model to deduce the genre information, rather than relying solely on the annotations of librarians. The metadata field containing genre information in ESTC can contain more than one value. In most cases this means having a category and its subcategories as different values, but not always. Because of the complex definition of genre in ESTC, this paper focuses on one genre only: poetry. Besides being a relatively common genre, poetry is also of interest to literary researchers. Having a nearly complete subset of English poetry would allow for large-scale quantitative poetry analysis.

The downside of library metadata catalogues is that they contain merely the metadata, not the complete unabridged texts, which would be beneficial for machine learning modelling. I tackled this shortcoming by creating several models, each built on a set of similar features. The main ingredient for these feature sets was a concatenation of the main title and the subtitle from the library metadata. From these concatenations I created one feature set containing easily calculable features known from the earliest stylometric research, such as word counts and sentence lengths. I collected another set with the bag-of-words method, taking the frequencies of the most common words from a subset of poetry book titles. I also built one set for part-of-speech (POS) tags and another for POS trigrams. Some feature sets were extracted from the other metadata fields: physical book size, page count, topic, and whether the same author had published a poetry book all proved useful in the classification.
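The title-based feature sets described above can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the tokenisation rule and the example titles are all assumptions made for illustration, and only the word-count and bag-of-words features are shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(titles, n):
    """Return the n most frequent words across a list of titles."""
    counts = Counter(word for title in titles for word in tokenize(title))
    return [word for word, _ in counts.most_common(n)]


def title_features(title, subtitle, vocabulary):
    """Build simple features from the concatenated title and subtitle."""
    tokens = tokenize(f"{title} {subtitle}")
    counts = Counter(tokens)
    features = {
        "word_count": len(tokens),
        "mean_word_length": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
    }
    # Bag-of-words part: frequency of each vocabulary word in this title.
    for word in vocabulary:
        features[f"bow_{word}"] = counts[word]
    return features


# Invented titles standing in for a subset of annotated poetry records.
poetry_titles = [
    "An elegy on the death of a friend",
    "A poem on the late victory",
    "An ode on the peace",
]
vocabulary = top_words(poetry_titles, 3)
features = title_features("A poem", "on the death of the king", vocabulary)
```

A dictionary per record keeps the different feature sets easy to combine later, for instance when selecting the best-performing features into one superset.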

From these feature sets I handpicked the best-performing features into one superset. The resulting model performed well: despite the compactness of the metadata, the poetry books could be identified with a precision of over 90% and a recall of over 86%. I then made another run with the superset to find the poetry books that did not have the genre field annotated in the catalogue. Combining the results of the run with close reading revealed over 14,000 unannotated poetry books. I sampled one hundred each of the predicted poetry and non-poetry books to manually estimate the correctness of the predictions, and found an annotation bias in the catalogue. The bias seems to come from the fact that genre information has been annotated more frequently for broadside poetry than for other broadsides. Excluding broadsides from my samples, I got a recall of 94% and a precision of 98%.
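For readers less familiar with the two measures reported above, the following sketch shows how precision and recall are computed from gold annotations and model predictions. The label lists are invented toy data, not figures from the study, and the function name is an assumption.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall for a binary 'is poetry' label."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_pos = sum(1 for g, p in pairs if not g and p)
    false_neg = sum(1 for g, p in pairs if g and not p)
    # Precision: share of predicted poetry books that really are poetry.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: share of real poetry books that the model found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy labels: True means the record is (or is predicted to be) poetry.
gold = [True, True, True, False, False, True]
predicted = [True, True, False, False, True, True]
precision, recall = precision_recall(gold, predicted)
```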

My research strongly suggests that semi-supervised learning can be applied to library catalogues to fill in missing annotations, but this requires close attention to avoid possible pitfalls.

Poster [publication ready]

Open Digital Humanities: International Relations in PARTHENOS

Bente Maegaard

University of Copenhagen, CLARIN ERIC

One of the strong instruments for the promotion of Open Science in Digital Humanities is research infrastructures. PARTHENOS is a European research infrastructure project, built upon collaboration between two large research infrastructures in the humanities, CLARIN and DARIAH, plus a number of other initiatives. PARTHENOS aims at strengthening the cohesion of research in the broad sector of Linguistic Studies, Humanities, Cultural Heritage, History, Archaeology and related fields. This is the context in which we should see the efforts related to international liaisons. This effort takes its point of departure in the existing international relations, so the first action was to collect information and to analyse it along different dimensions. Secondly, we want to analyse the purpose and aims of international collaboration. There are many ideas about how the international network may be strengthened and exploited, so that higher quality is obtained, and more data, tools and services are shared. The main task of the next year will be first to agree on a strategy and then to implement it in collaboration with the rest of the project. By doing so, the PARTHENOS partners will be contributing even more to the European Open Science Policies.

Poster [abstract]

The New Face of Ethnography: Utilizing Cyberspace as an Alternative Study Site

Karen Lisa Deeming

University of California, Merced,

American adoption has a familiar mission to find families for children but becomes strange when turned on its head and exposed as an institution that instead finds children for families who are willing to pay any price for a child. Its evolution, from orphan trains to open adoptions, has answered questions about biological associations but has conflated the interconnection of identity with conflicting narratives of community, kinship and self. How do the experiences of the adoption constellation reconceptualize the national image of adoption as a win-win solution to a social problem? My research explores the language utilized in multiple adoption narratives to determine individual and universal feelings that adoptees, birth parents, and adoptive parents experience regarding the transfer of children in the United States and the long term emotional outcomes for these groups. My unique approach to ethnographic research includes a hybrid digital and humanistic approach using online and offline interactions to gather data.

As is the case with all methodology, online ethnography presents both benefits and problems. On the plus side, online communities break down the walls of networks, creating digitally mediated social spaces. The Internet provides a platform for social interactions where real and virtual worlds shift and conflate. Social interactions in cybernetic environments present another option for social researchers and offer significant advantages for data collection, collaboration, and maintenance of research relationships. For some research subjects, such as members of the adoption constellation, locating target groups presents challenges for domestic adoption researchers. Online groups such as Facebook pages dedicated to specific members of the adoption triad offer a resolution to this challenge, acting as self-sorted focus groups with participants eager to provide their narratives and experiences. Ethnography involves understanding how people experience their lives through observation and non-directed interaction, with a goal of observing participants’ behavior and reactions on their own terms; this can be achieved through the presumed anonymity of online interaction. Electronic ethnography provides valuable insights and data; however, on the negative side, the danger of groupthink in Facebook communities can both attract and generate homogeneous experiences regarding adoption issues. I argue that the benefits of online ethnography outweigh the problems and can provide important, previously unexpressed views to better analyze topics such as the adoption experience. Cybernetic environments thus remain a fluid yet stable alternate social space for such research.

Late-Breaking Work

Elias Lönnrot Letters Online

Kirsi Keravuori, Maria Niku

Finnish Literature Society

The correspondence of Elias Lönnrot (1802–1884, doctor, philologist and creator of the national epic Kalevala) comprises 2 500 letters or drafts written by Lönnrot and 3 500 letters received. Elias Lönnrot Letters Online (http://lonnrot.finlit.fi/omeka/), first published in April 2017, is the conclusion of several decades of research, of transcribing and digitizing letters and of writing commentaries. The online edition is designed not only for those interested in the life and work of Lönnrot himself, but more generally for scholars and the general public interested in the work and mentality of the Finnish 19th-century nationalistic academic community, their language practices both in Swedish and in Finnish, and the study of epistolary culture. The rich, versatile correspondence offers source material for research in biography, folklore studies and literary studies; for general history as well as medical history and the history of ideas; for the study of ego documents and networks; and for corpus linguistics and history of language.

As of January 2018, the edition contains about 2000 letters and drafts of letters sent by Lönnrot. These are mostly private letters. The official letters, such as the medical reports submitted by Lönnrot in his capacity as a physician, will be added during 2018. The final stage will involve finding a suitable way of publishing the approximately 3500 letters that Lönnrot received.

The edition is built on the open-source publishing platform Omeka. Each letter and draft of a letter is published as facsimile images and an XML/TEI5 file, which contains metadata and the transcription. The letters are organised into collections according to recipient, with some exceptions such as Lönnrot's family letters, which are published in a single collection. An open text search covers the metadata and transcriptions. This is a faceted search powered by Apache Solr, which allows limiting the initial search by collection, date, language, type of document and writing location. In addition, Omeka's own search can be used to find letters based on a handful of metadata fields.
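To make the idea of a faceted search concrete, here is a minimal sketch of how a Solr select request with facets and a facet filter might be assembled. The request parameters `q`, `fq`, `facet`, `facet.field`, `rows` and `wt` are standard Solr parameters, but the base URL, the core name `letters` and the field names are assumptions for illustration and do not describe the edition's actual index.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_facet_query(base_url, text, filters, facet_fields, rows=10):
    """Build a Solr select URL with facet counts and optional facet filters."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json"), ("facet", "true")]
    # Ask Solr to return value counts for each facet field.
    params += [("facet.field", field) for field in facet_fields]
    # Each chosen facet value becomes a filter query that narrows the results.
    params += [("fq", f'{field}:"{value}"') for field, value in filters.items()]
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and field names.
url = build_facet_query(
    "http://localhost:8983/solr/letters/select",
    "Kalevala",
    {"language": "fi"},
    ["collection", "language", "document_type"],
)
query = parse_qs(urlsplit(url).query)  # decoded again, for inspection
```

Keeping the free-text query in `q` and the facet selections in `fq` means a user can add or remove a facet without changing the original search.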

The solutions adopted for the Lönnrot edition differ in some respects from the established practices of digital publishing of manuscripts in the humanities. In particular, the TEI encoding of the transcriptions is lighter than in many other scholarly editions. Lönnrot's own markings – underlinings, additions, deletions – and unclear and indecipherable sections in the texts are encoded, but place and personal names are not. This is partially due to the extensive amount of work such detailed encoding would require, partially because the open text search provides quick and easy access to the same information.

The guiding principle of Elias Lönnrot Letters is openness of data. All the data contained in the edition is made openly available.

Firstly, the XML/TEI5 files are available for download, and researchers and other users are free to modify them for their own purposes. The users can download the XML/TEI5 files of all the letters, or of a smaller section such as an individual collection. The feature is also integrated in the open text search, and can be used both for all the results produced by a search and a smaller section of the results limited by one or more facets. Thus, an individual researcher can download the XML files of the letters and study them for example with the linguistic tools provided by the Language Bank of Finland. Similarly, the raw data is available for processing and modifying by those researchers who use and develop digital humanities tools and methods to solve research questions.

Secondly, the letter transcriptions are made available for download as plain text. Data in this format is needed for qualitative analysis tools like Atlas. In addition, researchers in humanities do not all need XML files but will benefit from the ability to store relevant data in an easily readable format.
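As an illustration of the relationship between the two download formats, the sketch below reduces a TEI file to plain text using Python's standard library. The sample document is invented and heavily abbreviated, and the function name is an assumption; a real conversion would also need to decide how to treat encoded deletions and unclear passages.

```python
import xml.etree.ElementTree as ET

TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"


def tei_body_text(tei_xml):
    """Return the text of each paragraph in the TEI body, one per line."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI_NAMESPACE}}}body")
    if body is None:
        return ""
    paragraphs = []
    for paragraph in body.iter(f"{{{TEI_NAMESPACE}}}p"):
        # itertext() flattens inline markup such as <hi> or <add>;
        # split/join collapses line breaks and repeated spaces.
        text = " ".join("".join(paragraph.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Invented, abbreviated sample; not a letter from the edition.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample letter</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>Dear friend, I have <add>now</add> received your <hi rend="underline">letter</hi>.</p>
    <p>Yours, E. L.</p>
  </body></text>
</TEI>"""
plain = tei_body_text(sample)
```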

Thirdly, users of the edition can export the statistical data contained in the facet listing of each search result for processing and visualization with tools like Excel. Statistical data like this is significant in handling large masses of data, as it can reveal aspects that would remain hidden when examining individual documents. For example, it may be relevant to a researcher in what era and with whom Lönnrot primarily discussed a given theme. The statistical data of the facet search readily reveals such information, while compiling such statistics by manually going through thousands of letters would be an impossibly long process.

The easy availability of data in Elias Lönnrot Letters Online will hopefully foster collaboration and enrich research in general. The SKS is already collaborating with Finn-Clarin and the Language Bank, which have received the XML/TEI5 files. As Lönnrot's letters form an exceptionally large collection of manuscripts written by one hand, a section of the letters together with their transcriptions was given to the international READ project, which is working to develop machine recognition of old handwritten texts. A third collaborating partner is the project "STRATAS – Interfacing structured and unstructured data in sociolinguistic research on language change".

Late-Breaking Work

KuKa Digi project

Tiina H. Airaksinen, Anna-Leena Korpijärvi

University of Helsinki

This poster presents a sample of the Cultural Studies BA program’s Digital Leap project called KuKa Digi. The Digital Leap is a university-wide project that aims to support digitalization in both learning and teaching in the new degree programs at the University of Helsinki. For more information on the University of Helsinki’s Digital Leap program, please refer to: http://blogs.helsinki.fi/digiloikka/ . The new Bachelor’s Program in Cultural Studies was among the projects selected for the 2018-2019 round of the Digital Leap. The primary goal of the KuKa Digi project is to produce meaningful digital material for both teaching and learning purposes. The project aims to develop the program’s courses, learning environments and materials in a more digital direction. Another goal of the project is to produce an introductory MOOC course on Cultural Studies for university students, as well as students studying for their A-levels who may be planning to apply for the Cultural Studies BA program. Finally, we will write a research article to assess the use of digital environments in teaching and learning processes within the Cultural Studies BA program. The KuKa Digi project encourages students and teachers to co-operatively plan digital learning environments that are also useful in building up students’ academic portfolios and enhancing their working life skills.

The core idea of the project is to create a digital platform or database for teachers, researchers and students in the field of Cultural Studies. Academic networking sites do exist; however, they are not without issues. Many of them are either not accessible to, or not very useful for, students who have not yet advanced far in their academic careers. In addition, some of these sites are only partially free of charge. The digital platform will act as a place where students, teachers and researchers alike can network, advertise their expertise and specialization, and come into contact with the media, cultural agencies, companies and much more. The general vision for this platform is that it will be user friendly and flexible, and act as an “academic LinkedIn”. The database will be available in Finnish, Swedish and English. The database will include the current students, teachers and experts who are associated with the program. Furthermore, the platform will include a feature called the digital portfolio. This will be especially useful for our students, as it is intended to be a digital tool with which they can develop their own expertise within the field of Cultural Studies. Finally, the portfolio will act as a digital business card for the students. The project poster presented at the conference illustrates the ideas and concepts for the platform in more detail.

For more information on the project and its other goals, please refer to the project blog at:

http://blogs.helsinki.fi/kuka-digi/

Late-Breaking Work

Topic modelling and qualitative textual analysis

Karoliina Isoaho, Daria Gritsenko

University of Helsinki,

The pursuit of big data is transforming qualitative textual analysis—a laborious activity that has conventionally been executed manually by researchers. Access to data of unprecedented scale and scope has created a need to both analyse large data sets efficiently and react to their emergence in a near-real-time manner (Mills, 2017). As a result, research practices are also changing. A growing number of scholars have experimented with using machine learning as the main or complementary method for text analysis. Even if the most audacious assumptions ‘on the superior forms of intelligence and erudition’ of big data analysis are today critically challenged by qualitative and mixed-method researchers (Mills, 2017: 2), it is imperative for scholars using qualitative methods to consider the role of computational techniques in their research (Janasik, Honkela and Bruun, 2009). Social scientists are especially intrigued by the potential of topic modelling (TM), a machine learning method for big data analysis (Blei, 2012), as a tool for analysis of textual data.

This research contributes to a critical discussion in social science methodologies: how topic modeling can concretely be incorporated into existing processes of qualitative textual analysis and interpretation. Some recent studies paid attention to the methodological dimensions of TM vis-à-vis textual analysis. However, these developments remain sporadic, exemplifying a need for a systematic account of the conditions under which TM can be useful for social scientists engaged in textual analysis. This paper builds upon the existing discussions, and takes a step further by comparing the assumptions, analytical procedures and conventional usage of qualitative textual analysis methods and TM. Our findings show that for content and classification methods, embedding TM into research design can partially and, arguably, in some cases fully automate the analysis. Discourse and representation methods can be augmented with TM in sequential mixed-method research design.

Summing up, we see avenues for TM both in embedded and sequential mixed-method research design. This is in line with previous work on mixed-method research that has challenged the traditional assumption of there being a clear division between qualitative and quantitative methods. Scholarly capacity to craft a robust research design depends on researchers’ familiarity with specific techniques, their epistemological assumptions, and good knowledge of the phenomena that are being investigated to facilitate the substantial interpretation of the results. We expect this research to help identify and address the critical points, thereby assisting researchers in the development of novel mixed-method designs that unlock the potential of TM in qualitative textual analysis without compromising methodological robustness.

Blei, D. M. (2012) ‘Probabilistic topic models’, Communications of the ACM, 55(4), p. 77.

Janasik, N., Honkela, T. and Bruun, H. (2009) ‘Text Mining in Qualitative Research’, Organizational Research Methods, 12(3), pp. 436–460.

Mills, K. A. (2017) ‘What are the threats and potentials of big data for qualitative research?’, Qualitative Research, p. 146879411774346.

Late-Breaking Work

Local Letters to Newspapers - Digital History Project

Heikki Kokko

University of Tampere, The Centre of Excellence in the History of Experiences (HEX)

The Local Letters to Newspapers is a digital history project of the Academy of Finland Centre of Excellence in the History of Experiences HEX (2018–2025), hosted by University of Tampere. The objective is to make a new kind of digital research material available from the 19th and the early 20th century Finnish society. The aim is to introduce a database of the readers' letters submitted to the Finnish press that could be studied both qualitatively and quantitatively. The database will allow analyzing the 19th and 20th century global reality through a case study of the Finnish society. It will enable a wide range of research topics and open a path to various research approaches, especially the study of human experiences.

Late-Breaking Work

Lessons Learned from Historical Pandemics. Using crowdsourcing 2.0 and Citizen Science to map the Spanish Flu's spatial and social network.

Søren Poder

Aarhus City Archives

By Søren K. Poder, MA in History & Astrid Lykke Birkving, MA in Intellectual History

Aarhus City Archives | Redia a/s

In 1918 the world was struck by the most devastating disease in recorded history, today known as the Spanish Flu. In less than one year nearly two thirds of the world’s population came down with influenza, of whom between forty and one hundred million people died.

The Spanish Flu of 1918 did not originate in Spain, but most likely on the North American east coast in February 1918. By the middle of March, the influenza had spread to most of the overcrowded American army camps, from where it was soon carried to the trenches in France and the rest of the world. This part of the story is well known. In contrast, the diffusion of the 1918 pandemic, and of seasonal epidemics for that matter, on the regional and local level is still largely obscure. For instance, explanations of why epidemics evidently tend to follow significantly different paths in different urban areas that otherwise seem to share a common social, commercial and cultural profile tend to be more theoretical than based on evidence, for one sole reason: the lack of adequate data.

As part of the continuing scientific interest in historical epidemics, the purpose of this research project is to identify the social, economic and cultural preconditions that most likely determine a given type of locality’s ability to spread or halt an epidemic’s hierarchical diffusion.

Crowdsourcing 2.0

To this end, large amounts of data from a variety of different historical sources have to be collected and linked together. To do this we use traditional crowdsourcing techniques, where volunteers participate in transcribing different historical documents, such as death certificates, censuses and patient charts. Just as importantly, the collected transcriptions form the basis for a text recognition ML module that in time will be able to recognize specific entities in a document: persons, places, diagnoses, dates, etc.

Late-Breaking Work

Analysing Swedish Parliamentary Voting Data

Jacobo Rouces, Nina Tahmasebi, Lars Borin, Stian Rødven Eide

University of Gothenburg,

We used publicly available data from voting sessions in the Swedish Parliament to represent each member of parliament (MP) as a vector in a space defined by their voting record between the years 2014 and 2017. We then applied matrix factorization techniques that enabled us to find insightful projections of this data. Namely, they allowed us to assess the level of clustering of MPs according to their party line while at the same time identifying MPs whose voting record is closer to that of other parties. They also provided a data-driven multi-dimensional political compass that makes it possible to ascertain similarities and differences between MPs and political parties. Currently, the axes of the compass are unlabeled and therefore lack a clear interpretation, but we plan to apply language technology to the parliamentary discussions associated with the voting sessions in order to identify the topics associated with these axes.
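The abstract does not specify which factorization was used. As a hedged illustration of the general idea, the sketch below finds only the single dominant axis of a small vote matrix by power iteration on the centred data, which corresponds to the first component of a principal component analysis. The encoding of votes, the function name and the toy voting records are all assumptions.

```python
import math


def leading_axis(votes, iterations=200):
    """Return the dominant axis of a vote matrix and each MP's score on it.

    votes: one row per MP; entries are +1 (yes), -1 (no) or 0 (absent).
    """
    n_rows, n_cols = len(votes), len(votes[0])
    means = [sum(row[j] for row in votes) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in votes]

    def project(axis):
        return [sum(x * a for x, a in zip(row, axis)) for row in centered]

    # Fixed, deterministic starting direction.
    axis = [1.0 / (j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        scores = project(axis)
        # Multiply by X^T X: add up the rows, each weighted by its score.
        updated = [
            sum(scores[i] * centered[i][j] for i in range(n_rows))
            for j in range(n_cols)
        ]
        norm = math.sqrt(sum(x * x for x in updated))
        if norm == 0.0:
            break
        axis = [x / norm for x in updated]
    return axis, project(axis)


# Toy records: three MPs of a hypothetical party A, then two of party B.
votes = [
    [1, 1, -1, -1, 1],
    [1, 1, -1, -1, 1],
    [1, 1, -1, 1, 1],
    [-1, -1, 1, 1, -1],
    [-1, -1, 1, 1, 1],
]
axis, scores = leading_axis(votes)
```

In this toy data the sign of each score separates the two parties, while the magnitude shows how far an MP such as the third one drifts from the party line. The overall sign of the axis is arbitrary, which is one reason such axes need interpretation before they can be labeled.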

Late-Breaking Work

Automated Cognate Discovery in the Context of Low-Resource Sami Languages

Eliel Soisalon-Soininen, Mika Hämäläinen

University of Helsinki

1 Introduction

The goal of our project is to automatically find candidates for etymologically related words, known as cognates, for different Sami languages. At first, we will focus on North Sami, South Sami and Skolt Sami nouns by comparing their inflectional forms with each other. The reason why we look at the inflections is that, in Uralic languages, it is common that there are changes in the word stem when the word is inflected in different cases. When finding cognates, the non-nominative stems might reveal more about a cognate relationship in some cases. For example, the South Sami word for arm, gïete, is closer to the partitive of the Finnish word, kättä, than to the nominative form käsi of the same word.

The fact that a great deal of previous work already exists related to etymologies of words in different Sami languages [2, 4, 8] provides us with an interesting test bed for developing our automatic methods. The results can easily be validated against databases such as Álgu [1], which incorporates results of different studies in Sami etymology in a machine-readable database.

With the help of a gold corpus, such as Álgu, we can perfect our method to function well in the case of the three aforementioned Sami languages. Later, we can expand the set of languages used to other Uralic languages such as Erzya and Moksha. This is achievable as we are basing our method on the data and tools developed in the Giellatekno infrastructure [11] for Uralic languages. Giellatekno has a harmonized set of tools and dictionaries for around 20 different Uralic languages, allowing us to bootstrap more languages into our method.

2 Related Work

In historical linguistics, cognate sets have been traditionally identified using the comparative method, the manual identification of systematic sound correspondences across words in pairs of languages. Along with the rapid increase in digitally available language data, computational approaches to automate this process have become increasingly attractive.

Computationally, automatic cognate identification can be considered a problem of clustering similar strings together, according to pairwise similarity scores given by some distance metric. Another approach to the problem is pairwise classification of word pairs as cognates or non-cognates. Examples of common distance metrics for string comparison include edit distance, longest common subsequence, and Dice coefficient.

The string edit distance is often used as a baseline for word comparison, measuring word similarity simply as the number of character or phoneme insertions, deletions, and substitutions required to make one word equivalent to the other. However, in language change, certain sound correspondences are more likely than others. Several methods rely on such linguistic knowledge by converting sounds into sound classes according to phonetic similarity [?]. For example, [15] consider a pair of words to be cognates when they match in their first two consonant classes.

In addition to such heuristics, a common approach to automatic cognate identification is to use edit distance metrics using weightings based on previously identified regular sound correspondences. Such correspondences can also be learned automatically by aligning the characters of a set of initial cognate pairs [3,7]. In addition to sound correspondences, [14] and [6] also utilise semantic information of word pairs, as cognates tend to have similar, though not necessarily equivalent, meaning. Another method heavily reliant on prior linguistic knowledge is the LexStat method [9], requiring a sound correspondence matrix, and semantic alignment.
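The difference between the plain baseline and a weighted variant can be sketched as follows. This is a minimal illustration, assuming a hand-written table of substitution costs; the weights below are invented for the example and are not derived from any real sound correspondence data.

```python
import unicodedata


def weighted_edit_distance(a, b, sub_costs=None, indel_cost=1.0):
    """Edit distance in which selected substitutions cost less than the default 1.0.

    sub_costs maps unordered character pairs, e.g. ("g", "k"), to a cost.
    With no sub_costs this is the plain Levenshtein distance.
    """
    sub_costs = sub_costs or {}
    # Normalize so that letters such as "ä" are single code points.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)

    def sub(x, y):
        if x == y:
            return 0.0
        return sub_costs.get((x, y), sub_costs.get((y, x), 1.0))

    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + indel_cost,         # delete ca
                           cur[j - 1] + indel_cost,      # insert cb
                           prev[j - 1] + sub(ca, cb)))   # substitute or match
        prev = cur
    return prev[-1]


# Hypothetical weights, invented for illustration only.
costs = {("g", "k"): 0.2, ("ï", "ä"): 0.3}
print(weighted_edit_distance("gïete", "kättä"))         # unweighted baseline
print(weighted_edit_distance("gïete", "käsi"))          # unweighted, nominative form
print(weighted_edit_distance("gïete", "kättä", costs))  # lower once weights apply
```

The word forms are the South Sami and Finnish examples from the introduction; how the weights would actually be estimated is left open here.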

However, in the context of low-resource languages, prior linguistic knowledge such as initial cognate sets, semantic information, or phonetic transcriptions is rarely available. Therefore, cognate identification methods applicable to low-resource languages call for unsupervised approaches. For example, [10] address this issue by investigating edit distance metrics based on embedding characters into a vector space, where character similarity depends on the set of characters they co-occur with. In addition, [12] investigate several unsupervised approaches such as hidden Markov models and pointwise mutual information, while also combining these with heuristic methods for improved performance.

3 Corpus

The initial plan is to base our method on the nominal XML dictionaries for the three Sami languages available on the Giellatekno infrastructure. Apart from translations, these dictionaries also contain additional lexical information to a varying degree. The additional information that might benefit our research goals includes cognate relationships, semantic tags, morphological information, derivation and example sentences.

For each noun in the noun dictionaries, we produce a list of all its inflections in different grammatical numbers and cases. This is done by using a Python library called Uralic NLP [5], specialized in NLP for Uralic languages. Uralic NLP uses FSTs (finite-state transducers) from the Giellatekno infrastructure to produce the different morphological forms.
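The enumeration of number and case combinations might look roughly like the sketch below. The tag names, the query format and the lemma are assumptions made for illustration, and the FST-based generator is passed in as a plain callable because the exact Uralic NLP interface is not described in this abstract.

```python
from itertools import product

# Assumed Giellatekno-style tags; the real tag inventory differs by language.
NUMBERS = ["Sg", "Pl"]
CASES = ["Nom", "Gen", "Acc", "Ill", "Loc", "Com"]


def inflection_queries(lemma, numbers=NUMBERS, cases=CASES):
    """Build one analysis string per number/case pair, e.g. 'giehta+N+Sg+Gen'."""
    return [f"{lemma}+N+{num}+{case}" for num, case in product(numbers, cases)]


def inflect(lemma, generate):
    """Map each query to the surface forms returned by a generator callable.

    `generate` takes an analysis string and returns a list of word forms.
    In practice it would wrap the library's FST-based generation call.
    """
    return {query: generate(query) for query in inflection_queries(lemma)}


def fake_generate(query):
    """Stand-in generator so the sketch runs without any FST models installed."""
    return [query.split("+")[0] + "?"]


forms = inflect("giehta", fake_generate)
print(len(forms))  # 2 numbers x 6 cases = 12 queries
```

Keeping the generator as a parameter means the same loop can be reused for each language once a real transducer is plugged in.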

We are also considering the possibility of including larger text corpora in these languages as a part of our method for finding cognates. However, these languages have notoriously small corpora available, which might render them insufficient for our purposes.

4 Future Work

Our research is currently in its early stages. The immediate future task is to start implementing different methods based on the previous research to solve the problem. We will first start with edit distance approaches to see what kind of information those can reveal and move towards a more complex solution from there.

A longer-term plan is to include more languages in the research. We are also interested in a collaboration with linguists who could take a more qualitative look at the cognates found by our method. This will foster interdisciplinary collaboration and exchange of ideas between scholars of different backgrounds.

We are also committed to releasing the results produced by our method for a wider audience to use and profit from. This will be done by including the results as a part of the XML dictionaries in the Giellatekno infrastructure and also by releasing them in an open-access MediaWiki-based dictionary for Uralic languages [13] developed at the University of Helsinki.

References

1. Álgu-tietokanta. Saamelaiskielten etymologinen tietokanta (Nov 2006), http://kaino.kotus.fi/algu/

2. Aikio, A.: The Saami loanwords in Finnish and Karelian. Ph.D. thesis, University of Oulu, Faculty of Humanities (2009)

3. Ciobanu, A.M., Dinu, L.P.: Automatic detection of cognates using orthographic alignment. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). vol. 2, pp. 99–105 (2014)

4. Häkkinen, K.: Suomen kirjakielen saamelaiset lainat. In: Sámit, sánit, sátnehámit. Riepmočála Pekka Sammallahtii miessemánu 21, 161–182 (2007)

5. Hämäläinen, M.: UralicNLP (Jan 2018), https://doi.org/10.5281/zenodo.1143638, doi: 10.5281/zenodo.1143638

6. Hauer, B., Kondrak, G.: Clustering semantically equivalent words into cognate sets in multilingual lists. In: Proceedings of 5th international joint conference on natural language processing. pp. 865–873 (2011)

7. Kondrak, G.: Identification of cognates and recurrent sound correspondences in word lists. TAL 50(2), 201–235 (2009)

8. Koponen, E.: Lappische lehnwörter im finnischen und karelischen. Lapponica et Uralica. 100 Jahre finnisch-ugrischer Unterricht an der Universität Uppsala. Vorträge am Jubiläumssymposium 20.–23. April 1994 pp. 83–98 (1996)

9. List, J.M., Greenhill, S.J., Gray, R.D.: The potential of automatic word comparison for historical linguistics. PloS one 12(1), e0170046 (2017)

10. McCoy, R.T., Frank, R.: Phonologically informed edit distance algorithms for word alignment with low-resource languages. Proceedings of

11. Moshagen, S.N., Pirinen, T.A., Trosterud, T.: Building an open-source development infrastructure for language technology projects. In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013); May 22-24; 2013; Oslo University; Norway. NEALT Proceedings Series 16. pp. 343–352. No. 85, Linköping University Electronic Press; Linköpings universitet (2013)

12. Rama, T., Wahle, J., Sofroniev, P., Jäger, G.: Fast and unsupervised methods for multilingual cognate clustering. arXiv preprint arXiv:1702.04938 (2017)

13. Rueter, J., Hämäläinen, M.: Synchronized mediawiki based analyzer dictionary development. In: Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages. pp. 1–7 (2017)

14. St Arnaud, A., Beck, D., Kondrak, G.: Identifying cognate sets across dictionaries of related languages. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 2519–2528 (2017)

15. Turchin, P., Peiros, I., Murray, G.M.: Analyzing genetic connections between languages by matching consonant classes. Vestnik RGGU. Seriya "Filologiya. Voprosy yazykovogo rodstva", (5 (48)) (2010)

Late-Breaking Work

Dissertations from Uppsala University 1602-1855 on the internet

Anna Cecilia Fredriksson

Uppsala University, Uppsala University Library

At Uppsala University Library, a long-term project is under way which aims at making the dissertations, that is theses, submitted at Uppsala University in 1602-1855 easy to find and read on the Internet. The work includes metadata production, scanning and OCR processing as well as publication of images of the dissertations in full-text searchable pdf files. So far, approximately 3,000 dissertations have been digitized and made accessible on the Internet via the DiVA portal, Uppsala University’s repository for research publications. All in all, there are about 12,000 dissertations of about 20 pages each on average to be scanned. This work is done by hand, due to the age of the material. The project aims to be completed in 2020.

Why did we prioritize dissertations?

Even before the project started, dissertations were valued research material, and the physical dissertations were frequently on loan. They were popular primarily because university dissertations are, in general, a great way to study developments and changes in society. In the same way as doctoral theses do today, the older dissertations reflect what was going on in the country, at the University, and in the intellectual Western world on the whole at a certain period of time. The great mass of them makes them especially suitable for comparative and longitudinal studies, and provides excellent chances for scholars to find material little used or not used at all in previous research.

Older Swedish dissertations, including those from present-day Finland, are also comparatively easy to find. In contrast to the situation at many other European libraries with an even longer history, collectors published bibliographies of Swedish dissertations as far back as 250 years ago. Our dissertations are also organized, bound and physically easily accessible. Last year the cataloguing of the Uppsala dissertations was completed according to modern standards in LIBRIS. That made them searchable by subject and by word in title, which was not possible before. All this made the digitization process smoother than that of many other kinds of cultural heritage material. The digital publication of the dissertations naturally made access to them even easier for University staff and students as well as lifelong learners in Sweden and abroad.

How are the dissertations used today?

In actual research today, we see that the material is frequently consulted in all fields of history. Dissertations provide scholars in the fields of history of ideas and history of science with insight into the status of a certain subject matter in Sweden in various periods of time, often in relation to the contemporary discussion on the European continent. The same goes for studies in history of literature and history of religion. Many of the dissertations examine subjects that remain part of the public debate today, and are therefore of interest for scholars in the political and social sciences. The languages of the dissertations are studied by scholars of Semitic, Classical and Scandinavian languages, and the dissertations often contain the very first editions and translations of certain ancient manuscripts in Arabic and Runic script. There is also a social dimension of the dissertations worthy of attention, as dedications and gratulatory poems in the dissertations mirror social networks in the educated stratum of Sweden in various periods of time. Illustrations in the dissertations were often made by local artists or the students themselves, and the great mass of gratulatory poems mirrors the less well-known side of poetry in early modern Sweden.

Our users

The users of the physical items are primarily university scholars, mostly from our own University, but there is also a great deal of interest from abroad, not least from our neighboring country Finland and from the Baltic States, which were for some time within the Swedish realm. Many projects, Swedish as well as international, are currently under way which include our dissertations as research material or have them as their primary source material. As Sweden, as a part of learned Europe, more or less shared the values, objects and methods of the Western academic world as a whole, to study Swedish science and scholarship is to study an important part of Western science and scholarship.

As for who uses our digital dissertations, we in fact do not know. The great majority of the dissertations are written in Latin, since in all countries of Europe and North America Latin was the vehicle for academic discussion in the early modern age. In the first half of the 19th century, Swedish became more common in the Uppsala dissertations. Among the ones digitized and published so far, a great deal are in Swedish. As for the Latin ones, they too are clearly much used. Although knowledge of Latin is quite unusual in Sweden, foreign scholars in the various fields of history often had Latin as part of their curriculum. Obviously, our users know at least enough Latin to recognize if a passage treats the topic of their interest. They can also identify which documents are important to them and extract the most important information from them. If a document is central, it is possible to hire a translator.

But we believe that we also reach lifelong learners, or so-called “ordinary people”. The older dissertations examine every conceivable subject and they offer pleasant reading even for non-specialists, or for people who use the Internet for genealogical research. The full-text publication makes a dissertation show up, perhaps unexpectedly, when a person is looking for a certain topic or a certain word. Whoever the users are, the digital publication of the dissertations has been well received, far beyond expectations. In the first three test years, the approximately 2,500 digitized dissertations published attracted close to one million visits and over 170,000 downloads, i.e. over 4,700 per month. This is so even though we neither offer nor demand advanced technologies for the use of these dissertations, or perhaps precisely because we do not.

The digital publication and the new possibilities for research

The database in which the dissertations are stored and presented is the same database in which researchers, scholars and students of Uppsala University, and other Swedish universities, too, currently register their publications with the option to publish them digitally. This clears a path for new possibilities for researchers to become aware of and study the texts. Most importantly, it enables users to find documents in their field, spanning a period of 400 years, in one search session. Many of the medical terms for diseases and body parts, chemical designations, and, of course, juridical and botanical terms are Latin and the same as were used 400 years ago, and can thus be used for locating text passages on these topics. But the form of the text can be studied, too. Linguists would find it useful to make quantitative studies of the use of certain words or expressions, or just to find the words of interest for further studies. The usefulness of full-text databases is well known to us all. But often a user gets either a well-working search system or a great mass of important texts, and seldom both. This problem is solved here by the interconnection between the publication database DiVA and the Swedish National Research Library System LIBRIS. The combination makes it possible to use an advanced search system with high functionality, thus reducing the Internet problem of too many irrelevant hits. It gives direct access to the digital full text in DiVA, and the option to order the physical book if the scholar needs to see the original at our library. Not least important, there is qualified staff appointed to care for the system’s long-term maintenance and updates, as part of their everyday tasks at the University Library. Also, the library is open for discussion with users.

The practical work within the project and related issues

As part of the digitization project, the images of the text pages are OCR-processed in order to create searchable full-text pdf files. The OCR process gives various results depending on the age and the language of the text. The OCR processing of dissertations in Swedish and Latin from ca. 1800 onwards results in OCR texts with a high degree of accuracy, that is, between 80 and 90 per cent, whereas older dissertations in Latin and in languages written in other alphabets will contain more inaccuracies. On this point we are not satisfied. Almost perfect OCR results, or else proof-reading, are a basic requirement for the full use and potential of this material. However, in this respect, we are dependent upon the technology which is available on the market, as this provides the best and safest product. These products were not developed for handling printing types of various sorts and sizes from the 17th and 18th centuries, and the development of these techniques, except when it comes to “Fraktur”, is slow or non-existent.

If you want to pursue further studies of the documents, you can download the documents for free to your own computer. There are free programs on the Internet that help you merge several documents of your choice into one document, in order for you to be able to search through a certain mass of text. If you are searching for something very particular, you could of course also perform a word search in Google. One of our wishes for the future is to make it possible for our users to search in several documents of their specific choice at one time, without them having to download the documents to their computer.

So, most important for us today within the dissertation project:

1) Better OCR for older texts

2) Easier ways to search in a large text mass of your own choice.

Future use and collaboration with scholars and researchers

The development of digital techniques for the further use of these texts is a future desideratum. We therefore aim to increase our collaboration with researchers who want to explore new methods to make more out of the texts. However, we always have to take into account the special demands from society when it comes to the work we, as an institute of the state, are conducting – in contrast to the work conducted by e.g. Google Books or research projects with temporary funding.

We are expected to produce both images and metadata of a reasonably high quality – a product that the University can ‘stand for’. What we produce should have a lasting value – and ideally be possible to use for centuries to come.

What we produce should be compatible with other existing retrieval systems and library systems. Important, in my opinion, are reliability and citability. A great problem with research on digitally born material is, in my opinion, that the material constantly changes, with respect to both its contents and where to find it. This puts the fundamental principle of modern science, the possibility to verify results, out of the running. This is a challenge for Digital Humanities which, with the current pace of development, surely will be solved in the near future.

Late-Breaking Work

Normalizing Early English Letters for Neologism Retrieval

Mika Hämäläinen, Tanja Säily, Eetu Mäkelä

University of Helsinki

Introduction

Our project studies social aspects of innovative vocabulary use in early English letters. In this abstract we describe the current state of our method for detecting neologisms. The problem we are facing at the moment is the fact that our corpus consists of non-normalized text. Therefore, spelling normalization is the first step we need to solve before we can apply automatic methods to the whole corpus.

Corpus

We use CEEC (Corpora of Early English Correspondence) [9] as the corpus for our research. The corpus consists of letters ranging from the 15th century to the 19th century and it represents a wide social spectrum, richly documented in the metadata associated with the corpus, including information on e.g. socioeconomic status, gender, age, domicile and the relationship between the writer and recipient.

Finding Neologisms

In order to find neologisms, we use the information of the earliest attestation of words recorded in the Oxford English Dictionary (OED) [10]. Each lemma in the OED has information about its attestations, but also variant spelling forms and inflections.

How we proceed in automatically finding neologism candidates is as follows. We get a list of all the individual words in the corpus, and we retrieve their earliest attestation from the OED. If we find a letter where the word has been used before the earliest attestation recorded in the OED, we are dealing with a possible neologism, such as the word "monotonous" in (1), which antedates the first attestation date given in the OED by two years (1774 vs. 1776).

(1) How I shall accent & express, after having been so long cramped with the monotonous impotence of a harpsichord! (Thomas Twining to Charles Burney, 1774; TWINING_017)
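The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the corpus has been reduced to (word, year of letter) pairs and the dictionary to a plain word-to-year mapping; both data structures and the function name are hypothetical.

```python
def neologism_candidates(corpus_words, earliest_attestation):
    """Return (word, letter_year, attested_year) where a letter antedates the dictionary.

    corpus_words: iterable of (word, year_of_letter) pairs.
    earliest_attestation: dict mapping a word to its earliest attested year.
    """
    first_use = {}
    for word, year in corpus_words:
        # Keep only the earliest letter in which each word occurs.
        if word not in first_use or year < first_use[word]:
            first_use[word] = year

    candidates = []
    for word, year in sorted(first_use.items()):
        attested = earliest_attestation.get(word)
        if attested is not None and year < attested:
            candidates.append((word, year, attested))
    return candidates


# Toy data: only the "monotonous" dates come from example (1);
# the "harpsichord" attestation year is a made-up placeholder.
corpus = [("monotonous", 1774), ("harpsichord", 1774), ("monotonous", 1780)]
attestations = {"monotonous": 1776, "harpsichord": 1611}
print(neologism_candidates(corpus, attestations))
```

Words missing from the mapping are skipped rather than reported, since an unmapped form may simply be a spelling variant.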

The problem, however, is that our corpus consists of texts written in different time periods, which means that there is a wide range of alternative spellings for words. Therefore, a great part of the corpus cannot be directly mapped to the OED.

Normalizing with the Existing Methods

Part of the CEEC (from the 16th century onwards) has been normalized with VARD2 [3] in a semi-automated manner; however, the automatic normalization is only applied to sufficiently frequent words, whereas neologisms are often rare words. We took these normalizations and extrapolated them over the whole corpus. We also used MorphAdorner [5] to produce normalizations for the words in the corpus. After this, we compared the newly normalized forms with those in the OED, taking into account the variant forms listed in the OED. NLTK's [4] lemmatizer was used to produce lemmas from the normalized inflected forms to map them to the OED. In doing so, we were able to map 65,848 word forms of the corpus to the OED. However, 85,362 word forms still remain without a mapping to the OED.

Different Approaches

For the remaining non-normalized words, we have tried a number of different approaches.

- Rules

- SMT

- NMT

- Edit distance, semantics and pronunciation

The simplest one of them is running the hand-written VARD2 normalization rules on the whole corpus. These are simple replacement rules that replace a sequence of characters with another one at the beginning, end or middle of a word. An example of such a rule is replacing "yes" with "ies" at the end of the word.
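Position-bound replacement rules of this kind can be sketched as below. The rule representation is an assumption made for illustration and does not reproduce the actual VARD2 rule format; only the "yes" to "ies" rule comes from the text, the other two are invented placeholders.

```python
# Each rule is (position, old, new).
RULES = [
    ("end", "yes", "ies"),    # from the text
    ("start", "vn", "un"),    # invented placeholder
    ("middle", "vv", "w"),    # invented placeholder
]


def apply_rule(word, rule):
    """Apply one replacement rule at the position it is bound to."""
    position, old, new = rule
    if position == "start" and word.startswith(old):
        return new + word[len(old):]
    if position == "end" and word.endswith(old):
        return word[:-len(old)] + new
    if position == "middle" and len(word) > 2:
        # Replace only inside the word, leaving first and last characters alone.
        return word[0] + word[1:-1].replace(old, new) + word[-1]
    return word


def normalize(word, rules=RULES):
    """Apply all rules in order and return the rewritten word."""
    for rule in rules:
        word = apply_rule(word, rule)
    return word


for form in ["ladyes", "vnder", "povver", "yes"]:
    print(form, "->", normalize(form))
```

In this sketch the word "yes" itself is rewritten to "ies", which shows why rules applied blindly need some further check on their output.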

We have also trained a statistical machine translation model (with Moses [7]) and a neural machine translation model (with OpenNMT [6]). SMT has previously been used in the normalization task, for example in [11]. Both models are character-based and treat the known pairs of non-normalized and normalized words as two languages for the translation model. The language model used for the SMT model is built from the British National Corpus (BNC) [1].
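A common way to make a word-level translation toolkit operate on characters is to write each word as a space-separated character sequence, so that every character becomes one token. The sketch below shows that data preparation step only; the helper names and the example pairs are invented, and no Moses or OpenNMT calls are shown because the training setup is not described here.

```python
def to_char_sequence(word, space_symbol="_"):
    """Split a word into space-separated characters, one token per character."""
    return " ".join(space_symbol if ch == " " else ch for ch in word)


def from_char_sequence(sequence, space_symbol="_"):
    """Invert to_char_sequence on a line of model output."""
    return "".join(" " if tok == space_symbol else tok for tok in sequence.split(" "))


def parallel_lines(pairs):
    """Turn (non_normalized, normalized) pairs into source and target lines."""
    source = [to_char_sequence(old) for old, _ in pairs]
    target = [to_char_sequence(new) for _, new in pairs]
    return source, target


pairs = [("ladyes", "ladies"), ("vnder", "under")]  # invented example pairs
src, tgt = parallel_lines(pairs)
print(src[0], "->", tgt[0])
```

The two lists would be written to a source file and a target file with matching line numbers, which is the usual input shape for translation toolkits.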

One more approach we have tried is to compare the non-normalized words to the ones in the BNC by Levenshtein edit distance [8]. This results in long lists of normalization candidates, which we filter further by their semantic similarity: we compare the two words appearing immediately before and after the non-normalized word with those appearing around each normalization candidate, and pick out the candidates with the largest number of shared contextual words. Finally, we filter this list by the edit distance between Soundex pronunciation codes. A similar method relying on semantics and edit distance has been used in the past for normalization [2].
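The three filtering steps can be sketched together as follows. This is an illustration under stated assumptions rather than the project's implementation: the distance threshold, the tie handling, the function names and the toy token lists are all invented, and the Soundex variant is the standard American one.

```python
def levenshtein(a, b):
    """Plain Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


_CODES = {ch: digit
          for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6"))
          for ch in letters}


def soundex(word):
    """Standard American Soundex code (first letter plus three digits)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded, prev = word[0].upper(), _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue                 # h and w do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code                  # vowels reset the previous code
    return (encoded + "000")[:4]


def context_words(tokens, target, window=2):
    """Words within `window` positions of any occurrence of target."""
    ctx = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            ctx.update(tokens[max(0, i - window):i])
            ctx.update(tokens[i + 1:i + 1 + window])
    return ctx


def rank_candidates(word, historical_tokens, modern_tokens, max_dist=2):
    """Filter by edit distance, then shared context, then Soundex distance."""
    candidates = [v for v in set(modern_tokens) if levenshtein(word, v) <= max_dist]
    if not candidates:
        return []
    word_ctx = context_words(historical_tokens, word)
    overlap = {c: len(word_ctx & context_words(modern_tokens, c)) for c in candidates}
    candidates = [c for c in candidates if overlap[c] == max(overlap.values())]
    code = soundex(word)
    sdist = {c: levenshtein(code, soundex(c)) for c in candidates}
    return sorted(c for c in candidates if sdist[c] == min(sdist.values()))


historical = "the old howse was sold".split()  # invented toy sentence
modern = "the old house was sold and the old horse was tired the old hose leaked".split()
print(rank_candidates("howse", historical, modern))
```

With these toy sentences the edit distance step keeps "house", "horse" and "hose", and the context step then narrows the list to the form that shares the most neighbouring words.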

The Open Question

The methods described above produce results of varying degrees of success. However, none of them is reliable enough to be trusted above the rest. We are now in a situation in which at least one of the approaches finds the correct normalization most of the time. The next unsolved question is how to pick the correct normalization from the list of alternatives in an accurate way.

Once the normalization has been solved, we face another problem, which is mapping words to the OED correctly. For example, currently the verb "to moon" is mapped to the noun "mooning" recorded in the OED because it appeared in the present participle form in the corpus. This means that in the future, we have to come up with ways to tackle not only the problem of homonyms, but also the problem of polysemy. A word might have acquired a new meaning in one of our letters, but we cannot detect this word as a neologism candidate, because the word has existed in the language in a different meaning before.

References

1. The British National Corpus, version 3 (BNC XML Edition). Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium (2007), http://www.natcorp.ox.ac.uk/

2. Amoia, M., Martinez, J.M.: Using comparable collections of historical texts for building a diachronic dictionary for spelling normalization. In: Proceedings of the 7th workshop on language technology for cultural heritage, social sciences, and humanities. pp. 84–89 (2013)

3. Baron, A., Rayson, P.: VARD2: a tool for dealing with spelling variation in historical corpora (2008)

4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly Media (2009)

5. Burns, P.R.: Morphadorner v2: A java library for the morphological adornment of English language texts. Northwestern University, Evanston, IL (2013)

6. Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.M.: OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints

7. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. pp. 177–180. Association for Computational Linguistics (2007)

8. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady. vol. 10, pp. 707–710 (1966)

9. Nevalainen, T., Raumolin-Brunberg, H., Keränen, J., Nevala, M., Nurmi, A., Palander-Collin, M.: CEEC, Corpus of Early English Correspondence. Department of Modern Languages, University of Helsinki, http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/

10. OED: OED Online. Oxford University Press, http://www.oed.com/

11. Pettersson, E., Megyesi, B., Tiedemann, J.: An SMT approach to automatic annotation of historical text. In: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013; May 22-24; 2013; Oslo; Norway. NEALT Proceedings Series 18. pp. 54–69. No. 087, Linköping University Electronic Press (2013)

Conference Agenda