Conference Agenda

Session Overview
Session: W-PII-1: Historical Texts
Time: Wednesday, 07/Mar/2018, 4:00pm - 5:30pm
Session Chair: Asko Nivala
Location: PII

Presentations
4:00pm - 4:30pm
Long Paper (20+10min) [abstract]

Diplomatarium Fennicum and the digital research infrastructures for medieval studies

Seppo Eskola, Lauri Leinonen

National Archives of Finland

Digital infrastructures for medieval studies have made great strides in Finland over the last few years. Most literary sources concerning medieval Finland (the Diocese of Åbo) are now available online in one form or another: Diplomatarium Fennicum encompasses nearly 7 000 documentary sources, the Codices Fennici project recently digitized over 200 mostly well-preserved pre-17th-century codices and placed them online, and Fragmenta Membranea contains digital images of 9 300 manuscript leaves belonging to over 1 500 fragmentary manuscripts. In terms of the availability of sources, the preconditions for research have never been better. So, what’s next?

This presentation discusses the current state of digital infrastructures for medieval studies and their future possibilities. For the past two and a half years, the presenters have been working on the Diplomatarium Fennicum web service, published in November 2017, and the topic is approached from this background. Digital infrastructures are being developed on many fronts in Finland: several memory institutions are actively engaged (the three above-mentioned web services are developed and hosted by the National Archives, the Finnish Literature Society, and the National Library, respectively), and many universities have active medieval studies programs with an interest in digital humanities. Furthermore, interest in Finnish digital infrastructures is not restricted to Finland, as Finnish sources are closely linked to those of other Nordic countries and the Baltic Sea region in general. In our presentation, we will compare the different Finnish projects, highlight opportunities for international co-operation, and discuss choices (e.g. selecting metadata models) that could best support collaboration between different services and projects.


4:30pm - 4:45pm
Short Paper (10+5min) [publication ready]

The HistCorp Collection of Historical Corpora and Resources

Eva Pettersson, Beáta Megyesi

Uppsala University

We present the HistCorp collection, a freely available open platform aiming at the distribution of a wide range of historical corpora and other useful resources and tools for researchers and scholars interested in the study of historical texts. The platform contains a monitoring corpus of historical texts from various time periods and genres for 14 European languages. The collection is taken from well-documented historical corpora, and distributed in a uniform, standardised format. The texts are downloadable as plaintext, and in a tokenised format. Furthermore, some texts are normalised with regard to spelling, and some are annotated with part-of-speech and syntactic structure. In addition, preconfigured language models and spelling normalisation tools are provided to allow the study of historical languages.
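As an illustration only, a minimal Python sketch for working with one of the downloadable texts might count token frequencies; the filename and the assumption of a whitespace-tokenised, one-sentence-per-line layout are hypothetical, not the documented HistCorp format:

    # Minimal sketch: count token frequencies in a downloaded HistCorp text.
    # The filename and the whitespace-tokenised, one-sentence-per-line layout
    # are illustrative assumptions, not the documented HistCorp format.
    from collections import Counter
    from pathlib import Path

    def token_frequencies(path):
        """Read a whitespace-tokenised plain-text file and count its tokens."""
        counts = Counter()
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            counts.update(line.split())
        return counts

    sample = Path("histcorp_sample.txt")  # hypothetical local download
    if sample.exists():
        for token, n in token_frequencies(sample).most_common(10):
            print(token, n)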


4:45pm - 5:00pm
Short Paper (10+5min) [publication ready]

Semantic National Biography of Finland

Eero Hyvönen1,2, Petri Leskinen1, Minna Tamper1,2, Jouni Tuominen2,1, Kirsi Keravuori3

1Aalto University; 2University of Helsinki (HELDIG); 3Finnish Literature Society (SKS)

This paper presents the idea and project of transforming and using the textual biographies of the National Biography of Finland, published by the Finnish Literature Society, as Linked (Open) Data. The idea is to publish the lives as semantic, i.e., machine “understandable” metadata in a SPARQL endpoint in the Linked Data Finland (LDF.fi) service, on top of which various Digital Humanities applications are built. The applications include searching and studying individual personal histories as well as historical research of groups of persons using methods of prosopography. The basic biographical data is enriched by extracting events from unstructured texts and by linking entities internally and to external data sources. A faceted semantic search engine is provided for filtering groups of people from the data for research in Digital Humanities. An extension of the event-based CIDOC CRM ontology is used as the underlying data model, where lives are seen as chains of interlinked events populated from the data of the biographies and additional sources, such as museum collections, library databases, and archives.
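As an illustration of the kind of access a SPARQL endpoint provides (the endpoint URL below is a hypothetical placeholder, and the query uses the generic FOAF vocabulary rather than the project's actual CIDOC CRM-based data model), a query from Python could look like this:

    # Sketch of querying a SPARQL endpoint over the standard SPARQL HTTP protocol.
    # The endpoint URL is a placeholder and the FOAF-based query is illustrative;
    # the real service uses an event-based CIDOC CRM extension as its data model.
    import requests

    ENDPOINT = "https://example.org/sparql"  # hypothetical endpoint URL

    QUERY = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person ?name WHERE {
      ?person a foaf:Person ;
              foaf:name ?name .
    }
    LIMIT 10
    """

    def run_query(endpoint, query):
        """Send a SELECT query and return the JSON result bindings."""
        response = requests.get(
            endpoint,
            params={"query": query},
            headers={"Accept": "application/sparql-results+json"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["results"]["bindings"]

    for row in run_query(ENDPOINT, QUERY):
        print(row["name"]["value"], row["person"]["value"])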


5:00pm - 5:15pm
Short Paper (10+5min) [abstract]

Creating a corpus of communal court minute books: a challenge for digital humanities

Maarja-Liisa Pilvik1, Gerth Jaanimäe1, Liina Lindström1, Kadri Muischnek1, Kersti Lust2

1University of Tartu, Estonia; 2The National Archives of Estonia, Estonia

This paper presents the work of a digital humanities project concerned with the digitization of Estonian communal court minute books. The local communal courts in Estonia came into being through the peasant laws of the early 19th century and were the first-instance, class-specific courts that tried peasants. Rather than being merely judicial institutions, the communal courts were at first institutions for the self-government of the peasants, since they also dealt with police and administrative matters. After the municipal reform of 1866, however, the communal courts were emancipated from noble tutelage and became strictly judicial institutions that tried peasants for minor offences and settled their civil disputes, claims, and family matters. The communal courts in their earlier form ceased to exist in 1918, when Estonia became independent of Russian rule.

The National Archives of Estonia holds almost 400 archives of communal courts from the pre-independence period. They have been preserved very unevenly, and not all of them include minute books. The minute books themselves are also written inconsistently: the earlier ones are often in German, and the writing depends strongly on the skills and will of the parish clerk. However, the materials from 1866 onwards, when the keeping of minute books became more systematic, are a massive and rich source shedding light on the everyday lives of the peasantry. Still, at the moment, users of the minute books face serious difficulties in finding relevant information, since there are no indexes and one has to go through all the materials manually. The minute books are also a fascinating resource for linguists, both dialectologists and computational linguists: they contain regional varieties tied to a specific genre and an early time period (making it possible to detect linguistic expressions that are rare in atlases and in the dialect corpus, which represents language from about 100 years later), while also being a written resource reflecting the writing traditions of the old spelling system. This is also what makes these texts complex and challenging for automatic analysis methods that are otherwise well established in contemporary corpus linguistics.

In our talk we present a project dealing with the digitization and analysis of the minute books from the period between 1866 and 1890. The texts were first digitized in the 2000s and stored on a server in HTML format, which is good for viewing but less suited to automatic processing. After the server crashed, the texts were rescued via web archives, and the structure of the minute books was used to convert the documents automatically into a more functional XML-based format, separating the body text with tags for titles, dates, indexes, participants, content, and topical keywords that indicate the purview of the communal courts in that period.
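A minimal sketch of this kind of conversion, assuming a very simple rescued HTML page and invented XML tag names (not the project's actual markup scheme), could use BeautifulSoup and Python's standard XML tooling:

    # Sketch: wrap rescued HTML into a simple XML record with separate elements
    # for title, date, and body text. The tag names, class names, and input
    # layout are illustrative assumptions, not the project's actual schema.
    from bs4 import BeautifulSoup
    import xml.etree.ElementTree as ET

    def html_to_xml(html):
        """Convert one HTML minute-book entry into a small XML record."""
        soup = BeautifulSoup(html, "html.parser")
        record = ET.Element("record")
        title = soup.find("h1")
        ET.SubElement(record, "title").text = title.get_text(strip=True) if title else ""
        date = soup.find(class_="date")  # hypothetical class name
        ET.SubElement(record, "date").text = date.get_text(strip=True) if date else ""
        body = ET.SubElement(record, "body")
        for para in soup.find_all("p"):
            ET.SubElement(body, "p").text = para.get_text(" ", strip=True)
        return ET.tostring(record, encoding="unicode")

    print(html_to_xml("<h1>Protokoll nr 3</h1><span class='date'>1871</span><p>...</p>"))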

We discuss the workflow of creating a digital resource in a standardized and maximally functional format, as well as challenges such as automatic text processing for cleaning and annotating the corpus in order to distinguish the relevant layers of information. To enable queries with different degrees of specificity, the texts also need to be linguistically analyzed. For both named entity recognition (NER), which enables network analysis and links the events described in the materials to geospatial locations, and morphological annotation, which makes it possible to perform queries based on lemmas or grammatical information, we have applied the Estnltk Python library, which is developed for contemporary written standard Estonian. For NER, its performance was satisfactory: it recognized names well, even though it systematically over-recognized organization names. The most complicated issue so far is the morphological analysis and disambiguation of word forms. Tools developed for Estonian morphological analysis, such as Estnltk or Vabamorf, are trained on contemporary written standard Estonian. The communal court minute books, however, contain language variants that mix dialectal language, inconsistent spelling, and the old spelling system. In the presentation, we introduce the results of our first attempts to apply Estnltk tools to the communal court minute books, the problems we have run into, and solutions for overcoming them.
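As a minimal sketch of this annotation step, assuming the Estnltk 1.x Text interface referenced below (property names may differ between Estnltk versions, and the example sentence is invented), morphological annotation and NER on a single sentence could look like this:

    # Sketch, assuming the Estnltk 1.x Text interface; property names may
    # differ in other Estnltk versions. The example sentence is invented.
    from estnltk import Text

    sentence = Text("Jaan kaebas Peetri peale vallakohtusse.")

    # Morphological annotation: lemmas and part-of-speech tags,
    # which enable lemma- and grammar-based corpus queries.
    print(list(zip(sentence.lemmas, sentence.postags)))

    # Named entity recognition, used for linking events to people and places.
    print(sentence.named_entities)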

The final aim of the project is to create a multifunctional source that could be of interest to researchers in different fields within the humanities. As the National Archives hold a considerable number of communal court minute books that so far exist only in scanned form, the digitized minute book collection is planned to be expanded through crowdsourcing.

References:

Estnltk. Open source tools for Estonian natural language processing; https://estnltk.github.io/estnltk/1.2/#.

Vabamorf. Eesti keele morfanalüsaator [‘The morphological analyzer of Estonian’]; https://github.com/Filosoft/vabamorf.


5:15pm - 5:30pm
Distinguished Short Paper (10+5min) [publication ready]

FSvReader – Exploring Old Swedish Cultural Heritage Texts

Yvonne Adesam, Malin Ahlberg, Gerlof Bouma

University of Gothenburg

This paper describes FSvReader, a tool for easier access to Old Swedish (13th–16th century) texts. Through automatic fuzzy linking of words in a text to a dictionary describing the language of the time, the reader has direct access to dictionary pop-up definitions, in spite of the large amount of graphical and morphological variation. The linked dictionary entries can also be used for simple searches in the text, highlighting possible further instances of the same entry.
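As an illustration of the general idea of fuzzy linking (using difflib from Python's standard library with an invented miniature dictionary and similarity cutoff, not FSvReader's actual handling of Old Swedish graphical and morphological variation), linking word forms to dictionary headwords could be sketched as:

    # Sketch of fuzzy linking word forms to dictionary headwords with difflib.
    # The miniature dictionary, glosses, and cutoff are invented for illustration;
    # FSvReader's actual matching handles Old Swedish variation far more carefully.
    import difflib

    DICTIONARY = {
        "konunger": "king",
        "biskoper": "bishop",
        "kirkia": "church",
    }

    def link_word(form, cutoff=0.6):
        """Return (headword, gloss) for the closest dictionary entry, or None."""
        matches = difflib.get_close_matches(form.lower(), DICTIONARY, n=1, cutoff=cutoff)
        if matches:
            return matches[0], DICTIONARY[matches[0]]
        return None

    for token in ["konungher", "kyrkia", "ok"]:
        print(token, "->", link_word(token))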