Conference Agenda

Session Overview
Location: PIV
 
Date: Wednesday, 07/Mar/2018
4:00pm - 5:30pm W-PIV-1: Infrastructure and Support
Session Chair: Tanja Säily
PIV 
 
4:00pm - 4:30pm
Long Paper (20+10min) [publication ready]

Towards an Open Science Infrastructure for the Digital Humanities: The Case of CLARIN

Koenraad De Smedt1, Franciska De Jong2, Bente Maegaard3, Darja Fišer4, Dieter Van Uytvanck2

1University of Bergen, Norway; 2CLARIN ERIC, The Netherlands; 3University of Copenhagen, Denmark; 4University of Ljubljana and Jožef Stefan Institute, Slovenia

CLARIN is the European research infrastructure for language resources. It is a sustainable home for digital research data in the humanities and it also offers tools and services for annotation, analysis and modeling. The scope and structure of CLARIN enable a wide range of studies and approaches, including comparative studies across regions, periods, languages and cultures. CLARIN does not see itself as a stand-alone facility, but rather as a player in making the vision that underlies the emerging European policies towards Open Science a reality, interconnecting researchers across national and disciplinary borders by offering seamless access to data and services in line with the FAIR data principles. CLARIN also aims to contribute to responsible data science through the design as well as the governance of its infrastructure and to achieve an appropriate and transparent division of responsibilities between data providers, technical centres, and end users. CLARIN offers training towards digital scholarship for humanities scholars and aims at increased uptake from this audience.


4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

The big challenge of data! Managing digital resources and infrastructures for digital humanities researchers

Isto Huvila

Uppsala University

Digital humanities research is dependent on the development and seizing of appropriate digital methods and technologies, collection and digitisation of data, and development of relevant and practicable research questions. In the long run, the potential of the field to sustain itself as a significant social intellectual movement (or in Kuhnian terms, paradigm) is, however, conditional on the sustainability of the scholarly practices in the field. Digital humanities research has already moved from early methodological experiments to the systematic development of research infrastructures. These efforts are based both on the explicit needs to develop new resources for digital humanities research and on the strategic initiatives of the keepers of relevant existing collections and datasets to open up their holdings for users. Harmonisation and interoperability of the evolving infrastructures are at different stages of development both nationally and internationally, but in spite of the large number of practical difficulties, the various national, European (e.g. DARIAH, CLARIN and ARIADNE) and international initiatives are making progress in this respect. The sustainability of digital infrastructures is another issue that has been scrutinised and addressed both in theory and practice under the auspices of national data archives, specialist organisations like the British Digital Curation Centre and international discussions, for instance, within the iPRES conference community. However, an aspect of the management of the infrastructures that has received relatively little attention so far is management for use. We lack a comprehensive understanding of how the emerging digital data and infrastructures are used and could be used and, consequently, of how the emanating resources should be managed to be useful for digital humanities research, not only in the context within which they were developed but also for other researchers and, in many cases, users outside academia.

This paper discusses the processes and competences for the management of digital humanities resources and infrastructures for (theoretically) maximising their current and future usefulness for the purposes of research. On the basis of empirical work on archaeological research data in the context of the Swedish Archaeological Information in the Digital Society (ARKDIS) research project (Huvila, 2014) and a comparative study with selected digital infrastructures in other branches of humanities research, a model of use-oriented management of research data with central processes and competences is presented. The suggested approach complements existing digital curation and management models by opening up the user-side processes of digital humanities data resources and their implications for the functioning, development and management of appropriate research infrastructures. Theoretically, the approach draws on the records continuum theory (as formulated by Upward and colleagues (e.g. Upward, 1996, 1997, 2000; McKemmish, 2001)) and Pickering’s notion of the mangle of practice (Pickering, 1995) developed in the context of the social studies of science. The model demonstrates the significance of being sensitive not only to the explicit wants and needs of the researchers (users) but also to the implicit, often tacit requirements that emerge from their practical research work. At the same time, the findings emphasise the need for a meta-competence to manage the data and provide appropriate services for its users.

References

Huvila, I. (Ed.) (2014). Perspectives to Archaeological Information in the Digital Society. Uppsala: Department of ALM, Uppsala University.

URL http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-240334

McKemmish, S. (2001). Placing Records Continuum Theory and Practice. Archival Science, 1(4), 333–359.

URL http://dx.doi.org/10.1023/A:1016024413538

Pickering, A. (1995). The Mangle of Practice: Time, Agency, and Science. Chicago: University of Chicago Press.

Upward, F. (1996). Structuring the Records Continuum Part One: Postcustodial Principles and Properties. Archives and Manuscripts, 24(2), 268–285.

Upward, F. (1997). Structuring the Records Continuum, Part Two: Structuration Theory and Recordkeeping. Archives and Manuscripts, 25(1), 10–35.

Upward, F. (2000). Modelling the continuum as paradigm shift in recordkeeping and archiving processes, and beyond–a personal reflection. Records Management Journal, 10(3), 115–139.


4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

Research in Nordic literary collections: What is possible and what is relevant?

Mads Rosendahl Thomsen1, Kristoffer Laigaard Nielbo2, Mats Malm3

1Aarhus University; 2University of Southern Denmark; 3University of Gothenburg

There are a growing number of digital literary collections in the Nordic countries that make the literary heritage accessible and have great potential for research that takes advantage of machine readable texts. These collections range from very large collections such as the Norwegian Bokhylla, medium-sized collections such as the Swedish Litteraturbanken and the Danish Arkiv for Dansk Litteratur, to one-author collections, e.g. the collected works of N.F.S. Grundtvig. In this presentation we will discuss some of the obstacles for a more widespread use of these collections by literary scholars and present outcomes of a series of seminars – UCLA 2015, Aarhus 2016, UCLA 2017 – sponsored by the Fondation Maison des sciences de l’homme courtesy of a grant from the Andrew Carnegie Mellon Foundation.

We find that there are two important thresholds in the use of collections:

1) The technical obstacles to collecting the right corpora and applying the appropriate tools for analysis are too high for the majority of researchers working in literary studies. While much has been done to advance access to works, differences in formats and metadata make it difficult to work across the collections. Our project has addressed this issue by creating a Nordic GitHub repository for literary texts, CLEAR, which provides cleaned versions of Nordic literary works, as well as a suite of tools in Python.

2) The capacity to combine traditional hermeneutical approaches to literary studies with computational approaches is still in its infancy despite numerous good studies from the past years, e.g. by Stanford Literary Lab, Leonard and Tangherlini and Ted Underwood. In our series of seminars we have worked to bring together scholars with great technical prowess and more traditionally trained literary scholars to generate projects that are technically feasible and scholarly relevant. The process of expanding the methodological vocabulary of literary studies is complicated and requires significant domain expertise to verify the outcome of computational analyses, and conversely, openness to work with results that cannot be verified by close readings. In this presentation we will show how thematic variation and readability can provide new perspectives on Swedish and Danish modernist literature, and discuss how this relates to more general visions of literary studies in an age of computation (Heise, Thomsen).

Literature

Algee-Hewitt, Mark et al. 2016. “Canon/Archive. Large-scale Dynamics in the Literary Field.” Stanford Literary Lab Pamphlet 11.

Heise, Ursula. 2017. “Comparative literature and computational criticism: A conversation with Franco Moretti.” Futures of Comparative Literature: ACLA State of the Discipline Report. London: Routledge, 2017.

Leonard, Peter and Timothy R. Tangherlini. 2013. “Trawling in the Sea of the Great Unread: Sub-Corpus Topic Modeling and Humanities Research”. Poetics 41(6): 725-749.

Thomsen, Mads Rosendahl et al. 2015. “No Future without Humanities.” Humanities 1.

Underwood, Ted. 2013. Why Literary Periods Mattered. Stanford: Stanford University Press.


5:00pm - 5:30pm
Long Paper (20+10min) [publication ready]

Reassembling the Republic of Letters - A Linked Data Approach

Jouni Tuominen1,2, Eetu Mäkelä1,2, Eero Hyvönen1,2, Arno Bosse3, Miranda Lewis3, Howard Hotson3

1Aalto University, Semantic Computing Research Group (SeCo); 2University of Helsinki, HELDIG – Helsinki Centre for Digital Humanities; 3University of Oxford, Faculty of History

Between 1500 and 1800, a revolution in postal communication allowed ordinary men and women to scatter letters across and beyond Europe. This exchange helped knit together what contemporaries called the respublica litteraria, Republic of Letters, a knowledge-based civil society, crucial to that era’s intellectual breakthroughs, and formative of many modern European values and institutions. To enable effective Digital Humanities research on the epistolary data distributed in different countries and collections, metadata about the letters have been aggregated, harmonised, and provided for the research community through the Early Modern Letters Online (EMLO) service. This paper discusses the idea and benefits of using Linked Data as a basis for the next digital framework of EMLO, and presents experiences of a first demonstrational implementation of such a system.

 

 
Date: Thursday, 08/Mar/2018
11:00am - 12:30pm T-PIV-1: Newspapers
Session Chair: Mats Malm
PIV 
 
11:00am - 11:30am
Long Paper (20+10min) [publication ready]

A Study on Word2Vec on a Historical Swedish Newspaper Corpus

Nina Tahmasebi

Göteborgs Universitet

Detecting word sense changes can be of great interest in the field of digital humanities. Thus far, most investigations and automatic methods have been developed and carried out on English text, and most recent methods make use of word embeddings. This paper presents a study on using Word2Vec, a neural word embedding method, on a Swedish historical newspaper collection. Our study includes a set of 11 words and our focus is the quality and stability of the word vectors over time. We investigate to what extent a word embedding method like Word2Vec can be used effectively on texts where the volume and quality are limited.
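
One way to make the notion of vector "stability" concrete is to compare a word's nearest neighbours in embeddings trained on two time slices. The following is only an illustrative sketch and not the procedure used in the paper: the toy two-dimensional vectors, the function names `nearest_neighbours` and `neighbour_overlap`, and the choice of Jaccard overlap as a stability score are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, vectors, k):
    """Return the k words whose vectors are most similar to `word`."""
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return {other for _, other in scored[:k]}


def neighbour_overlap(word, vectors_a, vectors_b, k=2):
    """Jaccard overlap of the word's k nearest neighbours in two time slices."""
    a = nearest_neighbours(word, vectors_a, k)
    b = nearest_neighbours(word, vectors_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented toy embeddings for two time slices (not real model output).
slice_early = {
    "nyhet": [1.0, 0.1],
    "tidning": [0.9, 0.2],
    "blad": [0.8, 0.3],
    "skepp": [0.1, 1.0],
}
slice_late = {
    "nyhet": [1.0, 0.1],
    "tidning": [0.9, 0.2],
    "blad": [0.1, 0.9],
    "skepp": [0.7, 0.4],
}

stability = neighbour_overlap("nyhet", slice_early, slice_late, k=2)
print(f"neighbour overlap for 'nyhet': {stability:.2f}")
```

A score near 1 would indicate that the word keeps the same neighbours across slices, and a score near 0 that its neighbourhood has changed or that the vectors are too noisy to compare. With real data the vectors would come from models trained separately on each period.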


11:30am - 11:45am
Short Paper (10+5min) [abstract]

A newspaper atlas: Named entity recognition and geographic horizons of 19th century Swedish newspapers

Erik Edoff

Umeå University

What was the outside world for 19th century newspaper readers? That is the overarching problem investigated in this paper. One way of approaching this issue is to investigate which geographical places were mentioned in the newspapers, and how frequently. To be sure, newspapers were not the only medium that contributed to 19th century readers’ notion of the outside world. Public meetings, novels, sermons, edicts, travelers, photography, and chapbooks are other forms of media that people encountered with growing regularity during the century; however, newspapers often covered the sermons, printed lists of travelers and attracted readers with serial novels. This means that these other media are, at least to some extent, covered in the newspapers’ columns. And after all, newspapers were easier to collect and archive than a public meeting, which makes them an accessible source for the historian.

Two newspapers, digitized by the National Library of Sweden, are analyzed: Tidning för Wenersborgs stad och län (TW) and Aftonbladet (AB). They are chosen based on the different geographical and demographical conditions of their places of publication as well as the papers’ size and circulation. TW was founded in 1848 in the town of Vänersborg, located on the western shore of Lake Vänern, which was connected with the west coast port, Göteborg, by the Trollhätte canal, established in 1800. The newspaper was published in about 500 copies once a week (twice a week from 1858) and addressed a local and regional readership. AB was a daily paper founded in Stockholm in 1830 and was soon to become the leading liberal paper of the Swedish capital, with a great impact on national political discourse. For its time, it was widely circulated (between 5,000 and 10,000 copies) in both Stockholm and the country as a whole. Stockholm was an important seaport on the eastern coast. These geographic distinctions probably entail interesting differences in the papers’ respective outlooks. Steamboats revolutionized travelling during the first half of the century, but their glory days had passed by around 1870, when they were replaced by railways as the most prominent means of transporting people.

This paper compares the geographies of the two newspapers by analyzing the places mentioned in the periods 1848–1859 and 1890–1898. The main railroads of Sweden were constructed during the 1860s, and the selected years therefore cover newspaper geographies before and after the railroads.

The main questions the paper addresses relate to media history and the history of media infrastructure. During the second half of the 19th century several infrastructure technologies were introduced and developed (electric telegraph, postal system, newsletter corporations, railways, telephony, among others). The hypothesis is that these technologies had an impact on the newspapers’ geographies. The media technologies enabled information to travel great distances in short timespans, which could have had homogenizing effects on newspaper content, as much traditional research suggests (Terdiman 1999). On the other hand, digital historical research has shown that the development of railroads changed the geography of Houston newspapers, increasing the importance of the near region rather than concentrating geographic information to national centers (Blevins 2014).

The goal of the study is, in other words, to investigate what the infrastructural novelties introduced during the course of the 19th century, as well as the different geographic and demographic conditions, meant for the view of the outside world, or the imagined geographies, provided by newspapers. The aim of the paper is therefore twofold: (1) to investigate a historical-geographical problem relating to newspaper coverage and infrastructural change and (2) to try out the use of Named Entity Recognition on Swedish historical newspaper data.

Named Entity Recognition (NER) software is designed to locate and tag entities, such as persons, locations, and organizations. This paper uses SweNER to mine the data for locations mentioned in the text (Kokkinakis et al. 2014). Earlier research has emphasized the problems with poor OCR scanning of historical newspapers. A picture of a newspaper page is read by OCR software and converted into a text file. The result contains many misreadings and therefore a considerable amount of noise (Jarlbrink & Snickars 2017). This is a big obstacle when working with digital tools on historical newspapers. Some earlier research has used and evaluated the performance of different NER tools on digitized historical newspapers, also underlining OCR errors as the main problem with using NER on such data (Kettunen et al. 2017). SweNER has also been evaluated in tagging named entities in historical Swedish novels, where the OCR problems are negligible (Borin et al. 2007). This paper, however, does not evaluate the software’s results in a systematic way, even though some important biases have been identified by going through the tagging of some newspaper copies manually. Some important geographic entities are not tagged by SweNER at all (e.g. Paris, Wien [Vienna], Borås and Norge [Norway]). SweNER is able to pick up some OCR misreadings, although many recurring ones (e.g. Lübeck read as Liibeck, Liibcck, Ltjbeck, Ltlbeck) are not tagged by SweNER. These problems can be handled, at least to some degree, by using “leftovers” from the data (wrongly spelled words) that were not matched in a comparison corpus. I have manually scanned the 50,000 most frequently mentioned words that were not matched in the comparison corpus, looking for wrongly spelled names of places. I ended up with a list of around 1,000 places and some 2,000 spelling variations (e.g. over 100 ways of spelling Stockholm). This manually constructed list could be used as a gazetteer, complementing the NER result and giving a more accurate picture of the 19th century newspaper geographies.
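
The idea of a gazetteer complementing the NER output can be sketched as a lookup table from observed spellings to canonical place names. This is a minimal illustration and not the author's implementation: the `GAZETTEER` contents are limited to the examples named in the abstract, and the function name `count_places` and the representation of NER output as a set of token positions are assumptions made for the example (SweNER's actual output format is not shown here).

```python
from collections import Counter

# Hypothetical gazetteer: observed spelling (including OCR variants)
# mapped to a canonical place name. Entries are taken from the examples
# in the abstract; a real list would hold around 2,000 variants.
GAZETTEER = {
    "Liibeck": "Lübeck",
    "Liibcck": "Lübeck",
    "Ltjbeck": "Lübeck",
    "Ltlbeck": "Lübeck",
    "Paris": "Paris",
    "Wien": "Wien",
    "Borås": "Borås",
    "Norge": "Norge",
}


def count_places(tokens, ner_positions, gazetteer):
    """Count place mentions from NER hits plus gazetteer lookups.

    tokens:        list of token strings from one OCR'd text
    ner_positions: set of token indices tagged as locations by an NER tool
                   (assumed format, for illustration only)
    gazetteer:     dict mapping spelling variants to canonical names
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token in gazetteer:
            # The gazetteer normalises the spelling, whether or not NER tagged it.
            counts[gazetteer[token]] += 1
        elif i in ner_positions:
            # NER hit with no gazetteer entry: keep the surface form.
            counts[token] += 1
    return counts


# Invented example sentence; index 3 ("Stockholm") is assumed to be tagged by NER.
tokens = ["Från", "Liibeck", "till", "Stockholm", "och", "Paris", ",", "sedan", "Ltlbeck"]
places = count_places(tokens, ner_positions={3}, gazetteer=GAZETTEER)
print(places.most_common())
```

Giving the gazetteer priority over the NER tag means that a misread form tagged by the NER tool is still counted under its canonical name rather than as a separate place.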

REFERENCES

Blevins, C. (2014), “Space, nation, and the triumph of region: A view on the world from Houston”, Journal of American History, Vol. 101, no 1, pp. 122–147.

Borin, L., Kokkinakis, D., and Olsson, L-G. (2007), “Naming the past: Named entity and animacy recognition in 19th century Swedish literature”, Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pp. 1–8, available at: http://spraakdata.gu.se/svelb/pblctns/W07-0901.pdf (accessed October 31 2017).

Jarlbrink, J. and Snickars, P. (2017), “Cultural heritage as digital noise: Nineteenth century newspapers in the digital archive”, Journal of Documentation, Vol. 73, no 6, pp. 1228–1243.

Kettunen, K., Mäkelä, E., Ruokolainen, T., Kuokkala, J., and Löfberg, L. (2017), “Old content and modern tools: Searching named entities in a Finnish OCRed historical newspaper collection 1771–1910”, Digital Humanities Quarterly, (preview) Vol. 11, no 3.

Kokkinakis, D., Niemi, J., Hardwick, S., Lindén, K., and Borin, L. (2014), “HFST-SweNER – A new NER resource for Swedish”, Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC), Reykjavik 26–31 May 2014, pp. 2537–2543.

Terdiman, R. (1999) “Afterword: Reading the news”, Making the news: Modernity & the mass press in nineteenth-century France, Dean de la Motte & Jeannene M. Przyblyski (eds.), Amherst: University of Massachusetts Press.


11:45am - 12:00pm
Short Paper (10+5min) [abstract]

Digitised newspapers and the geography of the nineteenth-century “lingonberry rush” in Finland

Matti La Mela

History of Industrialization & Innovation group, Aalto University

This paper uses digitised newspaper data for analysing practices of nature use. In the late nineteenth century, a “lingonberry rush” developed in Sweden and Finland due to the growing foreign demand for and exports of lingonberries. The Finnish newspapers carefully followed the events in neighbouring Sweden, and reported on the export tonnage and the economic potential this red gold could also have for Finland. The paper is interested in the geography of this “lingonberry rush”, and explores how the imprecise geographic information about berry-picking can be gathered and structured from the digitised newspapers (metadata and NER). The paper distinguishes between unique and original news, and longer chains of news reuse. The paper makes use of open tools for named-entity recognition and text reuse detection. This geospatial analysis adds to the reinterpretation of the history of the Nordic allemansrätten, a tradition of public access to nature, which allows everyone to pick wild berries today; the circulation of commercial news on lingonberries in the nineteenth century reinforced the idea of berries as a commodity, and ultimately made it easier to portray wild berries as an openly accessible resource.


12:00pm - 12:15pm
Short Paper (10+5min) [abstract]

Sculpting Time: Temporality in the Language of Finnish Socialism, 1895–1917

Risto Turunen

University of Tampere

In 1907 the Grand Duchy of Finland had, in relative terms, the largest socialist party in Europe. The breakthrough of Finnish socialism has not yet been analyzed from the perspective of ‘temporality’, that is, the way human beings experience time. Socialists constructed their own versions of the past, present and future that differed from competing Christian and nationalist perceptions of time. This paper examines socialist experiences and expectations primarily through a quantitative analysis of Finnish handwritten and printed newspapers. Three main questions will be addressed by combining traditional conceptual-historical approaches with the corpus-linguistic methods of collocation, keyness and key collocation. First, what is the relation of the past, present and future in the language of socialism? Second, what are the key differences between socialist temporality and non-socialist temporalities of the time? Third, how do the actual revolutionary moments of 1905 and 1917 affect socialist temporality and vice versa: did the revolution of time in the consciousness of the people lead to the time of revolution in Finland? The hypothesis is that identifying the changes in temporality will improve our historical understanding of the political ruptures in Finland in the early twentieth century. The results will be compared to Reinhart Koselleck’s theory of the ‘temporalization of concepts’ (expectations towards the future supersede experiences of the past in modernity), and to Benedict Anderson’s theory of ‘imagined communities’, which suggests that the advance of print capitalism transformed temporality from vertical to horizontal. The paper forms a part of my on-going dissertation project, which merges close reading of archival sources with computational distant reading of digital materials, thus producing a macro-scale picture of the political language of Finnish socialism.


12:15pm - 12:30pm
Short Paper (10+5min) [abstract]

Two cases of meaning change in Finnish newspapers, 1820-1910

Antti Kanner

University of Helsinki

In Finland the 19th century saw the formation of a number of state institutions that came to define the political life of the Grand Duchy and of the subsequent independent republic. Alongside legal, political, economic and social institutions and organisations, Modern Finnish, as an institutionally standardised language, can be seen in this context as one of these institutions. As the majority of residents of Finland were native speakers of Finnish dialects, adopting Finnish was necessary for the state’s purposes in extending its influence within the borders of the autonomous Grand Duchy. Obviously, the widening domains of use of Finnish also played an important role in the development of Finnish national identity. In the last quarter of the 19th century Finnish started to gain ground as the language of administrative, legal and political discourses alongside Swedish. It is in this period that we find the crucial conceptual processes that shaped Finnish political history well into the 20th century.

In this paper I will present two related case studies from my doctoral research, where I seek to understand the semantic similarity scores of so-called Semantic Vector Spaces obtained from large historical corpora in terms of linguistic semantics. As historical corpora are collections of past speech acts, the view they provide of the changing meanings of words is influenced as much by pragmatic factors and writers’ intentions as by synchronic semantics. Understanding and explicating the historical context of observed processes is essential when studying temporal dynamics in semantic changes. To this end, I will reflect on the theoretical side of my work in the light of cases of historical meaning change. My research falls under the heading of Finnish Language, but is closely related to history and computational linguistics.

The main data for my research comes from the National Library of Finland’s Newspaper Collection, which I use via the KORP service API provided by the Language Bank of Finland. The collection accessible via the API contains nearly all newspapers and periodicals published in Finland from 1771 to 1910. The collection is, however, very heterogeneous, as the press and other forms of printed public discourse in Finnish only developed in Finland during the 19th century. Historical variation in conventions of typesetting, editing and orthography, as well as in the quality of the paper used for printing, makes it very difficult for OCR systems to recognize characters with 100 percent accuracy. Kettunen et al. (2014) estimated that OCR accuracy is actually somewhere between 60 and 80 percent. However, not all problems in the automatic recognition of the data come from OCR problems or even historical spelling variation. Much is also due to linguistic factors: the 19th century saw large-scale dialectal, orthographical and lexical variation in written Finnish. To exemplify the scale of variation, when a morphological analyser for Modern Finnish (OMORFI, Pirinen 2015) was used, it could only parse around 60 percent of the wordlist of the Corpus of Early Modern Finnish (CEMF).

Because of the unreliability of results from the automated parser and the temporal heterogeneity inherent in the data, conducting the study with a methodology robust to these kinds of problems poses a challenge. The approach chosen was to use a number of analyses and see whether their results could be combined to produce a coherent view of the historical change in word use. In addition, simpler and more robust analyses were chosen instead of more advanced and elaborate ones. For example, an analysis similar to topic modelling was conducted using second order collocations (Bertels & Speelman 2014 and Heylen, Wielfaerts, Speelman, Geeraerts 2014) instead of algorithms like LDA (Blei, Ng & Jordan 2003) that are widely used for this purpose. This was because the data contains a highly inflated count of individual types and lemmas resulting from the problems with OCR and morphological analysis. It seemed that, in this specific case at least, LDA was not able to produce sensible topics because the number of hapax legomena per text was so high. The analysis based on second order collocations aimed not at producing a model of a system of topics, as LDA does, but simply at clustering the studied word’s collocates based on their respective similarities. Also, when tracking changes in words’ syntactic positioning tendencies, simple morphological case distribution was used instead of resource-intensive syntactic parsing, which is also sensitive to errors in the data. When the task is to track signals of change, morphological case distributions can be used as sufficient proxies for dependency distributions. This can be done on the grounds that case selection in Finnish is mostly governed by syntax, as case is used to express syntactic relations between, for example, the constituents of nominal phrases or a predicate verb and its arguments (Vilkuna 1989).
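
Using case distributions as a proxy for syntactic change can be sketched as follows: for each period, count the morphological cases in which the studied word occurs, normalise the counts, and compare the two distributions. This is an illustrative sketch under stated assumptions, not the author's workflow: the case tags are invented toy data, the function names are hypothetical, and total variation distance is chosen here only as one simple way of quantifying the difference.

```python
from collections import Counter


def case_distribution(case_tags):
    """Relative frequency of each morphological case in a list of tags."""
    counts = Counter(case_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {case: n / total for case, n in counts.items()}


def total_variation(dist_a, dist_b):
    """Total variation distance between two case distributions (0 = identical)."""
    cases = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(c, 0.0) - dist_b.get(c, 0.0)) for c in cases)


# Invented case tags for occurrences of one word in two periods.
early_tags = ["inessive"] * 6 + ["nominative"] * 2 + ["genitive"] * 2
late_tags = ["inessive"] * 2 + ["nominative"] * 5 + ["genitive"] * 3

early = case_distribution(early_tags)
late = case_distribution(late_tags)
shift = total_variation(early, late)
print(f"case distribution shift: {shift:.2f}")
```

In practice the tags would come from a morphological analyser, and occurrences the analyser cannot parse would have to be excluded or tracked separately, since they are not missing at random.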

The first of my case studies focuses on the Finnish word maaseutu. Maaseutu denotes a rural area but is in Modern Finnish mostly used as a collective singular referring to the rural as a whole. It is most commonly used as an opposite to the urban, which is often lexicalised as kaupunki, the city, used in a similar collective meaning. However, after its introduction to Finnish in the 1830s, maaseutu was used in a variety of related meanings, mostly referring to specific rural areas or communities, until the turn of the century, when the collective singular sense had become dominant. Starting roughly from the 1870s, however, there seems to have been a period of contesting uses. At that time we find a number of vague cases where the generic or collective and the specific meanings overlap.

Combining information from my analysis with newspaper metadata yields an image of a dynamic situation. The emergence of the collective singular stands out clearly and is connected to an accompanying discourse of negotiating urban-rural relations on a national instead of a regional level. This change can be pinpointed quite precisely to the 1870s and to the newspapers with geographically wider circulation and a more national identity.

The second word of interest is vaivainen, an adjective referring to a person or a thing either being of wretched or inadequate quality or suffering from a physical or mental ailment. When used as a noun, it refers to a person of very low and excluded social status and extreme poverty. In this meaning the word appears in Modern Finnish mostly in poetically archaic or historical contexts, and it had already disappeared from the vocabulary of social policy and social legislation in the early 20th century. The word has a biblical background, being used in older Finnish Bible translations, for example in the Sermon on the Mount (as the equivalent of poor in Matt. 5:3, “blessed are the poor in spirit”), and as such was a natural choice for naming the recipients of church charities. When the state poverty relief system started to take form in the mid-19th century, it built on top of earlier church organizations (Von Aerschot 1996) and the church terminology was carried over to the state institutions.

When tracking the contexts of the word over the 19th century using context word clusters based on second order collocations, two clear discursive trends appear. First, the poverty relief discourse, which is already pronounced in the data in the 1860s, disperses into a complex network of different topics and discursive patterns. As the state-run poverty relief institutions become more complex and more efficiently administered, the moral foundations of the whole enterprise are discussed alongside reports of the everyday comings and goings of individual institutions or, indeed, tales of individual relief recipients’ fortunes. The other trend involves the presence of religious or spiritual discourse which, against preliminary assumptions, does not wane into the background but experiences a strong surge in the 1870s and 1880s. This can be explained in part by the growth of revivalist Christian publications in the National Library Corpus, but also by the intrusion of Christian connotations into the political discussion on the poverty relief system. It is as if the word vaivainen functions as a kind of lightning rod of Christian morality into the public poverty relief discourse.
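
Clustering context words by second order collocations can be sketched in a few steps: collect the collocates of the target word, build a collocation profile for each collocate, and group collocates whose profiles are similar. The sketch below is a simplified illustration rather than the procedure of the cited studies: the toy token sequence is invented, the function names are hypothetical, raw co-occurrence counts stand in for an association measure, and the greedy threshold grouping stands in for a proper clustering algorithm.

```python
import math
from collections import Counter


def collocates(tokens, target, window=1):
    """Count the words occurring within `window` tokens of each target occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_collocates(tokens, target, window=1, threshold=0.5):
    """Greedily group the target's collocates by similarity of their own collocates."""
    first_order = collocates(tokens, target, window)
    profiles = {}
    for word in first_order:
        profile = collocates(tokens, word, window)
        # Drop the target itself: every collocate shares it by construction.
        profile.pop(target, None)
        profiles[word] = profile
    clusters = []
    for word in sorted(profiles):
        for cluster in clusters:
            if any(cosine(profiles[word], profiles[o]) >= threshold for o in cluster):
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


# Invented toy sequence: pappi 'priest', kirkko 'church', seurakunta 'parish',
# vero 'tax', kunta 'municipality'.
tokens = ["pappi", "kirkko", "vaivainen", "seurakunta", "pappi",
          "vero", "kunta", "vaivainen", "kunta", "vero"]
clusters = cluster_collocates(tokens, "vaivainen", window=1, threshold=0.5)
print(clusters)
```

Because the grouping relies only on co-occurrence counts around a single word, it does not need a full topic model of the corpus, which is the property the abstract appeals to when the vocabulary is inflated by OCR noise.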

While the methodological contributions of this paper are not highly ambitious in terms of the language technology or computational algorithms used, the selection of analyses presents an innovative approach to Digital Humanities. The aim here has been to combine not just one, but an array of simple and robust methods from computational linguistics with the theoretical background and analytical concepts of lexical semantics. I argue that the robustness and simplicity of the methods make the overall workflow more transparent, and this transparency makes it easier to interpret the results in a wider historical context. This makes it possible to ask questions whose relevance is not confined to computational linguistics or lexical semantics, but expands to wider areas of Humanities scholarship. This shared relevance of questions and answers, to my understanding, lies at the core of Digital Humanities.

References

Bertels, A. & Speelman, D. (2014). “Clustering for semantic purposes. Exploration of semantic similarity in a technical corpus.” Terminology 20:2, pp. 279–303. John Benjamins Publishing Company.

Blei, D., Ng, A. Y. & Jordan, M. I. (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (4–5), pp. 993–1022.

CEMF, Corpus of Early Modern Finnish. Centre for Languages in Finland. http://kaino.kotus.fi

Heylen, C., Peirsman Y., Geeraerts, D. & Speelman, D. (2008). “Modelling Word Similarity: An Evaluation of Automatic Synonymy Extraction Algorithms.” Proceedings of LREC 2008.

Huhtala, H. (1971). Suomen varhaispietistien ja rukoilevaisten sanankäytöstä: semanttis-aatehistoriallinen tutkimus. [On the vocabulary of the early Finnish pietist and revivalist movements]. Suomen Teologinen Kirjallisuusseura.

Kettunen, K., Honkela, T., Lindén, K., Kauppinen, P., Pääkkönen, T. & Kervinen, J. (2014). “Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods.” In IFLA World Library and Information Congress Proceedings: 80th IFLA General Conference and Assembly. Lyon, France.

Pirinen, T. (2015). “Omorfi: Free and open source morphological lexical database for Finnish.” In Proceedings of the 20th Nordic Conference of Computational Linguistics NODALIDA 2015.

Vilkuna, M. (1989). Free word order in Finnish: Its syntax and discourse functions. Suomalaisen Kirjallisuuden Seura.

Von Aerschot, P. (1996). Köyhät ja laki: toimeentukilainsäädännön kehittyminen kehitys oikeudellistusmisprosessien valossa. [The poor and the law: development of Finnish welfare legislation in light of juridification processes.] Suomalainen Lakimiesyhdistys.

 
2:00pm - 3:30pm T-PIV-2: Authorship
Session Chair: Jani Marjanen
PIV 
 
2:00pm - 2:30pm
Long Paper (20+10min) [abstract]

Extracting script features from a large corpus of handwritten documents

Lasse Mårtensson1, Anders Hast2, Ekta Vats2

1Högskolan i Gävle, Sweden; 2Uppsala universitet, Sweden

Before the advent of the printing press, the only way to create a new piece of text was to produce it by hand. The medieval text culture was almost exclusively a handwritten one, even though printing began towards the very end of the Middle Ages. As a consequence of this, the medieval text production is very much characterized by variation of various kinds: regarding language forms, regarding spelling and regarding the shape of the script. In the current presentation, the shape of the script is in focus, an area referred to as palaeography. The introduction of computers has changed this discipline radically, as computers can handle very large amounts of data and furthermore measure features that are difficult to deal with for a human researcher.

In the current presentation, we will demonstrate two investigations within digital palaeography, carried out on the medieval Swedish charter corpus in its entirety, to the extent that this has been digitized. The script in approximately 14 000 charters has been measured and accounted for, regarding aspects described below. The charters are primarily in Latin and Old Swedish, but there are also a few in Middle Low German. The overall purpose for the investigations is to search for script features that may be significant from the perspective of separating one scribe from another, i.e. scribal attribution. As the investigations have been done on the entire available charter corpus, it is possible to visualize how each separate charter relates to all the others, and furthermore to see how the charters may divide themselves into clusters on the basis of similarity regarding the investigated features.

Both investigations focus on aspects that have been regarded as significant for scribal attribution, but that are very difficult to measure with any degree of precision without the aid of computers. One of the investigations belongs to a set of methods often referred to as Quill Features. This method focuses, as the name states, on how the scribe has moved the pen over the script surface (parchment or paper). The medieval pen, the quill, consisted of a feather that had been hardened, truncated and split at the tip. This construction created variation in the width of the strokes constituting the script, depending mainly on the direction in which the pen was moved, and also on the angle at which the scribe held the pen. This is what the method measures: the variation between thick and thin strokes in relation to the angle of the pen. The method has been used on medieval Swedish material before, namely a medieval Swedish manuscript (Cod. Ups. C 61, 1104 pages), but the current investigation covers ten times as much material as the previous one. Furthermore, we employ a new type of evaluation of the results (see below) that, to our knowledge, has not been done before.

The second investigation focuses on the relations between script elements of different height, and the proportions between these. For instance, three different formations can be discerned among the vertical script elements: minims (e.g. in ‘i’, ‘n’ and ‘m’), ascenders (e.g. in ‘b’, ‘h’ and ‘k’) and descenders (e.g. in ‘p’ and ‘q’). The ascender can extend to varying degrees above the minim, and the descender can extend to varying degrees below the minim, creating different proportions between the components. These measures have also been extracted from the entire available medieval Swedish charter corpus, and they yield very interesting information from the perspective of scribal identity. It should be noted that the first line of a charter often diverges from the rest of the charter in this respect, as the ascenders here often extend higher than elsewhere. Similarly, the descenders of the last line of a charter often extend further below the line than in the rest of the charter. To obtain a representative measure from a charter, these two lines must be disregarded.
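The proportion measure described above can be illustrated with a minimal sketch. It assumes that per-line heights have already been extracted from the charter images; the function name `script_proportions`, the dictionary keys and the pixel values are all hypothetical and are not taken from the investigation itself.

```python
from statistics import mean


def script_proportions(lines):
    """Return mean (ascender/minim, descender/minim) ratios for one charter.

    `lines` is a list of dicts with hypothetical keys "minim", "ascender" and
    "descender": per-line mean heights in pixels, top to bottom of the charter.
    The first and last lines are dropped, since they are often atypical.
    """
    if len(lines) < 3:
        raise ValueError("need at least three lines to drop the first and last")
    body = lines[1:-1]
    ascender_ratio = mean(row["ascender"] / row["minim"] for row in body)
    descender_ratio = mean(row["descender"] / row["minim"] for row in body)
    return ascender_ratio, descender_ratio


# Invented measurements for a four-line charter (illustration only).
charter = [
    {"minim": 10.0, "ascender": 30.0, "descender": 8.0},   # elongated first line
    {"minim": 10.0, "ascender": 15.0, "descender": 8.0},
    {"minim": 10.0, "ascender": 15.0, "descender": 8.0},
    {"minim": 10.0, "ascender": 15.0, "descender": 20.0},  # elongated last line
]
ascender_ratio, descender_ratio = script_proportions(charter)
```

With the first and last lines excluded, the elongated ascenders and descenders in those lines do not distort the ratios for the charter as a whole.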

One of the problems when investigating individual scribal habits in medieval documents is that we rarely know for certain who has produced them, which makes the evaluation difficult. In most cases, the scribe of a given document is identified through a process of scribal attribution, usually based on palaeographical and linguistic evidence. In an investigation of individual scribal features, it is not desirable to evaluate the results on the basis of previous attributions. Ideally, the evaluation should be done on charters where the identity of the scribe can be established on external features, where his/her identity is in fact known. For this purpose, we have identified a set of charters where this is actually the case, namely where the scribe himself/herself explicitly states that he/she has held the pen (in our corpus, there are only male scribes). These charters contain a so-called scribal note, containing the formula ego X scripsi (‘I X wrote’), accompanied by a symbol unique to this specific scribe. One such scribe is Peter Tidikesson, who produced 13 charters with such a scribal note in the period 1432–1452, and another is Peter Svensson, who produced six charters in the period 1433–1453. This selection of charters is the means by which the otherwise big-data-focused, computer-aided methods can be evaluated from a qualitative perspective. This step of evaluation is crucial in order for the results to become accessible and useful for the users of the information gained.


2:30pm - 3:00pm
Long Paper (20+10min) [abstract]

Text Reuse and Eighteenth-Century Histories of England

Ville Vaara1, Aleksi Vesanto2, Mikko Tolonen1

1University of Helsinki; 2University of Turku

Introduction

What kind of history is Hume’s History of England? Is it an impartial account or is it part of a political project? To what extent was it influenced by seventeenth-century Royalist authors? These questions have been asked since the first Stuart volumes were published in the 1750s. The consensus is that Hume’s use of Royalist sources left a crucial mark on his historical project. However, as Mark Spencer notes, Hume did not only copy from Royalists or Tories. One aim of this paper is to weigh these claims against our evidence about Hume’s use of historical sources. To do this we qualified, clustered and compared 129,646 instances of text reuse in Hume’s History. Additionally, we are able to compare Hume’s History of England to other similar undertakings in the eighteenth century and get an accurate view of their composition. We aim to extend the discussion of Hume's History by applying computational methods to understanding the writing of the history of England in the eighteenth century as a genre.

This paper contributes to the overall development of Digital Humanities by demonstrating how digital methods can help develop and move forward discussion in an existing research case. We do not limit ourselves to general method development, but rather contribute to the specific discussions on Hume’s History and the study of eighteenth-century histories.

Methods and sources

This paper advances our understanding of the composition of Hume’s History by examining the direct quotes in it based on data in Eighteenth-Century Collections Online (ECCO). It should be noted that ECCO also includes central seventeenth-century histories and other important documents reprinted later. Thus, we include not only eighteenth-century sources but also, for example, Clarendon, Rushworth and other notable seventeenth-century historians. We compare the phenomenon of text reuse in Hume’s History to that in the works of Rapin, Guthrie and Carte, all prominent historians at the time. To our knowledge, this kind of text mining effort has not previously been undertaken in the field of historiography.

Our base-text for Hume is the 1778 edition of History of England. For Paul de Rapin we used the 1726-32 edition of his History of England. For Thomas Carte the source was the 1747-1755 edition of his General History of England. And for William Guthrie we used the 1744-1751 edition of his History of Great Britain.

As a starting point for our analysis, we used a dataset of linked text-reuse fragments found in ECCO. The basic idea was to create a dataset that identifies similar sequences of characters (from circa 150 to more than 2000 characters each) instead of trying to match individual characters or tokens/words. This helped with the optical character recognition problems that plague ECCO. The methodology has previously been used in matching DNA sequences, where the problem of noisy data is likewise present. We further enriched the results with bibliographical metadata from the English Short Title Catalogue (ESTC). This enriching allows us to compare the publication chronology and locations, and to create rough estimates of first edition publication dates.
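The idea of comparing longer character sequences instead of exact tokens can be illustrated with a minimal sketch. It is not the DNA-style sequence matching referred to above: it swaps in `difflib.SequenceMatcher` from the Python standard library, and the function names, the 0.8 threshold and the sample sentences are all invented for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, turn non-letters into spaces and collapse whitespace."""
    kept = [ch.lower() if ch.isalpha() else " " for ch in text]
    return " ".join("".join(kept).split())


def reuse_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] between two passages compared as character sequences."""
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


def is_reuse(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: link two passages if their similarity is high."""
    return reuse_ratio(a, b) >= threshold


# Invented sample passages: a clean source, a noisy OCR copy, an unrelated text.
source = "The king was brought to trial before the high court of justice."
ocr_copy = "The kiug was brovght to trial before the high conrt of juftice."
unrelated = "Parliament voted new supplies for the war in the Low Countries."

linked = is_reuse(source, ocr_copy)
not_linked = is_reuse(source, unrelated)
```

Because the comparison works on character sequences, a handful of misread letters lowers the score only slightly, whereas exact word matching would lose every word that contains an OCR error.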

There is no ready-to-use gold standard for text reuse cluster detection. Therefore, we compared our clusters with the critical edition of the Enquiry concerning Human Understanding (EHU) to see whether the cases of text reuse from Hume’s Treatise in EHU are also identified by our method. The results show that we were able to identify all cases included in EHU except those in footnotes. Because some of the changes that Hume made from the Treatise to EHU are not evident, this is a very promising result.

Analysis

To give a general overview of Hume’s History in relation to the other works considered, we compared their respective volumes of source text reuse (figure 1). The comparison reveals some fundamental stylistic and structural differences. Hume’s and Carte’s Histories are composed quite differently from Rapin’s and Guthrie’s, which have roughly three times more reused fragments: Rapin typically opens a chapter with a long quote from a source document and then moves on to discuss the related historical events. Guthrie writes similarly, quoting long passages from sources of his choice. Hume is different: his quotes are more evenly spread, and a greater proportion of the text seems to be his own original formulation.

[Figure 1.]

Change in text reuse in the Histories

All the histories of England considered in our analysis are massive works comprising multiple separate volumes. The number of reused text fragments found in these volumes differs significantly, but the trends are roughly similar. The common overall feature is a rise in the frequency of direct quotes in later volumes.

The increase in text reuse peaks in the volumes covering the reign of Charles I and the events of the English Civil War, but for both Hume and Rapin (figures 2 & 3) the highest peak is not at the end of Charles’s reign but in the lead-up to the confrontation with Parliament. In Guthrie and Carte (figures 4 & 5) the peaks are located in the final volume. Except for Guthrie, all the historical works considered here have their highest reuse rates around the period of Charles I’s reign, which was an intensely debated topic among Hume’s contemporaries.

[Figure 2.]

[Figure 3.]

[Figures 4, 5.]

We can further break down the sources of reused text fragments by the political affiliation of their authors (figure 6). A significant portion of the detected text reuse cases by Hume link to authors with no strong political leaning in the wider Whig-Tory context. It is obvious that serious antiquarian work that is politically neutral forms the main body of seventeenth-century historiography in England. In the later volumes, the number of text reuse cases tracing back to authors with a political affiliation increases, as might be expected with more heavily politically loaded topics.

[Figure 6.]

Taking an overview of the authors of the text reuse fragments in Hume’s History (figure 7), we note that the statistics are dominated by a handful of writers, with a long “tail” of others whose use is limited to a few fragments. Both groups, the Whig and Tory authors, feature a few “main sources” for Hume. John Rushworth (1612-1690) emerges as the most influential source, followed closely by Edward Hyde Clarendon (1609-1674). Both Rushworth and Clarendon had reached a position of prominence as historians and were among the best known and respected sources available when Hume was writing his own work. We might even question if their use was politically colored at all, as practically everyone was using their works, regardless of political stance.

[Figure 7.]

Charles I’s execution and Hume’s impartiality

A relatively limited list of authors is responsible for the majority of the text fragments in Hume's History. As one might intuitively expect, the use of particular authors is concentrated in particular chapters. In general, unevenness in the use of quotes can be seen as more of a norm than an exception.

However, there is at least one central chapter in Hume’s Stuart history that breaks this pattern: Chapter LIX, perhaps the most famous chapter in the whole work, covering the execution of Charles I. Nineteenth-century Whig commentators argued, with great enthusiasm, that Hume’s use of sources, especially in this particular chapter, and Hume’s description of Charles’s execution followed Royalist sources, and the Jacobite Thomas Carte in particular. Yet the more carefully balanced use of sources in this particular chapter reveals a clear intention of wanting to be (or appear to be) impartial on this specific topic (figure 8).

Of course, there is John Stuart Mill’s claim that Hume only uses Whigs when they support his Royalist bias. In the light of our data, this seems unlikely. If we compare Hume's use of Royalist sources in his treatment of the execution of Charles I with Carte’s, Carte’s use of Royalists is statistically off the chart, whereas Hume’s is in line with his use of Tory sources elsewhere in the volume.

[Figure 8.]

Hume’s influence on later Histories

A final area of interest in terms of text reuse is what it can tell us about an author’s influence on later writers. The reuse totals of Hume’s History in works following its publication are surprisingly evenly spread out over all the volumes (figure 9), and in this respect differ from the other historians considered here (figures 10 - 12). The only exception is the last volume where a drop in the amount of detected reuse fragments can be considered significant.

Of all the authors, only Hume shows significant reuse arising from the volumes discussing the Civil War. The reception of Hume’s first Stuart volume, the first published volume of his History, is well known. It is notable that the next volumes published, that is, the following Stuart volumes, possibly written with the angry reception of the first Stuart volume in mind, are the ones that seem to have given rise to the least discussion.

[Figure 9.]

[Figure 10.]

[Figures 11 & 12.]

Bibliography

Original sources

Eighteenth-century Collections Online (GALE)

English Short-Title Catalogue (British Library)

Thomas Carte, General History of England, 4 vols., 1747-1755.

William Guthrie, History of Great Britain, 3 vols., 1744-1751.

David Hume, History of England, 8 vols., 1778.

David Hume, Enquiry concerning Human Understanding, ed. Tom L. Beauchamp, OUP, 2000.

Paul de Rapin, History of England, 15 vols., 1726-32.

Secondary sources

Herbert Butterfield, The Englishman and his history, 1944.

John Burrow, Whigs and Liberals: Continuity and Change in English Political Thought, 1988.

Duncan Forbes, Hume’s Philosophical Politics, Cambridge, 1975.

James Harris, Hume. An intellectual biography, 2015.

Colin Kidd, Subverting Scotland's Past. Scottish Whig Historians and the Creation of an Anglo-British Identity 1689–1830, Cambridge, 1993.

Royce MacGillivray, ‘Hume's "Toryism" and the Sources for his Narrative of the Great Rebellion’, Dalhousie Review, 56, 1987, pp. 682-6.

John Stuart Mill, ‘Brodie’s History of the British Empire’, Robson et al. ed. Collected works, vol. 6, pp. 3-58. (http://oll.libertyfund.org/titles/mill-the-collected-works-of-john-stuart-mill-volume-vi-essays-on-england-ireland-and-the-empire)

Ernest Mossner, ‘Was Hume a Tory Historian?’, Journal of the History of Ideas, 2, 1941, pp. 225-236.

Karen O’Brien, Narratives of Enlightenment: Cosmopolitan History from Voltaire to Gibbon, CUP, 1997.

Laird Okie, ‘Ideology and Partiality in David Hume's History of England’, Hume Studies, vol. 11, 1985, pp. 1-32.

Frances Palgrave, ‘Hume and his influence upon History’ in vol. 9 of Collected Historical Works, ed. R. H. Inglis Palgrave, 10 vols. CUP, 1919-22.

John Pocock, Barbarism and religion, vols. 1-2.

B. A. Ring, ‘David Hume: Historian or Tory Hack?’, North Dakota Quarterly, 1968, pp. 50-59.

Claudia Schmidt, Reason in history, 2010.

Mark Spencer, ‘David Hume, Philosophical Historian: “contemptible Thief” or “honest and industrious Manufacturer”?’, Hume conference, Brown, 2017.

Vesanto, Nivala, Salakoski, Salmi & Ginter: A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora. Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24. May 2017, Gothenburg, Sweden. (http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf)


3:00pm - 3:30pm
Long Paper (20+10min) [abstract]

Refutatio errorum – authorship attribution on a late-medieval antiheretical treatise

Reima Välimäki

University of Turku, Cultural history

Since Peter Biller’s attribution of the Cum dormirent homines (1395) to Petrus Zwicker, perhaps the most important late medieval inquisitor prosecuting Waldensians, the treatise has become a standard source on late medieval German Waldensianism. There is, however, another treatise, known as the Refutatio errorum, which has received far less attention. In my dissertation (2016) I proposed that the similarities in style, contents, manuscript tradition and composition between the Refutatio errorum and the Cum dormirent homines are so remarkable that Petrus Zwicker can be confirmed as the author of both texts. The Refutatio exists in four different redactions. However, the redaction edited by J. Gretser in the 17th century, and consequently used by modern scholars, does not correspond to the earlier and more popular redaction that is found in the majority of preserved manuscripts.

In the proposed paper I will add a new element of verification to Zwicker’s authorship: machine-learning-based computational authorship attribution, as applied in the digital humanities consortium Profiling Premodern Authors (University of Turku, 2016–2019). In its simplest form, authorship attribution is a binary classification task based on textual features (word uni/bi-grams, character n-grams). In our case, the classes are “Petrus Zwicker” (based on features from his known treatise) and “not-Zwicker” (based on features from a background corpus consisting of medieval Latin polemical treatises, sermons and other theological works). The test cases are the four redactions of the Refutatio errorum. Classifiers used include a linear Support Vector Machine and a more complex Convolutional Neural Network. Researchers from the Turku NLP group (Aleksi Vesanto, Filip Ginter, Sampo Pyysalo) are responsible for the computational analysis.
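The binary set-up described above can be sketched in a few lines. This is only an illustration under stated assumptions: it replaces the linear Support Vector Machine and the Convolutional Neural Network with a much simpler comparison of pooled character n-gram profiles by cosine similarity, and the function names, the class labels and the toy strings are all invented.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lowercased, space-normalized text."""
    t = " ".join(text.lower().split())
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))


def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * q[gram] for gram, count in p.items() if gram in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def attribute(test_text, candidate_texts, background_texts, n=3):
    """Return the label whose pooled n-gram profile is closest to the test text."""
    profiles = {
        "candidate": char_ngrams(" ".join(candidate_texts), n),
        "not-candidate": char_ngrams(" ".join(background_texts), n),
    }
    test = char_ngrams(test_text, n)
    return max(profiles, key=lambda label: cosine(test, profiles[label]))


# Toy strings standing in for a known treatise and a background corpus.
known = ["abab abab cdcd abab", "cdcd abab abab"]
background = ["xyzx yzxy zzxy", "yzxy xyzx"]

label_a = attribute("abab cdcd abab", known, background)
label_b = attribute("xyzx yzxy", known, background)
```

A real experiment would use full texts, held-out evaluation and a trained classifier; the sketch only shows how character n-grams turn a text into features that can be compared across the two classes.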

The paper contributes to the conference theme History. It aims to bridge the gap between authorship attribution based on qualitative analysis (e.g. contents, manuscript tradition, codicological features, palaeography) and computational stylometry. Computational methods are treated as one tool that contributes to the difficult task of recognising authorship in a medieval text. The study of author profiles of four different redactions of a single work contributes to the discussions on scribes, secretaries and compilers as authors of medieval texts (e.g. Reiter 1996, Minnis 2006, Connolly 2011, Kwakkel 2012, De Gussem 2017).

Bibliography:

Biller, Peter. “The Anti-Waldensian Treatise Cum Dormirent Homines of 1395 and its Author.” In The Waldenses, 1170-1530: Between a Religious Order and a Church, 237–69. Variorum Collected Studies Series. Aldershot: Ashgate, 2001.

Connolly, Margaret. “Compiling the Book.” In The Production of Books in England 1350-1500, edited by Alexandra Gillespie and Daniel Wakelin, 129–49. Cambridge Studies in Palaeography and Codicology 14. Cambridge ; New York: Cambridge University Press, 2011.

De Gussem, Jeroen. “Bernard of Clairvaux and Nicholas of Montiéramey: Tracing the Secretarial Trail with Computational Stylistics.” Speculum 92, no. S1 (2017): S190–225. https://doi.org/10.1086/694188.

Kwakkel, Erik. “Late Medieval Text Collections. A Codicological Typology Based on Single-Author Manuscripts.” In Author Reader Book: Medieval Authorship in Theory and Practice, edited by Stephen Partridge and Erik Kwakkel, 56–79. Toronto: University of Toronto Press, 2012.

Reiter, Eric H. “The Reader as Author of the User-Produced Manuscript: Reading and Rewriting Popular Latin Theology in the Late Middle Ages.” Viator 27, no. 1 (1996): 151–70. https://doi.org/10.1484/J.VIATOR.2.301125.

Minnis, A. J. “Nolens Auctor Sed Compilator Reputari: The Late-Medieval Discourse of Compilation.” In La Méthode Critique Au Moyen Âge, edited by Mireille Chazan and Gilbert Dahan, 47–63. Bibliothèque d’histoire Culturelle Du Moyen âge 3. Turnhout: Brepols, 2006.

Välimäki, Reima. “The Awakener of Sleeping Men. Inquisitor Petrus Zwicker, the Waldenses, and the Retheologisation of Heresy in Late Medieval Germany.” PhD Thesis, University of Turku, 2016.

 
4:00pm - 5:30pm T-PIV-3: Legal and Ethical Matters
Session Chair: Christian-Emil Smith Ore
PIV 
 
4:00pm - 4:30pm
Long Paper (20+10min) [abstract]

Breaking Bad (Terms of Service)? The DH-scholar as Villain

Pelle Snickars

Umeå University

For a number of years I have been heading a major research project on Spotify (funded by the Swedish Research Council). It takes a software studies and digital humanities approach to streaming media, and engages in reverse engineering Spotify’s algorithms, aggregation procedures, metadata, and valuation strategies. During the summer of 2017 I received an email from a Spotify legal counsel who was “concerned about information it received regarding methods used by the responsible group of researchers in this project. This information suggests that the research group systematically violated Spotify’s Terms of Use by attempting to artificially increase plays, among others, and to manipulate Spotify’s services with the help of scripts or other automated processes.” I responded politely but got no answer. A few weeks later, I received a letter from the senior legal advisor at my university. Spotify had apparently contacted the Research Council with the claim that our research was questionable in a way that would demand “resolute action”, and the possible termination of financial and institutional support. At the time of writing it is unclear whether Spotify will file a lawsuit or start a litigation process.

DH-research is embedded in ‘the digital’, and so are its methods, from scraping web content to the use of bots as research informants. Within scholarly communities centered on the study of the web or social media there is a rising awareness of the ways in which digital methods might be non-compliant with commercial Terms of Service (ToS), a discussion which has not yet filtered through to, or been taken seriously within, the digital humanities. However, DH-researchers will in years to come increasingly have to ask themselves whether or not their scholarly methods need to abide by ToS. As social computing researcher Amy Bruckman has stated, it might have profound scholarly consequences: “Some researchers choose not to do a particular piece of work because they believe they can’t violate ToS, and then another researcher goes and does that same study and gets it published with no objections from reviewers.”

My paper will recount my legal dealings with Spotify, including a discussion of the digital methods used in our project, but will also reflect more generally on the ethical implications of collecting data in novel ways. ToS are contracts, not the law; still, there is a dire need for ethical justification and scholarly discussion of why the importance of academic research justifies breaking ToS.


4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

Legal issues regarding tradition archives: the Latvian case study.

Liga Abele, Anita Vaivade

Latvian Academy of Culture

Historically, tradition archives have developed rather apart, both in form and in substance, from other types of cultural heritage collections held by “traditional” archives, museums and libraries. However, for current trends in the development of technical and institutional capacities to exert their full positive influence on managerial approaches, there must be greater legal certainty regarding the status and functioning of tradition archives. There are several trajectories through which tradition archives can be, and are, influenced by the surrounding legal and administrative framework at both national and regional level. A thorough knowledge of the impact of the existing regulatory base can contribute to informed decision making consistent with the role that these archives play in safeguarding the intangible cultural heritage. The paper will present a case study of the current Latvian situation within a broader regional perspective of the three Baltic states. The legal framework of interest is defined by the institutional status of tradition archives, the legal status of the collections, and the legal provisions and restrictions regarding their functioning (work), which involves the gathering, processing and further use of archive material.

The paper is based on data gathered within the EUSBSR Seed Money Facility project “DigArch_ICH. Connecting Digital Archives of Intangible Heritage” (No. S86, 2016/2017), carried out in partnership by the Institute of Literature, Folklore and Art of the University of Latvia, the Estonian Folklore Archives, the Estonian Literary Museum, the Norwegian Museum of Cultural History and the Institute for Language and Folklore in Sweden. One of the project’s thematic lines dealt with legal and ethical issues, asking about national experiences of legal concerns and restrictions on the work of tradition archives, the legal status of their collections, the practice of signing written agreements between researcher and informant, and existing codes of ethics applicable to the work of tradition archives. Responses were received from a total of 21 institutions in 11 countries of the Baltic Sea region and neighbouring countries.

Fields of the legislation involved.

There are several fields of national legislation that influence the work of tradition archives, such as the regulations on intangible heritage, documentary heritage, work of memory institutions (archives, museums, libraries), intellectual property and copyright, as well as protection of personal data. Depending on the national legislative choices and contexts, these fields may be regulated by separate laws, or overarching laws (for instance, in the field of cultural heritage, covering both intangible as well as documentary heritage protection), or some of these fields may remain uncovered by the national legislation.

According to the results of the survey, the legal status of tradition archives can be rather diverse. They can be part of larger institutions such as universities, museums, or libraries. In the Latvian situation, there are specific laws for each of these types of institution, which can entail a large variety of rule sets. The status of the collections can also differ depending on whether they are recognised as part of the protected national treasures (such as the national collections of museums, archives etc.). The ownership status can be rather diverse, given that collections may belong to the state or be privately owned. Moreover, ownership rights to the same collection can be split between various types of owners of similar or varied legal status. The paper proposes to identify and analyse, in the Latvian situation, the consequences for the collections of tradition archives of the institutional status of their holder, their ownership status and the influence exercised by legislation in the fields of copyright and intellectual property law as well as data protection. The Latvian case will be put into perspective by comparison with the Estonian and Lithuanian experience.

International influence on the national legislation.

National legislation is influenced by international normative instruments at different levels, ranging from the global perspective (UNESCO) to the regional level, in this case the European scope. At the global level there are several instruments, ranging from legally binding instruments to “soft law”, such as the 2003 UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage or the 2015 UNESCO Recommendation Concerning the Preservation of, and Access to, Documentary Heritage Including in Digital Form. Concerning the work of tradition archives, the 2003 Convention relates in particular to the documentation of intangible cultural heritage and the establishment of national inventories of such heritage, and in this regard tradition archives may have or establish a role in national policy implementation processes. European regional legislation and policy documents, adopted either by the Council of Europe or by the European Union, are also of relevance. They concern the field of cultural heritage (with a general direction towards an integrated approach to the various cultural heritage fields), as well as personal data protection, copyright and intellectual property rights. The role of the legally binding instruments of the European Union, such as directives and regulations, will be examined from the perspective of the national legislation related to tradition archives.

Aspects of deontology.

As varied deontological aspects affect the functioning of the tradition archives, these issues will be examined in the paper. There are national codes of ethics that may apply to the work of tradition archives, either from the perspective of research or in relation to archival work. Within the field of intangible cultural heritage, ethical issues have also been debated internationally in recent years, and their topicality for the different stakeholders involved has been recognised. Thus, in 2015 the UNESCO Intergovernmental Committee for the Safeguarding of the Intangible Cultural Heritage adopted the Ethical Principles for Safeguarding Intangible Cultural Heritage. This document provides recommendations to the various persons and institutions that take part in safeguarding activities, and this also concerns the work of tradition archives. There are also international deontology documents that concern the work of archives, museums and libraries. These documents will be referred to in a complementary manner, taking into consideration the specificity of tradition archives. One of them is the 1996 International Council of Archives (ICA) Code of Ethics. Although this code does not highlight archives that deal with folklore and traditional culture materials, it nevertheless sets general principles for archival work and for cooperation between archives, and also puts an emphasis on the preservation of documentary heritage. Another important deontological reference for tradition archives concerns the work of museums, which is particularly significant for archives that function as units in larger institutions, namely museums. An internationally well-known and often mentioned reference is the 2004 (1986) International Council of Museums (ICOM) Code of Ethics for Museums. Reference may also be made to the 2012 International Federation of Library Associations (IFLA) Code of Ethics for Librarians and other Information Workers.


4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

Where are you going, research ethics in Digital Humanities?

Sari Östman, Elina Vaahensalo, Riikka Turtiainen

University of Turku,

1 Background

In this paper we will examine the current state and future development of research ethics in Digital Humanities. We have analysed

a) ethics-focused inquiries with researchers in a multidisciplinary consortium project (CM)

b) Digital Humanities -oriented journals and

c) the objectives of the DigiHum Programme at the Academy of Finland, the ethical guidelines of AoIR (the Association of Internet Researchers, which published extensive sets of ethical guidelines for online research in 2002 and 2012) and academic ethical boards and committees, in particular the one at the University of Turku. We are planning to analyse the requests for comments that have not been approved by the ethical board at the University of Turku. For that, we need research permission from the administration of the University of Turku, which is in process.

Östman and Vaahensalo work in the consortium project Citizen Mindscapes (CM), which is part of the Academy of Finland’s Digital Humanities Programme. University Lecturer Turtiainen is using a percentage of her work time for the project.

In the Digital Humanities program memorandum, ethical examination of the research field is mentioned as one of the main objectives of the program (p. 2). The CM project has a work package for researching research ethics, which Östman is leading. We aim at examining the current understanding of ethics in multiple disciplines, in order to find some tools for more extensive ethical considerations especially in multidisciplinary environments. This kind of a toolbox would bring more transparency into multidisciplinary research.

Turtiainen and Östman already started developing the ethical toolbox for online research in their earlier publications (see e.g. Turtiainen & Östman 2013; Östman & Turtiainen 2016; Östman, Turtiainen & Vaahensalo 2017). The current phase takes the research on research ethics to a more analytical level.

2 Current research

When discussing a field of research such as Digital Humanities, it is quite clear that online-specific research ethics (Östman & Turtiainen 2016; Östman, Turtiainen & Vaahensalo 2017) plays an especially significant role in it. Research projects often concentrate on one source or topic with a multidisciplinary take, so the understandings of research ethics may vary fundamentally even inside the same research community. Different ethical focal points and varying understandings could be a multidisciplinary resource, but it is essential to recognize and pay attention to the varying disciplinary backgrounds as well as the online-specific research contexts. Only by taking these matters into consideration are we able to create functional ethical guidelines for multidisciplinary online-oriented research.

The Inquiries in CM24

On the basis of the two rounds of ethical inquiry within the CM24 project, the researchers seemed to consider anonymization, dependence on corporations, co-operation with other researchers and preserving the data to be the most focal ethical matters. Judging by the answers, ethical views seemed to be

a) individually constructed: the topic of research, methods, data plus the personal view to what might be significant

b) based on one’s education and discipline tradition

c) raised from the topics and themes the researcher had come in touch with during the CM24 project (and in similar multidisciplinary situations earlier)

One thing seemingly happening with the current trend of big data usage is that even individually produced online material is seen as a mass: faceless, impersonalized data, available to anyone and everyone. This ethical discussion was already under way in the early 2000s (see e.g. Östman 2007, 2008; Turtiainen & Östman 2009), when researchers first turned their interest to online material. It was not then, and it is not now, ethically sustainable research to treat content based on the private and everyday lives of individual people as ’take and run’ data. However, this seems to be happening again, especially in disciplines where ethics has mostly focused on copyright and perhaps on corporate and co-operational relationships. (In CM24, for example, information science seems to be one of the disciplines where intimate data is used as a faceless mass.) Then again, a historian in the project argues in their answer that already choosing an online discussion as an object of research is an ethical choice, ”shaping what we can and should count in into the actual research”.

Neither of the above-mentioned ethical views is faulty. However, it might be difficult for these two researchers to find a common understanding about ethics when, for example, writing a paper together. A multifaceted, generalized collection of guidelines for multidisciplinary research would probably be of help.

Digital Humanities Journals and Publications

To explore ethics in digital humanities, we needed a diverse selection of publications to represent research in Digital Humanities. Nine different digital humanities journals were chosen for analysis, based on the listing made by Berkeley University. The focus in these journals varies from pedagogy to literary studies. However, they all are digital humanities oriented. The longest-running journal on the list has been published since 1986 and the most recent journals have been released for the first time in 2016. The journals therefore cover the relatively long-term history of digital humanities and a wide range of multi- and interdisciplinary topics.

In the journals and the articles published in them, research ethics is clearly a side issue, even though it is not entirely ignored. In the publications, research ethics is largely taken into account in the form of source criticism. Big data, digital technology and copyright issues related to research materials and multidisciplinary cooperation are the most common examples of research ethical considerations. Databases, text digitization and web archives are also discussed in the publications. These examples show that research ethics also affects digital humanities, but in practice discussion of research ethics is relatively scarce in the publications.

Publications of the CM project were also examined, including some of our own articles. Except for one research ethics oriented article (Östman & Turtiainen 2016) most of the publications have a historical point of view (Suominen 2016; Suominen & Sivula 2016; Saarikoski 2017; Vaahensalo 2017). For this reason, research ethics is reflected mainly in the form of source criticism and transparency. Ethics in these articles is not discussed in more length than in most of the examined digital humanities publications.

In this area, too, a multifaceted, generalized collection of guidelines for multidisciplinary research would probably be of benefit: it would be essential to increase the transparency of research reporting, especially in Digital Humanities, whose disciplinary nature is complicated and multifaceted. More thorough reporting of ethical matters would therefore increase the transparency of the nature of Digital Humanities itself.

The Ethics Committee

The Ethics committee of the University of Turku follows the development in the field of research ethics both internationally and nationally. The mission of the committee is to maintain a discussion on research ethics, enhance the realisation of ethical research education and give advice on issues related to research ethics. At the moment its main purpose is to assess and give comments on the research ethics of non-medical research that involves human beings as research subjects and can cause either direct or indirect harm to the participants.

The law on protecting the personal information of private citizens appears to be a significant aspect of research ethics. Turtiainen (a member of the committee) states that, at present, one of the main concerns seems to be poor data protection. The registers constructed from the informant base are often neglected in the humanities, whereas disciplines such as psychology and welfare research consider them on a fairly regular basis. Then again, the other disciplines do not necessarily consider other aspects of vulnerability as deeply as (especially culture/tradition-oriented) humanists seem to do.

Our aim is to analyse requests for comments which have not been approved and which the applicants have therefore been asked to modify before recommendation or re-evaluation. Our interest focuses on the arguments that have caused the rejection. Before that phase of our study we need research permission of our own from the administration of the University of Turku, which is in process. It would be interesting to compare the rejected requests for comments from the ethics committee with the results of the ethical inquiries within the CM24 project and the outline of research ethics in digital humanities journals and publications.

3 Where do you go now…

According to our current study, it seems that the position of research ethics in Digital Humanities and, more widely, in multidisciplinary research, is somewhat two-fold:

a) for example in the Digital Humanities Program of the Academy of Finland, the significance of ethics is strongly emphasized, and the research projects in the program are being encouraged to increase their ethical discussions and the transparency of those discussions. The discourse about, and the interest in, developing online-oriented research ethics seems to be growing and suggesting that ’something should be done’: ethical matters should be present in the research projects in a more extensive way.

b) however, it seems that in practice the position of research ethics has not changed much within the last 10 years or so, despite the fact that the digital research environments of the humanities have become more and more multidisciplinary, which leads to multiple understandings about ethics even within individual research projects. Yet, the ethics in research reports is not discussed in more length / depth than earlier. Even in Digital Humanities -oriented journals, ethics is mostly present in a paragraph or two, repeating a few similar concerns in a way which at times seems almost ’automatic’; that is, as if the ethical discussion would have been added ’on the surface’ hastily, because it is required from the outside.

This is an interesting situation. There is a possibility that researchers are not taking seriously the significance of ethical focal points in their research. This is, however, an argument that we would not wish to make. We consider it more likely that in the ever-changing digital research environment, researchers lack multidisciplinary tools for analyzing and discussing ethical matters in the depth that is needed. By examining the current situation extensively, our study aims at finding the focal ethical matters in multidisciplinary research environments, and at constructing at least a basic toolbox for ethical discussions in Digital Humanities research.

Sources and Literature

Inquiries made by Östman, Turtiainen and Vaahensalo with the researchers of the Citizen Mindscapes 24 project. Two rounds in 2016–2017.

Digital Humanities (DigiHum). Academy Programme 2016–2019. Programme memorandum. Helsinki: Academy of Finland.

Digital Humanities journals listed by Digital Humanities at Berkeley. http://digitalhumanities.berkeley.edu/resources/digital-humanities-journals

Markham, Annette & Buchanan, Elizabeth 2012: Ethical Decision-Making and Internet Research: Recommendations from the AoIR Ethics Working Committee (Version 2.0). https://aoir.org/reports/ethics2.pdf.

Saarikoski, Petri: “Ojennat kätesi verkkoon ja joku tarttuu siihen”. Kokemuksia ja muistoja kotimaisen BBS-harrastuksen valtakaudelta. Tekniikan Waiheita 2/2017.

Suominen, Jaakko (2016): ”Helposti ja halvalla? Nettikyselyt kyselyaineiston kokoamisessa.” In: Korkiakangas, Pirjo, Olsson, Pia, Ruotsala, Helena, Åström, Anna-Maria (eds.): Kirjoittamalla kerrotut – kansatieteelliset kyselyt tiedon lähteinä. Ethnos-toimite 19. Ethnos ry., Helsinki, 103–152. [Easy and Cheap? Online surveys in cultural studies.]

Suominen, Jaakko & Sivula, Anna (2016): “Digisyntyisten ilmiöiden historiantutkimus.” In Elo, Kimmo (ed.): Digitaalinen humanismi ja historiatieteet. Historia Mirabilis 12. Turun Historiallinen Yhdistys, Turku, 96–130. [Historical Research of Born Digital Phenomena.]

Turtiainen, Riikka & Östman, Sari 2013: Verkkotutkimuksen eettiset haasteet: Armi ja anoreksia. In: Laaksonen, Salla-Maaria et. al. (eds.): Otteita verkosta. Verkon ja sosiaalisen median tutkimusmenetelmät. Tampere: Vastapaino. pp. 49–67.

– 2009: ”Tavistaidetta ja verkkoviihdettä – omaehtoisten verkkosisältöjen tutkimusetiikkaa.” Teoksessa Grahn, Maarit ja Häyrynen, Maunu (toim.) 2009: Kulttuurituotanto – Kehykset, käytäntö ja prosessit. Tietolipas 230. SKS, Helsinki. 2009. s. 336–358.

Vaahensalo, Elina: Kaikenkattavista portaaleista anarkistiseen sananvapauteen – Suomalaisten verkkokeskustelufoorumien vuosikymmenet. Tekniikan Waiheita 2/2017.

Östman, Sari 2007: ”Nettiksistä blogeihin: Päiväkirjat verkossa.” Tekniikan Waiheita 2/2007. Tekniikan historian seura ry. Helsinki. 37–57.

Östman, Sari 2008: ”Elämäjulkaiseminen – omaelämäkerrallisten traditioiden kuopus.” Elore, vol. 15-2/2008. Suomen Kansantietouden Tutkijain Seura. http://www.elore.fi./arkisto/2_08/ost2_08.pdf.

Östman, Sari & Turtiainen, Riikka 2016: From Research Ethics to Researching Ethics in an Online Specific Context. In Media and Communication, vol. 4. iss. 4. pp. 66–74. http://www.cogitatiopress.com/ojs/index.php/mediaandcommunication/article/view/571.

Östman, Sari, Riikka Turtiainen & Elina Vaahensalo 2017: From Online Research Ethics to Researching Online Ethics. Poster. Digital Humanities in the Nordic Countries 2017 Conference.


5:00pm - 5:15pm
Short Paper (10+5min) [abstract]

Copyright exceptions or licensing : how can a library acquire a digital game?

Olivier Charbonneau

Concordia University,

Copyright, caught in a digital maelstrom of perpetual reforms and shifting commercial practices, exacerbates tensions between cultural stakeholders. On the one hand, copyright seems to be drowned in Canada and the USA by the role reserved to copyright exceptions by parliaments and the courts. On the other, institutions, such as libraries, are keen to navigate digital environments by allocating their acquisitions budgets to digital works. How can markets, social systems and institutions emerge or interact if we are not able to resolve this tension?

Beyond the paradigm shifts brought by digital technologies or globalization, one must recognize the conceptual paradox surrounding digital copyrighted works. In economic terms, they behave naturally as public goods, while copyright attempts to restore their rivalrousness and excludability. Within this paradox lies tension, between the aggregate social wealth spread by a work and its commoditized value, between network effects and reserved rights.

In this paper, I will summarize the findings of my doctoral research project and apply them to the case of digital games in libraries.

The goal of my doctoral work was to ascertain the role of libraries in the markets and social systems of digital copyrightable works. Ancillary goals included exploring the “border” between licensing and exceptions in the context of heritage institutions as well as building a new method for capturing the complexity of markets and social systems that stem from digital protected works. To accomplish these goals, I analysed a dataset comprising the terms and conditions of licences held by academic libraries in Québec. I show that the terms of these licences overlap with copyright exceptions, highlighting how libraries express their social mission in two normative contexts: positive law (copyright exceptions) and private ordering (licensing). This overlap is both necessary and poorly understood: they are not two competing institutional arrangements but the same image reflected in two distinct normative settings. It also provides a road-map for right-holders on how to make digital content available through libraries.

The study also points to the rising importance of automation and computerization in the provisioning of licences in the digital world. Metadata describing the terms of a copyright licence are increasingly represented in computer models and leveraged to mobilize digital corpora for the benefit of a community. Whereas the print world was driven by assumptions and physical limits to using copyrighted works, the digital environment introduces new data points for interactions which were previously hidden from scrutiny. The future lies not in optimizing transaction costs but in crafting elegant institutional arrangements through licensing.

If libraries exist to capture some left-over value in the utility curve of our cultural, informational or knowledge markets, the current role they play in copyright need not change in the digital environment. What does change, however, is hermeneutics: how we attribute value to digital copyrighted works and how we study society’s use of them.

We conclude by transposing the results of this study to the case of digital games. Québec is currently a hotbed for both independent and AAA video game studios. Despite this, a market failure currently exists due to the absence of flexible licensing mechanisms to make indie games available through libraries. This part of the study was funded with the generous support from the Knight Foundation in the USA and conducted at the Technoculture Art & Games (TAG) research cluster of the Milieux Institute for arts, culture and technology at Concordia University in Montréal, Canada.

 

 
Date: Friday, 09/Mar/2018
11:00am - 12:00pmF-PIV-1: Manuscripts, Collections and Geography
Session Chair: Asko Nivala
PIV 
 
11:00am - 11:15am
Distinguished Short Paper (10+5min) [abstract]

Big Data and the Afterlives of Medieval and Renaissance Manuscripts

Toby Burrows1,2, Lynn Ransom3, Hanno Wijsman4, Eero Hyvönen5,6

1University of Oxford; 2University of Western Australia; 3University of Pennsylvania; 4Institut de recherche et d'histoire des textes; 5Aalto University; 6University of Helsinki

Tens of thousands of European medieval and Renaissance manuscripts have survived until the present day. As the result of changes of ownership over the centuries, they are now spread all over the world, in collections across Europe, North America, Asia and Australasia. They often feature among the treasures of libraries, museums, galleries, and archives, and they are frequently the focus of exhibitions and events in these institutions. They provide crucial evidence for research in many disciplines, including textual and literary studies, history, cultural heritage, and the fine arts. They are also objects of research in their own right, with disciplines such as paleography and codicology examining the production, distribution, and history of manuscripts, together with the people and institutions who created, used, owned, and collected them.

Over the last twenty years there has been a proliferation of digital data relating to these manuscripts, not just in the form of catalogues, databases, and vocabularies, but also in digital editions and transcriptions and – especially – in digital images of manuscripts. Overall, however, there is a lack of coherent, interoperable infrastructure for the digital data relating to these manuscripts, and the evidence base remains fragmented and scattered across hundreds, if not thousands, of data sources.

The complexity of navigating multiple printed sources to carry out manuscript research has, if anything, been increased by this proliferation of digital sources of data. Large-scale analysis, for both quantitative and qualitative research questions, still requires very time-consuming exploration of numerous disparate sources and resources, including manuscript catalogues and databases of digitized manuscripts, as well as many forms of secondary literature. As a result, most large-scale research questions about medieval and Renaissance manuscripts remain very difficult, if not impossible, to answer.

The “Mapping Manuscript Migrations” project, funded by the Trans-Atlantic Platform under its Digging into Data Challenge for 2017-2019, aims to address these needs. It is led by the University of Oxford, in partnership with the University of Pennsylvania, Aalto University in Helsinki, and the Institut de recherche et d’histoire des textes in Paris. The project is building a coherent framework to link manuscript data from various disparate sources, with the aim of enabling searchable and browsable semantic access to aggregated evidence about the history of medieval and Renaissance manuscripts.

This framework is being used as the basis for a large-scale analysis of the history and movement of these manuscripts over the centuries. The broad research questions being addressed include: how many manuscripts have survived; where they are now; and which people and institutions have been involved in their history. More specific research focuses on particular collectors and countries.

The paper will report on the first six months of this project. The topics covered will include the new digital platform being developed, the sources of data which are being combined, the data modeling being carried out to link disparate data sources, the research questions which this assemblage of big data is being used to address, and the ways in which this evidence can be presented and visualized.


11:15am - 11:30am
Short Paper (10+5min) [abstract]

The World According to the Popes: A Geographical Study of the Papal Documents, 2005–2017

Roger Mähler, Fredrik Norén

Umeå University, Sweden,

This paper seeks to explore what an atlas of the popes would be like. Can one study places in texts to map out latent meanings of the Vatican’s political and religious ambitions, and to anticipate evolving trends? Could spatial analysis be a key to better understand a closed institution such as the papacy?

The Holy See is often associated with conservative stability. The papacy has, after all, managed to prevail while states and supranational organizations have come and gone. At the same time, the Vatican has shown a remarkable capacity to adapt to scientific findings as well as to a changing worldview. This complexity is also reflected in the geopolitical strategies of the Catholic Church. For centuries the Vatican has been conscious of geography and politics as key aspects in strengthening the Holy See and securing its position on the international scene. During the twentieth century, for example, the church state expanded its global presence. When John Paul II was elected pope in 1978, the Vatican City had full diplomatic ties with 85 states. In 2005, when Benedict XVI was elected, that number had increased to 176. Moreover, the papacy now has formal diplomatic relations with the European Union, is represented as a permanent observer to various global organizations including the United Nations, the African Union and the World Trade Organization, and has even obtained a special membership in the Arab League (Agnew, 2010; Barbato, 2012). In fact, the emergence of an international public sphere and a global stage has been utilized by the Holy See, and has significantly increased its soft power (Barbato, 2012).

As the geopolitical conditions and ambitions of the Vatican City are changing, what happens to its perception of the world, of certain regions, and of places? Do the relationships between cities, countries, and regions constitute fixed historical patterns, or are these geographical structures evolving and changing as a new pope is elected? Inspired by Franco Moretti, this study departs from the notion that making connections between places and texts “will allow us to see some significant relationships that have so far escaped us” (Moretti, 1998: 3). The basis of the analysis is all English translated papal documents from Benedict XVI (2005–2013) and Francis (2013–), retrieved from the Vatican webpage (http://www.vatican.va/holy_father/index.htm).

Methodological Preparations: Scraping Data and Extracting Entities

From a technical point of view, the empirical material used in this study has been prepared in three steps. First, all web page documents in English have been downloaded, and the (proper) text in each document has been extracted and stored. Secondly, the places mentioned in each text document have been identified and extracted using the Stanford Named Entity Recognizer (NER) software. Thirdly, the resulting list of places has been manually reduced by merging name variations of the same place (e.g. “Sweden” and “Kingdom of Sweden”).
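The third step, merging name variations of the same place, can be illustrated with a minimal Python sketch. The alias table is hand-made and hypothetical apart from the "Kingdom of Sweden" example given above, and the function name merge_variants is an assumption introduced only for illustration.

```python
from collections import Counter

# Hand-curated alias table: variant -> canonical name.
# Only the Sweden entry comes from the text; a real table would be
# built manually while reviewing the extracted place list.
ALIASES = {"Kingdom of Sweden": "Sweden"}


def merge_variants(counts, aliases=ALIASES):
    """Sum occurrence counts of all variants under their canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged


raw_counts = {"Sweden": 3, "Kingdom of Sweden": 2, "Rome": 5}
merged_counts = merge_variants(raw_counts)
```

Keeping the table as plain data keeps the manual decisions visible and repeatable when the extraction is rerun.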

The Vatican's communication strategies differ from, let’s say, those of the daily press or the parliamentary parties, in the sense that they have a thousand-year perspective, or work from the point of view of eternity (Hägg, 2007). This is reflected on the Vatican’s webpage, which is immensely informative. Text material from all popes since the late nineteenth century is publicly accessible online, ranging from letters, speeches and bulls to encyclicals, all with a high optical character recognition (OCR) quality. Since the Holy See has always been, according to Göran Hägg, a “mediated one man show”, it makes sense to focus on a corpus of texts written or spoken by the popes in order to study the Vatican’s notion of, basically, everything (Hägg, 2007: 239). The period 2005 to 2016 is pragmatically chosen because of its comprehensive volume of English translated papal documents. Before this period, as Illustration 1 shows, you basically need to master Latin or Italian. While, for example, the English texts from John Paul II (1978–2005) amount to two million words, the corpus of Benedict XVI (2005–2013) together with the current pope Francis sums up to nearly 59 million words, spread over some 5000 documents.

Illustration 1. The table shows the change in English translated text material available at the Vatican webpage.

The text documents were extracted, or “scraped”, from the Vatican web site using scripts written in the Python programming language. The Scrapy library was used to “crawl” the web site, that is, to follow links of interest, starting from each Pope’s home page, and download each web page that contains a document in English. The site traversal (crawling) was governed by a set of rules specifying which links to follow and which target web pages (documents) to download. The links to follow included all links in the left-side navigation menu on the Pope’s home page, and the “paging” links in each referenced page. These links were easily identified using commonalities in the link URLs, and the web pages with the target text documents (in HTML) were likewise identified by links matching the pattern “.../content/name-of-pope/en/.../documents/”. Finally, the BeautifulSoup Python library was used to extract and cleanse the actual text from the downloaded web pages. (The text was easily identified by a “.documento” CSS class.)
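The two rules described above, which links count as target documents and which part of a page holds the text, can be sketched with the Python standard library alone. This is a stand-in for illustration, not the Scrapy and Beautiful Soup scripts used in the study. The regular expression is an assumed reading of the link pattern in which the elided parts are treated as wildcards, and the names is_document_link, DocumentoText and extract_text, as well as the example.org URLs, are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed reading of ".../content/name-of-pope/en/.../documents/":
# the elided parts are treated as wildcards, not as known path segments.
DOCUMENT_LINK = re.compile(r"/content/[^/]+/en/(?:[^/]+/)*documents/")

# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


def is_document_link(url):
    """True if the URL looks like an English target document."""
    return DOCUMENT_LINK.search(url) is not None


class DocumentoText(HTMLParser):
    """Collect the text found inside elements with the class 'documento'."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside a target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "documento" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html):
    """Return the whitespace-normalised text of the 'documento' element."""
    parser = DocumentoText()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical page used only to exercise the sketch.
sample_page = (
    '<html><body><div class="nav">Menu</div>'
    '<div class="documento"><p>Dear <b>brothers</b> and sisters,<br>peace.</p></div>'
    "<p>Footer</p></body></html>"
)
sample_text = extract_text(sample_page)
```

In a crawler, a predicate like is_document_link would decide which pages to store, while the navigation and paging links would be followed but not saved.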

In the next step we ran the Stanford Named Entity Recognizer on the collected text material. This software is developed by the Stanford Natural Language Processing Group, and is regarded as one of the most robust implementations of named entity recognition, that is, the task of finding, classifying and extracting (or labeling) “entities” within a text. Stanford NER uses a statistical modeling method (Conditional Random Fields, CRFs), has multiple language support, and includes several pre-trained classifier models (new models can also be trained). This study used one of the pre-trained models, the 3 class model (location, person and organization) trained on data from CoNLL 2003 (Reuters Corpus), MUC 6 and MUC 7 (newswire), ACE (newswire, broadcast news), OntoNotes (various sources including newswire and broadcast news) and Wikipedia. (This is the reason why “Hell” was not identified as a place, and why “God” was rarely identified as a person or a place. However, since the first two parts of the analysis will focus on what could be labeled as “earthly geography”, this was not considered a problem for the analysis.) Stanford NER tags each identified entity in the input text with the corresponding classifier. These tagged entities were then extracted from the entire text corpus and stored in a single spreadsheet file, aggregated on the number of occurrences per entity and document. (The stored columns were document name, document year, type of document, name of pope, entity, entity classifier, and number of occurrences.)
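The aggregation per entity and document can be sketched as follows. The sketch assumes that the tagger output has already been read into (token, label) pairs, with "O" marking tokens outside any entity; the function names and the sample document are hypothetical, and only a subset of the stored columns is shown.

```python
from collections import Counter
from itertools import groupby


def entities(tagged_tokens):
    """Join consecutive tokens sharing a non-'O' label into one entity.

    Limitation: two distinct, directly adjacent entities with the same
    label are merged into a single entity.
    """
    result = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != "O":
            result.append((" ".join(token for token, _ in group), label))
    return result


def count_rows(documents):
    """Map {document: tagged tokens} to sorted rows of
    (document, entity, label, occurrences)."""
    rows = []
    for name, tagged in documents.items():
        counts = Counter(entities(tagged))
        rows.extend((name, ent, label, n) for (ent, label), n in counts.items())
    return sorted(rows)


# Hypothetical tagged document used only to exercise the sketch.
sample_docs = {
    "doc1": [
        ("He", "O"), ("visited", "O"), ("New", "LOCATION"), ("York", "LOCATION"),
        ("and", "O"), ("Rome", "LOCATION"), (".", "O"), ("Rome", "LOCATION"),
    ]
}
sample_rows = count_rows(sample_docs)
```

Rows of this shape could then be written out with the csv module, with the remaining columns (year, document type, pope) joined in from document metadata.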

For some of the places identified by Stanford NER it was difficult to assess whether they were in fact persons or organizations, but they were still kept for the analysis. Furthermore, abstract geographical entities such as ”East”, very specific ones that are still difficult to identify geographically, like ”Beautiful Gate of the Temple”, and entities like ”Rome-Byzantium-Moscow”, which could be interpreted as a historic political alliance, were all kept for the analysis. After all, in this study the interest lies in the general connections between places, not the rare ones, which easily disappear in the larger patterns.

Papa Analytics

Based on the methodological preparations, the analysis consists of three parts, using different methods, of which the first two parts will utilize the identified place entities. First, the study introduces the spatial world of the recent papacy, using simpler methods to trace, for example, which places occur in the texts, their frequencies, their division into geopolitical and sacred places, and which places dominate. It also traces how the geographical density has changed over time, that is, how many places (total or unique) are mentioned per document or per 1000 words.
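As a small worked example of the density measure, places per 1000 words can be computed as 1000 times the number of place mentions divided by the word count. The function name and the sample figures below are hypothetical.

```python
def density_per_1000(place_mentions, word_count):
    """Return (total, unique) place mentions per 1000 words."""
    total = 1000 * len(place_mentions) / word_count
    unique = 1000 * len(set(place_mentions)) / word_count
    return total, unique


# Three mentions, two distinct places, in a 1500-word document.
total_density, unique_density = density_per_1000(["Rome", "Rome", "Sweden"], 1500)
```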

Secondly, the analysis studies the clusters of “co-occurring” places, based on places mentioned in the same document. Since most individual papal texts are dedicated to a certain topic, one can assume that places in a document have something in common. The term frequency-inverse document frequency (tf-idf) weighting is used as a measure of how important a place is in a specific document, and this weight is used in the co-occurrence computation. This unfolds the latent geographical network, as it is articulated by the papacy, with centers and peripheries, and both sacred and geopolitical aspects.
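As an illustration of the weighting described above, the following sketch computes tf-idf weights for places and accumulates a weighted co-occurrence score per pair of places. The abstract does not specify the exact formulas, so the choices here are assumptions: term frequency is the relative frequency within a document, idf is log(N/df), and the weight of a pair is the sum over documents of the product of the two tf-idf weights. The place counts are toy values.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_weights(doc_counts):
    """Map {doc: {place: count}} to {doc: {place: tf-idf weight}}."""
    n_docs = len(doc_counts)
    # Document frequency: in how many documents each place occurs.
    df = Counter(place for counts in doc_counts.values() for place in counts)
    weights = {}
    for doc, counts in doc_counts.items():
        total = sum(counts.values())
        weights[doc] = {
            place: (count / total) * math.log(n_docs / df[place])
            for place, count in counts.items()
        }
    return weights


def cooccurrence(weights):
    """Sum the product of tf-idf weights for every pair sharing a document."""
    edges = defaultdict(float)
    for doc_weights in weights.values():
        for a, b in combinations(sorted(doc_weights), 2):
            edges[(a, b)] += doc_weights[a] * doc_weights[b]
    return dict(edges)


toy_docs = {
    "d1": {"Rome": 2, "Jerusalem": 2},
    "d2": {"Rome": 1, "Moscow": 1},
    "d3": {"Moscow": 1, "Kiev": 1},
}
for pair, weight in cooccurrence(tfidf_weights(toy_docs)).items():
    print(pair, round(weight, 4))
```

Under this idf definition a place that occurs in every document receives weight zero, so it contributes nothing to the network; whether that is desirable depends on the research question.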

Last but not least, this study tries to map the space of the divine, as it is expressed through Benedict XVI and Pope Francis, using word2vec, a method developed by a team at Google in 2013 to produce word embeddings (Mikolov et al., 2013). Simply put, the algorithm positions the vocabulary of a corpus in a high-dimensional vector space based on the assumption that “words which are similar in meaning occur in similar contexts” (Rubenstein & Goodenough, 1965: 627). This enables the use of basic numerical methods to compute word (dis-)similarities, to find clusters of similar words, or to create scales of how (subsets of) words are related to certain dichotomies. This study investigates dichotomies such as “Heaven” and “Hell”, “Earth” and “Paradise”, or “God” and “Satan”. Hence, the third part of the study also seeks to relate the earthly geography to the religious space as articulated by the papacy.
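One common way to build such a scale, shown here only as an assumed illustration and not necessarily the method used in the study, is to take the difference between the vectors of the two poles as an axis and score every other word by its cosine similarity to that axis. The two-dimensional vectors below are made up for readability; real vectors would come from a trained word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def dichotomy_score(vectors, word, pole_a, pole_b):
    """Position of `word` on the axis pole_b -> pole_a, in [-1, 1].

    Positive values lean towards pole_a, negative values towards pole_b.
    """
    axis = [a - b for a, b in zip(vectors[pole_a], vectors[pole_b])]
    return cosine(vectors[word], axis)


# Made-up toy embeddings, for illustration only.
toy_vectors = {
    "heaven": [1.0, 0.0],
    "hell": [-1.0, 0.0],
    "rome": [0.5, 1.0],
    "abyss": [-0.8, 0.2],
}
for word in ("rome", "abyss"):
    print(word, round(dichotomy_score(toy_vectors, word, "heaven", "hell"), 3))
```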

References

Agnew, J. (2010). Deus Vult: The Geopolitics of the Catholic Church. Geopolitics, 15(1), 39–61.

Barbato, M. (2012). Papal Diplomacy: The Holy See in World Politics. IPSA XXII World Conference of Political Science, (2003), 1–29.

Finkel, J.R., Grenager, T., and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.

Florian, R., Ittycheriah, A., Jing, H. and Zhang, T. (2003) Named Entity Recognition through Classifier Combination. Proceedings of CoNLL-2003. Edmonton, Canada.

Hägg, G. (2007). Påvarna : två tusen år av makt och helighet. Stockholm: Wahlström & Widstrand.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space, 1–12.

Moretti, F. (1998). Atlas of the European Novel: 1800–1900. New York: Verso.

Rodriquez, K. J., Bryant, M., Blanke, T., & Luszczynska, M. (2012). Comparison of Named Entity Recognition tools for raw OCR text. Proceedings of KONVENS 2012 (LThist 2012 Workshop), 2012, 410–414.

Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.


11:30am - 11:45am
Short Paper (10+5min) [abstract]

Ownership and geography of books in mid-nineteenth century Iceland

Örn Hrafnkelsson

National and University Library of Iceland,

In October 1865, the national librarian, who was the only employee of the National Library of Iceland (est. 1818), got permission from the bishop of Iceland to send a written request to all provosts around the country to carry out a detailed survey in their parishes of the ownership of old Icelandic books printed before 1816. The title page of each book on every farm was to be copied in full detail, with line breaks and ornaments, number of printed pages, place of publication etc.

The aim of this five-year project was to compile data for a detailed national bibliography and a list of Icelandic authors to build up a good collection of books in the library.

Many of the written reports have survived and are now in the library archive. In my paper, I will talk about these unused sources on the ownership of books on every farm in Iceland, how Icelandic book history can now be interpreted in a new and different way and, most importantly, how we are using these sources with other data to display how ownership of books in the nineteenth century varied, for example, between different parts of the country: which books, authors or titles were more popular than others, how many copies have survived, whether books related to the Icelandic Enlightenment had any success, whether books of certain genres had a better chance of survival than others, etc.

This is done by using several authority files that have been made in the library for other projects and are in TEI P5 XML: firstly, a detailed historical bibliography of Icelandic books from 1534 to 1844, and secondly, a list of all farms in Iceland with GPS coordinates.

I will also elaborate on how this project on the ownership and geography of books can be developed further and how the data can be of use to others. One aspect of my talk is the cooperation between librarians, academics and IT professionals and how unrelated sources can be linked together to bring out new knowledge and interpret history.

Project website: https://bokaskra.landsbokasafn.is/geography


11:45am - 12:00pm
Distinguished Short Paper (10+5min) [publication ready]

Icelandic Scribes: Results of a 2-Year Project

Sheryl McDonald Werronen

University of Copenhagen,

This paper contributes to the conference theme of History and introduces an online catalogue that recreates an early modern library: the main digital output of the author’s individual research project “Icelandic Scribes” (2016–2018 at the University of Copenhagen). The project has investigated the patronage of manuscripts by Icelander Magnús Jónsson í Vigur (1637–1702), his network of scribes and their working practices, and the significance of the library of hand-written books that he accumulated during his lifetime, in the region of Iceland called the Westfjords. The online catalogue is meant to be a digital resource that reunites this library virtually, gives detailed descriptions of the manuscripts, and highlights the collection’s rich store of texts and the individuals behind their creation. The paper also explores some of the challenges of integrating new data produced by this and other small projects like it with existing online resources in the field of Old Norse-Icelandic studies.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 654825.

 
4:00pm - 5:30pmF-PIV-2: Digital History
Session Chair: Mikko Tolonen
PIV 
 
4:00pm - 4:30pm
Long Paper (20+10min) [publication ready]

Historical Networks and Identity Formation: Digital Representation of Statistical and Geo- Data to Mobilize Knowledge. Case Study of Norwegian Migration to the USA (1870-1920)

Jana Sverdljuk

National Library of Norway,

The article is the result of a collaborative interdisciplinary workshop, which involved expertise from social sciences, history and digital humanities. It showed how computer-mediated ways of researching historical networks and identity formation of Norwegian-Americans substantially complemented historical and social sciences methods. Using the open API of the National Archives of Norway, we combined statistical, geo- and text data to produce an interactive temporal visualization of regional origins in Norway on a map of the USA. Spatial visualization allowed highlighting space and time and the changing regional belonging as fundamental values for understanding social and cultural dimensions of migrants’ lives. We claim that data visualizations of space and time have performative materiality (Drucker 2013). They open a free room for a researcher to come up with his/her own narrative about the studied phenomenon (Perez and Granger 2015). Visualizations make us reflect on the relationship between the phenomenon and its representation (Klein 2014). This digital method supplements the classical sociological and socio-constructivist methods and therefore has knowledge-mobilizing effects. In the article, we show what potential this visualization has in relation to the particular field of emigration studies, when entering into a dialogue with the existing historical research in the field.


4:30pm - 4:45pm
Short Paper (10+5min) [abstract]

Spheres of “public” in eighteenth-century Britain

Mark J. Hill1, Antti Kanner1, Jani Marjanen1, Ville Vaara1, Eetu Mäkelä1, Leo Lahti2, Mikko Tolonen1

1University of Helsinki; 2University of Turku

The eighteenth century saw a transformation in the practices of public discourse. With the emergence of clubs, associations, and, in particular, coffee houses, civic exchange intensified from the late seventeenth century. At the same time print media was transformed: book printing proliferated; new genres emerged (especially novels and small histories); works printed in smaller formats made reading more convenient (including in public); and periodicals - generally printed onto single folio half-sheets - emerged as a separate category of printed work which was written specifically for public consumption, and with the intention of influencing public discourse (such periodicals were intended to be both ephemeral and shared, often read, and then discussed, publicly each day). This paper studies how these changes may be recognized in language by quantitatively studying the word “public” and its semantic context in Eighteenth Century Collections Online (ECCO).

While there are many descriptions of the transformation of public discourse (both contemporary and historical), there has been limited research into the language revolving (and evolving) around “public” in the eighteenth century. Jürgen Habermas (2003: 2-3) famously argues that the emergence of words such as “Öffentlichkeit” in German and “publicity” in English is indicative of a change in the public sphere more generally. The conceptual history of “Öffentlichkeit” has been further studied in depth by Lucian Hölscher (1978), but a systematic study of the semantic context of “public” in British eighteenth-century material is missing. Studies that have covered this topic, such as Gunn (1989), base their findings on a very limited set of source material. In contrast, this study, by using a large-scale digitized corpus, aims to supplement earlier studies that focus on individual speech acts or particular collections of sources, and provide a more comprehensive account of how the language of “public” changed in the eighteenth century.

The historical subject matter means that the study is based on the ECCO corpus. While ECCO is in many ways an invaluable resource, a key goal of this study is to be methodologically sound from the perspective of corpus-linguistics and intellectual history, while developing insights which are relevant more generally to sociologists and historians. In this regard, ECCO does come with its own particular problems: both in terms of content and size.

With regard to content: OCR mistakes remain problematic; its heterogeneity in genres can skew investigations; and the unpredictable nature of duplicate texts introduced by numerous reprints of certain volumes must be taken into account. However, many of these problems can be mitigated in different ways. For example, in specific cases we compare findings with the, much smaller, ECCO TCP (an OCR corrected subset of ECCO). We have further used the English Short Title Catalogue (ESTC) to connect textual findings with relevant metadata information contained in the catalogue. By merging ESTC metadata with ECCO, one can more easily use existing historical knowledge (for example, issues around reprints and multiple editions) to engage with the corpus.

With regard to size: the corpus itself is too big to run automatic parsers on. We have therefore extracted a separate, and smaller, corpus (with the help of ESTC metadata) to do more complex and demanding analyses. Results of these analyses were then replicated in a much simpler and cruder form on the whole dataset to gauge whether they corroborate the initial observations.

The size constraints provide their own advantages, however. The smaller subsections were chosen to represent pamphlets and other similar short documents by extracting all documents with fewer than 10406 characters in them. Compared to other specific genres or text types, this proved to be a successful method when attempting to define a meaningful subcorpus, while at the same time limiting effects of reprints, and including a relatively large number of individual writers in the analysis. The subjects covered by pamphlets also tend to be quite historically topical, and, as these are shorter texts, inspecting single occurrences in their original context is much more efficient, since things such as main theme, context, and writer’s intentions reveal themselves more quickly than in larger works. Thus, issues around distant and close reading are more easily overcome. In addition, we are able to compare semantic change between the larger corpus and the more rapidly shifting topical and political debates found in pamphlets, which offers its own historical insights.

In terms of specific linguistic approaches, analysis started with examinations of contextual distributions of “public” by year. Then, by changing the parameters of this analysis (for example, by defining the context as a set of syntactic dependencies engaged by public, or as collocation structures of a wider lexical environment) different aspects of the use of “public” can be brought to the foreground.

As syntactic constraints govern possibilities of combinations of words in shorter ranges of context, the narrower context windows contain a lot of syntactic information in addition to collocational information. Because of this syntactic restrictedness of close range combinations, the semantic relatedness of words with similar short range context distributions is one of degree of mutual interchangeability and, as such, of metaphorical relatedness (Heylen, Peirsman, Geeraerts and Speelman 2008). Wider context windows, such as paragraphs, are free from syntactic constraints, and so semantic relatedness between two words with similar wide range context distributions carries information from frequent contiguity in context and can be described as more metonymical than metaphorical by nature, as is visible from applications based on term-document matrices, such as topic modelling or Latent Semantic Analysis (cf. Blei, Ng and Jordan (2003) and Dumais (2005)).

The syntactic dependencies were counted by analysing the pamphlet subcorpus using Stanford Lexical Parser (Chen and Manning 2014). Results show changes in the tendency to use “public” as an adjective attribute and in compound positions. Since in English the overwhelmingly most frequent position for both adjective attributes and compounding attributes is preceding head words, this analysis could be adequately replicated using bigrams in the whole dataset. Lexical environments have been analysed by clustering second order collocations (cf. Bertels and Speelman (2014)) and replicated by using a random sampling from the whole dataset to produce the second order vectors.
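The bigram replication described above can be approximated with a few lines of code. This is a rough sketch under stated assumptions: tokenisation is a plain lowercase letter-sequence match, which ignores OCR noise and historical spelling, and the function name `public_bigrams` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def public_bigrams(text):
    """Count bigrams whose first word is 'public' (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if first == "public"
    )


sample = "The public good requires public opinion; the Public good again."
for bigram, n in public_bigrams(sample).most_common():
    print(" ".join(bigram), n)
```

Counting only the word that follows “public” relies on the observation above that attributes usually precede their head words in English; it will miss other constructions.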

The study of all bigrams relating to “public” (such as “public opinion”, “public finances”, “public religion”) in ECCO provides for a broader analysis of the use of “public” in eighteenth-century discourse that not only focuses on particular compounds, but provides a better idea of which domains “public” was used in. It points towards a declining trend in the relative frequency of religious bigrams during the course of the eighteenth century and a rise in the relative frequency of secular bigrams - both political and economic. This allows us to present three arguments: First, it is argued that this is indicative of an overall shift in the language around “public” as the concept’s focus changed and it began to be used in new domains. This expansion of discourses or domains in which “public” was used is confirmed in the analyses of a wider lexical environment. Second, we also notice that some collocations of “public”, such as “public opinion” and “public good”, gained a stronger rhetorical appeal. They became tropes in their own right and gained a future orientation in political discourse in the latter half of the eighteenth century (Koselleck 1972). Third, by combining the results of the distributional semantics of “public” in ECCO with information extracted from ESTC, one can recognize how different groups used the language relating to “public” in different ways. For example, authors writing on religious topics tended to use “public” differently from authors associated with the Enlightenment in Scotland or France.

There are two important upshots to this study: the methodological and the historical. With regard to the former, the paper works as a convincing case study which could be used as an example, or workflow, for studying other words that are pivotal to large structural change. With regard to the latter, the work is of particular historical relevance to recent discussions in eighteenth-century intellectual history. In particular, the study contributes to the critical discussion of Habermas that has been taking place in the English-speaking world since the translation of his Structural Transformation of the Public Sphere in 1989, while also informing more traditional historical analyses which have not been able to draw tools from the digital humanities (Hill 2017).

References

Bertels, Ann and Dirk Speelman (2014). “Clustering for semantic purposes. Exploration of semantic similarity in a technical corpus.” Terminology 20:2, pp. 279–303. John Benjamins Publishing Company.

Blei, David, Andrew Y. Ng and Michael I. Jordan (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (4–5), pp. 993–1022.

Chen, Danqi and Christopher D Manning (2014). “A Fast and Accurate Dependency Parser using Neural Networks.” Proceedings of EMNLP 2014.

Dumais, Susan T. (2005). Latent Semantic Analysis. Annual Review of Information Science and Technology. 38: 188–230.

Gunn, J.A.W. (1989). “Public opinion.” Political Innovation and Conceptual Change (Edited by Terence Ball, James Farr & Russell L. Hanson). Cambridge: Cambridge University Press.

Habermas, Jürgen (2003 [1962]). The Structural Transformation of the Public Sphere: An Inquiry into a Category of Bourgeois Society. Cambridge: Polity.

Heylen, Christopher, Yves Peirsman, Dirk Geeraerts and Dirk Speelman (2008). “Modelling Word Similarity: An Evaluation of Automatic Synonymy Extraction Algorithms.” Proceedings of LREC 2008.

Hill, Mark J. (2017), “Invisible interpretations: reflections on the digital humanities and intellectual history.” Global Intellectual History 1.2, pp. 130-150.

Hölscher, Lucian (1978), “‘Öffentlichkeit.’” Otto Brunner et al. (Hrsg.) Geschichtliche Grundbegriffe. Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. Band 4, Stuttgart, Klett-Cotta, pp. 413–467.

Koselleck, Reinhart (1972), “‘Einleitung.’” Otto Brunner, Werner Conze & Reinhart Koselleck (hrsg.), Geschichtliche Grundbegriffe. Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. Band I, Stuttgart, Klett-Cotta, pp. XIII–XXVII.


4:45pm - 5:00pm
Short Paper (10+5min) [abstract]

Charting the ‘Culture’ of Cultural Treaties: Digital Humanities approaches to the history of international ideas

Benjamin G. Martin

Uppsala University

Cultural treaties are the bilateral or sometimes multilateral agreements among states that promote and regulate cooperation and exchange in the fields of life we call cultural or intellectual. Pioneered by France just after World War I, this type of treaty represents a distinctive technology of modern international relations, a tool in the toolkit of public diplomacy, a vector of “soft power.” One goal of a comparative examination of these treaties is to locate them in the history of public diplomacy and in the broader history of culture and power in the international arena. But these treaties can also serve as sources for the study of what the historian David Armitage has called “the intellectual history of the international.” In this project, I use digital humanities methods to approach cultural treaties as a historical source with which to explore the emergence of a global concept of culture in the twentieth century. Specifically, the project will investigate the hypothesis that the culture concept, in contrast to earlier ideas of civilization, played a key role in the consolidation of the post-World War II international order.

I approach the topic by charting how concepts of culture were given form in the system of international treaties between 1919 (when the first such treaty was signed) and 1972 (when UNESCO’s Convention on cultural heritage marked the “arrival” of a global embrace of the culture concept), studying them with the large-scale, quantitative methods of the digital humanities, as well as with the tools of textual and conceptual analysis associated with the study of intellectual history. In my paper for DH Nordic 2018, I will outline the topic, goals, and methods of the project, focusing on the ways we (that is, my colleagues at Umeå University’s HUMlab and I) seek to apply DH approaches to this study of global intellectual history.

The project uses computer-assisted quantitative analysis to analyze and visualize how cultural treaties contributed to the spread of cultural concepts and to the development of transnational cultural networks. We explore the source material offered by these treaties by approaching it as two distinct data sets. First, to chart the emergence of an international system of cultural treaties, we use quantitative analysis of the basic information, or “metadata” (countries, date, topic) from the complete set of treaties on cultural matters between 1919 and 1972, approximately 1250 documents. Our source for this information is the World Treaty Index (www.worldtreatyindex.com). This data can also help identify historical patterns in the emergence of a global network of bilateral cultural treaties. Once mapped, these networks will allow me to pose interesting questions by comparing them to any number of other transnational systems. How, for example, does the map of cultural agreements compare to that of trade treaties, military alliances, or to the transnational flows of cultural goods, capital, or migrants?
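As a sketch of how such metadata could be turned into a network, the following code counts treaties per pair of signatory countries and per year. The record layout (a `parties` list and a `year`) is an assumption made for illustration and is not the actual format of the World Treaty Index; the country names are placeholders.

```python
from collections import Counter
from itertools import combinations


def treaty_edges(treaties):
    """Count treaties per unordered pair of signatory countries.

    A multilateral treaty contributes one count to every pair of its parties.
    """
    edges = Counter()
    for treaty in treaties:
        for pair in combinations(sorted(set(treaty["parties"])), 2):
            edges[pair] += 1
    return edges


def treaties_per_year(treaties):
    """Count how many treaties were signed in each year."""
    return Counter(treaty["year"] for treaty in treaties)


# Placeholder records, not real treaty data.
toy_treaties = [
    {"parties": ["A", "B"], "year": 1920},
    {"parties": ["B", "A"], "year": 1935},
    {"parties": ["A", "B", "C"], "year": 1950},
]
print(treaty_edges(toy_treaties))
print(treaties_per_year(toy_treaties))
```

The resulting pair counts can serve as edge weights in a network of countries, which could then be compared with networks built in the same way from trade treaties or military alliances.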

Second, to identify the development of concepts, we will observe the changing use of key terms through quantitative analysis of the treaty texts. By treating a large group of cultural treaties as several distinct text corpora and, perhaps, as a single text corpus, we will be able to explore the treaties using textometry and topic modeling. The treaty texts (digital versions of most of which can be found online) will be limited to four subsets: a) Britain, France, and Italy, 1919-1972; b) India, 1947-1972; c) the German Reich (1919-1945) and the two German successor states (1949-1972); and d) UNESCO’s multilateral conventions (1945-1972). This selection is designed to approach a global perspective while taking into account practical factors, such as language and accessibility. Our use of text analysis seeks (a) to offer insight into the changing usage and meanings of concepts like “culture” and “civilization”; (b) to identify which key areas of cultural activity were regulated by the treaties over time and by world region; and (c) to clarify whether “culture” was used in a broad, anthropological sense, or in a narrower sense to refer to the realm of arts, music, and literature. This aspect of the project raises interesting challenges, for example regarding how best to manipulate a multilingual text corpus (with texts in English, French, and German, at least).

In these ways, the project seeks to contribute to our understanding of how the concept of culture that guides today’s international society developed. It also explores how digital tools can help us ask (and eventually answer) questions in the field of global intellectual history.


5:00pm - 5:15pm
Short Paper (10+5min) [abstract]

Facilitating Digital History in Finland: What can we learn from the past?

Mats Fridlund, Mila Oiva, Petri Paju

Aalto University

The paper discusses the findings of the project “From Roadmap to Roadshow: A collective demonstration & information project to strengthen Finnish digital history”. The project, a collaborative effort funded by the Kone Foundation, develops the history disciplines in Finland. The long paper proposed for the DHN2018 will discuss what we have learned about the present-day conditions of digital history in Finland, how digital humanities is facilitated today in Finland and abroad, and what suggestions we could give for strengthening the conditions for doing digital history research in Finland.

In the first phase of the project we conducted a survey among Finnish historians and identified several critical issues that require further development: creating better, up-to-date information channels on digital history resources and events; providing relevant education, skills, and teaching by historians; and helping historians and information technology specialists to meet and collaborate better and more systematically than before. Many historians also had issues with the concept of digital history and difficulties with such an identity.

In order to situate Finnish digital history in the domestic and international contexts, we have studied the roots of computational history research in Finland, which date back to the 1960s, and current international best practice in digital history. We have visited selected digital humanities centers in Europe and the US, which we have identified as having “done something right”. Based on these studies, visits and interviews we will propose steps to be taken to further strengthen the digital history research community in Finland.