Digital Humanities in the Nordic Countries 3rd Conference

11:00am - 11:15am
Distinguished Short Paper (10+5min)

Big Data and the Afterlives of Medieval and Renaissance Manuscripts

Toby Burrows¹,², Lynn Ransom³, Hanno Wijsman⁴, Eero Hyvönen⁵,⁶

¹University of Oxford; ²University of Western Australia; ³University of Pennsylvania; ⁴Institut de recherche et d'histoire des textes; ⁵Aalto University; ⁶University of Helsinki

Tens of thousands of European medieval and Renaissance manuscripts have survived until the present day. As the result of changes of ownership over the centuries, they are now spread all over the world, in collections across Europe, North America, Asia and Australasia. They often feature among the treasures of libraries, museums, galleries, and archives, and they are frequently the focus of exhibitions and events in these institutions. They provide crucial evidence for research in many disciplines, including textual and literary studies, history, cultural heritage, and the fine arts. They are also objects of research in their own right, with disciplines such as paleography and codicology examining the production, distribution, and history of manuscripts, together with the people and institutions who created, used, owned, and collected them.

Over the last twenty years there has been a proliferation of digital data relating to these manuscripts, not just in the form of catalogues, databases, and vocabularies, but also in digital editions and transcriptions and – especially – in digital images of manuscripts. Overall, however, there is a lack of coherent, interoperable infrastructure for the digital data relating to these manuscripts, and the evidence base remains fragmented and scattered across hundreds, if not thousands, of data sources.

The complexity of navigating multiple printed sources to carry out manuscript research has, if anything, been increased by this proliferation of digital sources of data. Large-scale analysis, for both quantitative and qualitative research questions, still requires very time-consuming exploration of numerous disparate sources and resources, including manuscript catalogues and databases of digitized manuscripts, as well as many forms of secondary literature. As a result, most large-scale research questions about medieval and Renaissance manuscripts remain very difficult, if not impossible, to answer.

The “Mapping Manuscript Migrations” project, funded by the Trans-Atlantic Platform under its Digging into Data Challenge for 2017-2019, aims to address these needs. It is led by the University of Oxford, in partnership with the University of Pennsylvania, Aalto University in Helsinki, and the Institut de recherche et d’histoire des textes in Paris. The project is building a coherent framework to link manuscript data from various disparate sources, with the aim of enabling searchable and browsable semantic access to aggregated evidence about the history of medieval and Renaissance manuscripts.

This framework is being used as the basis for a large-scale analysis of the history and movement of these manuscripts over the centuries. The broad research questions being addressed include: how many manuscripts have survived; where they are now; and which people and institutions have been involved in their history. More specific research focuses on particular collectors and countries.

The paper will report on the first six months of this project. The topics covered will include the new digital platform being developed, the sources of data which are being combined, the data modeling being carried out to link disparate data sources, the research questions which this assemblage of big data is being used to address, and the ways in which this evidence can be presented and visualized.

11:15am - 11:30am
Short Paper (10+5min)

The World According to the Popes: A Geographical Study of the Papal Documents, 2005–2017

Roger Mähler, Fredrik Norén

Umeå University, Sweden

This paper seeks to explore what an atlas of the popes would be like. Can one study places in texts to map out latent meanings of the Vatican’s political and religious ambitions, and to anticipate evolving trends? Could spatial analysis be a key to better understanding a closed institution such as the papacy?

The Holy See is often associated with conservative stability. The papacy has, after all, managed to prevail while states and supranational organizations have come and gone. At the same time, the Vatican has shown a remarkable capacity to adapt to scientific findings as well as to a changing worldview. This complexity also reflects the geopolitical strategies of the Catholic Church. For centuries the Vatican has been conscious of geography and politics as key means of strengthening the Holy See and securing its position on the international scene. During the twentieth century, for example, the church state expanded its global presence. When John Paul II was elected pope in 1978, the Vatican City had full diplomatic ties with 85 states. In 2005, when Benedict XVI was elected, that number had increased to 176. Moreover, the papacy now has formal diplomatic relations with the European Union, is represented as a permanent observer at various global organizations, including the United Nations, the African Union, and the World Trade Organization, and has even obtained a special membership in the Arab League (Agnew, 2010; Barbato, 2012). In fact, the Holy See has utilized the emergence of an international public sphere and a global stage, and has significantly increased its soft power (Barbato, 2012).

As the geopolitical conditions and ambitions of the Vatican City change, what happens to its perception of the world, of certain regions, and of places? Do the relationships between cities, countries, and regions constitute fixed historical patterns, or are these geographical structures evolving and changing as a new pope is elected? Inspired by Franco Moretti, this study departs from the notion that making connections between places and texts “will allow us to see some significant relationships that have so far escaped us” (Moretti, 1998: 3). The basis of the analysis is all English-translated papal documents from Benedict XVI (2005–2013) and Francis (2013–), retrieved from the Vatican webpage (http://www.vatican.va/holy_father/index.htm).

Methodological Preparations: Scraping Data and Extracting Entities

From a technical point of view, the empirical material used in this study has been prepared in three steps. First, all web page documents in English have been downloaded, and the (proper) text in each document has been extracted and stored. Secondly, the places mentioned in each text document have been identified and extracted using the Stanford Named Entity Recognizer (NER) software. Thirdly, the resulting list of places has been manually reduced by merging name variations of the same place (e.g. “Sweden” and “Kingdom of Sweden”).
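The third step, the manual merging of name variations, could be recorded as a simple lookup table so that it can be re-applied whenever the corpus is re-processed. The following is a minimal sketch in Python; the alias table and the counts are invented for illustration and are not taken from the study's data.

```python
from collections import Counter

# Hypothetical alias table: each name variant maps to one canonical place name.
# In the study this reduction was done by hand; the entry below is illustrative.
ALIASES = {
    "Kingdom of Sweden": "Sweden",
}


def merge_variants(counts, aliases):
    """Sum the counts of all variants under their canonical place name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged


# Invented counts for a single document.
raw_counts = Counter({"Sweden": 3, "Kingdom of Sweden": 2, "Rome": 5})
merged_counts = merge_variants(raw_counts, ALIASES)
```

Names without an entry in the table are passed through unchanged, so the table only needs to list the variants that were actually merged.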

The Vatican’s communication strategies differ from, let’s say, those of the daily press or the parliamentary parties, in the sense that they have a thousand-year perspective, or work from the point of view of eternity (Hägg, 2007). This is reflected on the Vatican’s webpage, which is immensely informative. Text material from all popes since the late nineteenth century is publicly accessible online, ranging from letters, speeches and bulls to encyclicals, all with high optical character recognition (OCR) quality. Since the Holy See has always been, according to Göran Hägg, a “mediated one man show”, it makes sense to focus on a corpus of texts written or spoken by the popes in order to study the Vatican’s notion of, basically, everything (Hägg, 2007: 239). The period 2005 to 2016 is pragmatically chosen because of its comprehensive volume of English-translated papal documents. Before this period, as Illustration 1 shows, you basically need to master Latin or Italian. While, for example, the English texts from John Paul II (1978–2005) amount to two million words, the corpus of Benedict XVI (2005–2013) together with that of the current pope, Francis, sums up to nearly 59 million words, spread over some 5000 documents.

Illustration 1. The table shows the change in English translated text material available at the Vatican webpage.

The text documents were extracted, or “scraped”, from the Vatican web site using scripts written in the Python programming language. The Scrapy library was used to “crawl” the web site, that is, to follow links of interest, starting from each pope’s home page, and download each web page that contains a document in English. The site traversal (crawling) was governed by a set of rules specifying which links to follow and which target web pages (documents) to download. The links to follow included all links in the left-side navigation menu on the pope’s home page, and the “paging” links in each referenced page. These links were easily identified using commonalities in the link URLs, and the web pages with the target text documents (in HTML) were likewise identified by links matching the pattern “.../content/name-of-pope/en/.../documents/”. Finally, the BeautifulSoup Python library was used to extract and cleanse the actual text from the downloaded web pages. (The text was easily identified by a “.documento” CSS class.)
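The two identification rules described above can be sketched as follows. This is not the project's own code: to keep the example free of dependencies it uses Python's standard `re` and `html.parser` modules in place of Scrapy and the BeautifulSoup library, the regular expression is only one possible reading of the quoted URL pattern, and the URLs and HTML snippet are made up.

```python
import re
from html.parser import HTMLParser

# Assumed regular expression for ".../content/name-of-pope/en/.../documents/".
DOCUMENT_URL = re.compile(r"/content/[^/]+/en/.+/documents/")

# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def is_target_document(url):
    """Return True if the URL looks like an English-language document page."""
    return DOCUMENT_URL.search(url) is not None


class DocumentoExtractor(HTMLParser):
    """Collect the text found inside elements carrying the class 'documento'."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # nesting depth inside the matched element (0 = outside)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "documento" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the whitespace-normalized text of the 'documento' element(s)."""
    parser = DocumentoExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up page used only to exercise the sketch.
SAMPLE_HTML = (
    '<html><body><div class="menu">Home</div>'
    '<div class="documento"><p>Dear brothers and sisters,</p>'
    "<p>greetings from <b>Rome</b><br></p></div>"
    "<p>Footer</p></body></html>"
)
text = extract_text(SAMPLE_HTML)
```

The depth counter assumes that every non-void tag is explicitly closed; real pages with unclosed tags are one reason to prefer a more tolerant parser, such as the BeautifulSoup library used in the study.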

In the next step we ran the Stanford Named Entity Recognizer on the collected text material. This software is developed by the Stanford Natural Language Processing Group, and is regarded as one of the most robust implementations of named entity recognition, that is, the task of finding, classifying and extracting (or labeling) “entities” within a text. Stanford NER uses a statistical modeling method (Conditional Random Fields, CRFs), has multiple language support, and includes several pre-trained classifier models (new models can also be trained). This study used one of the pre-trained models, the 3-class model (location, person and organization) trained on data from CoNLL 2003 (Reuters Corpus), MUC 6 and MUC 7 (newswire), ACE (newswire, broadcast news), OntoNotes (various sources including newswire and broadcast news) and Wikipedia. (This is the reason why “Hell” was not identified as a place, and why “God” was rarely identified as either a person or a place. However, since the first two parts of the analysis will focus on what could be labeled as “earthly geography”, this was not considered a problem for the analysis.) Stanford NER tags each identified entity in the input text with the corresponding class. These tagged entities were then extracted from the entire text corpus and stored in a single spreadsheet file, aggregated on the number of occurrences per entity and document. (The stored columns were document name, document year, type of document, name of pope, entity, entity classifier, and number of occurrences.)
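The aggregation of tagged entities per document can be illustrated with a short sketch. It assumes that the tagger output is available in Stanford NER's inline “word/LABEL” format, with “O” marking tokens outside any entity; the function names, the document name and the tagged sentence are invented for the example, and the remaining spreadsheet columns (year, document type, pope) are left out.

```python
from collections import Counter
from itertools import groupby


def extract_entities(tagged_text):
    """Yield (entity, label) pairs, joining adjacent tokens with the same label."""
    tokens = [tok.rsplit("/", 1) for tok in tagged_text.split() if "/" in tok]
    for label, group in groupby(tokens, key=lambda pair: pair[1]):
        if label != "O":
            yield " ".join(word for word, _ in group), label


def count_entities(tagged_documents):
    """Count occurrences per (document, entity, label) combination."""
    rows = Counter()
    for name, tagged_text in tagged_documents.items():
        for entity, label in extract_entities(tagged_text):
            rows[(name, entity, label)] += 1
    return rows


# Invented tagger output for one hypothetical document.
tagged_documents = {
    "doc1": (
        "The/O Pope/O visited/O New/LOCATION York/LOCATION and/O the/O "
        "United/ORGANIZATION Nations/ORGANIZATION ./O "
        "New/LOCATION York/LOCATION welcomed/O him/O ./O"
    ),
}
rows = count_entities(tagged_documents)
```

Because the inline format carries no entity boundaries, two distinct entities of the same class that stand directly next to each other would be joined into one; in running text they are usually separated by a comma or another untagged token.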

For some of the entities that Stanford NER identified as places, it was difficult to assess whether they were in fact persons or organizations; they were nevertheless kept for the analysis. The same applies to abstract geographical entities such as “East”, to very specific ones that are still difficult to locate geographically, like “Beautiful Gate of the Temple”, and to an entity like “Rome-Byzantium-Moscow”, which could be interpreted as a historic political alliance. After all, in this study the interest lies in the general connections between places, not the rare ones, which easily disappear in the larger patterns.

Papa Analytics

Based on the methodological preparations, the analysis consists of three parts, using different methods, of which the first two parts will utilize the identified place entities. First, the study introduces the spatial world of the recent papacy, using simpler methods to trace, for example, what places occur in the texts, their frequencies, their divisions (whether geopolitical or sacred), and which places dominate. Furthermore, it traces how the geographical density has changed over time, that is, how many places (in total or unique ones) are mentioned per document or per 1000 words.
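The density measure mentioned here is simple arithmetic: the number of place mentions divided by the number of words, multiplied by 1000. A minimal sketch, using invented figures and a hypothetical record layout (year, word count, and place counts per document), might look like this:

```python
from collections import defaultdict


def density_by_year(documents):
    """Return mentions per 1000 words and number of unique places per year."""
    words = defaultdict(int)
    mentions = defaultdict(int)
    unique = defaultdict(set)
    for doc in documents:
        year = doc["year"]
        words[year] += doc["word_count"]
        mentions[year] += sum(doc["places"].values())
        unique[year].update(doc["places"])
    return {
        year: {
            "mentions_per_1000_words": 1000 * mentions[year] / words[year],
            "unique_places": len(unique[year]),
        }
        for year in words
    }


# Invented example records; not figures from the papal corpus.
documents = [
    {"year": 2006, "word_count": 1500, "places": {"Rome": 4, "Jerusalem": 2}},
    {"year": 2006, "word_count": 500, "places": {"Rome": 2}},
    {"year": 2014, "word_count": 800, "places": {"Lampedusa": 3, "Rome": 1}},
]
density = density_by_year(documents)
```

The same aggregation could be keyed on pope or document type instead of year by changing the grouping field.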

Secondly, the analysis studies the clusters of “co-occurring” places, based on places mentioned in the same document. Since most individual papal texts are dedicated to a certain topic, one can assume that places in a document have something in common. The term frequency-inverse document frequency (tf-idf) weighting is used as a measure of how important a place is in a specific document, and this weight is used in the co-occurrence computation. This unfolds the latent geographical network, as it is articulated by the papacy, with centers and peripheries, and both sacred and geopolitical aspects.
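One way to read the weighting described above is sketched below. The tf-idf variant (relative frequency within the document times the logarithm of the inverse document frequency) and the choice to weight each co-occurring pair by the product of the two places' tf-idf values are assumptions made for the illustration; the abstract does not specify either. The counts are invented.

```python
import math
from collections import defaultdict
from itertools import combinations


def tfidf(doc_counts):
    """Compute a tf-idf weight for every place in every document."""
    n_docs = len(doc_counts)
    doc_freq = defaultdict(int)
    for counts in doc_counts.values():
        for place in counts:
            doc_freq[place] += 1
    weights = {}
    for doc, counts in doc_counts.items():
        total = sum(counts.values())
        weights[doc] = {
            place: (n / total) * math.log(n_docs / doc_freq[place])
            for place, n in counts.items()
        }
    return weights


def cooccurrence(weights):
    """Sum, over all documents, the product of the weights of each place pair."""
    edges = defaultdict(float)
    for doc_weights in weights.values():
        for a, b in combinations(sorted(doc_weights), 2):
            edges[(a, b)] += doc_weights[a] * doc_weights[b]
    return dict(edges)


# Invented place counts for three hypothetical documents.
doc_counts = {
    "doc1": {"Rome": 3, "Jerusalem": 1},
    "doc2": {"Rome": 2, "Lampedusa": 2},
    "doc3": {"Jerusalem": 1, "Bethlehem": 1},
}
edges = cooccurrence(tfidf(doc_counts))
```

With this variant, a place that occurs in every document receives a weight of zero and therefore contributes nothing to the network, which is the intended effect of the inverse document frequency: ubiquitous places do not dominate the clusters.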

Last but not least, this study tries to map the space of the divine, as it is expressed through Benedict XVI and Pope Francis, using word2vec, a method developed by a team at Google in 2013, to produce word embeddings (Mikolov et al., 2013). Simply put, the algorithm positions the vocabulary of a corpus in a high-dimensional vector space based on the assumption that “words which are similar in meaning occur in similar contexts” (Rubenstein & Goodenough, 1965: 627). This enables the use of basic numerical methods to compute word (dis-)similarities, to find clusters of similar words, or to create scales on how (subsets of) words are related to certain dichotomies. This study investigates dichotomies such as “Heaven” and “Hell”, “Earth” and “Paradise”, or “God” and “Satan”. Hence, the third part of the study also seeks to relate the earthly geography with the religious space as articulated by the papacy.
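The “basic numerical methods” referred to here can be illustrated with cosine similarity and a simple dichotomy scale. In the sketch below, a word is placed on a scale by comparing its vector with the difference between the vectors of the two poles; this is one common construction and is assumed here, not taken from the study. The three-dimensional vectors and the names `place_x` and `place_y` are toy values, whereas real word2vec embeddings typically have a few hundred dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def scale_position(word_vec, pole_a, pole_b):
    """Position on the pole_b .. pole_a scale; positive means closer to pole_a."""
    axis = [a - b for a, b in zip(pole_a, pole_b)]
    return cosine(word_vec, axis)


# Toy vectors, invented for the example.
vectors = {
    "heaven": [1.0, 0.2, 0.0],
    "hell": [-1.0, 0.1, 0.0],
    "place_x": [0.8, 0.5, 0.1],
    "place_y": [-0.6, 0.4, 0.2],
}
score_x = scale_position(vectors["place_x"], vectors["heaven"], vectors["hell"])
score_y = scale_position(vectors["place_y"], vectors["heaven"], vectors["hell"])
```

In this toy setting `place_x` falls on the “heaven” side of the scale and `place_y` on the “hell” side; with trained embeddings the same computation could be applied to every place name in the corpus.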

References

Agnew, J. (2010). Deus Vult: The Geopolitics of the Catholic Church. Geopolitics, 15(1), 39–61.

Barbato, M. (2012). Papal Diplomacy: The Holy See in World Politics. IPSA XXII World Conference of Political Science, (2003), 1–29.

Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363–370.

Florian, R., Ittycheriah, A., Jing, H., & Zhang, T. (2003). Named Entity Recognition through Classifier Combination. Proceedings of CoNLL-2003. Edmonton, Canada.

Hägg, G. (2007). Påvarna: två tusen år av makt och helighet. Stockholm: Wahlström & Widstrand.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space, 1–12.

Moretti, F. (1998). Atlas of the European Novel: 1800–1900. New York: Verso.

Rodriquez, K. J., Bryant, M., Blanke, T., & Luszczynska, M. (2012). Comparison of Named Entity Recognition tools for raw OCR text. Proceedings of KONVENS 2012 (LThist 2012 Workshop), 2012, 410–414.

Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.

11:30am - 11:45am
Short Paper (10+5min)

Ownership and geography of books in mid-nineteenth century Iceland

Örn Hrafnkelsson

National and University Library of Iceland

In October 1865, the national librarian, the only employee of the National Library of Iceland (est. 1818), received permission from the bishop of Iceland to send a written request to all provosts around the country to carry out a detailed survey in their parishes of the ownership of old Icelandic books printed before 1816. The title page of each book on every farm was to be copied in full detail, with line breaks and ornaments, together with the number of printed pages, place of publication, etc.

The aim of this five-year project was to compile data for a detailed national bibliography and a list of Icelandic authors, in order to build up a good collection of books in the library.

Many of the written reports have survived and are now in the library archive. In my paper, I will talk about these unused sources on the ownership of books on every farm in Iceland, how Icelandic book history can now be interpreted in a new and different way and, most importantly, how we are using these sources together with other data to display how ownership of books in the nineteenth century varied, for example, between different parts of the country. Which books, authors or titles were more popular than others? How many copies have survived? Did books related to the Icelandic Enlightenment have any success? Did books of certain genres have a better chance of survival than others?

This is done by using several authority files in TEI P5 XML that have been made in the library for other projects: first, a detailed historical bibliography of Icelandic books from 1534 to 1844, and second, a list of all farms in Iceland with GPS coordinates.
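How such files could be joined can be sketched in a few lines of Python. The sketch assumes that the farm list uses the standard TEI P5 elements `place`, `placeName`, `location` and `geo` (latitude and longitude separated by a space) and that ownership reports refer to farms by their `xml:id`; the library's actual encoding may well differ, and all identifiers, names and coordinates below are invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented farm list in an assumed TEI P5 encoding.
SAMPLE_FARMS = """<listPlace xmlns="http://www.tei-c.org/ns/1.0">
  <place xml:id="farm-001">
    <placeName>Example Farm A</placeName>
    <location><geo>65.10 -18.50</geo></location>
  </place>
  <place xml:id="farm-002">
    <placeName>Example Farm B</placeName>
    <location><geo>64.20 -21.00</geo></location>
  </place>
</listPlace>"""


def load_farms(xml_text):
    """Map each farm id to (name, latitude, longitude)."""
    root = ET.fromstring(xml_text)
    farms = {}
    for place in root.iterfind(".//tei:place", TEI_NS):
        name = place.findtext("tei:placeName", namespaces=TEI_NS)
        geo = place.findtext("tei:location/tei:geo", namespaces=TEI_NS)
        lat, lon = (float(value) for value in geo.split())
        farms[place.get(XML_ID)] = (name, lat, lon)
    return farms


def books_per_location(reports, farms):
    """Count reported books per (latitude, longitude) of the owning farm."""
    counts = Counter()
    for farm_id, _book_id in reports:
        _name, lat, lon = farms[farm_id]
        counts[(lat, lon)] += 1
    return counts


# Invented ownership reports: (farm id, bibliography entry id).
reports = [("farm-001", "book-1"), ("farm-001", "book-2"), ("farm-002", "book-1")]
farms = load_farms(SAMPLE_FARMS)
book_counts = books_per_location(reports, farms)
```

The resulting counts per coordinate pair are the kind of table that a map display of book ownership could be drawn from.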

I will also elaborate on how this project on the ownership and geography of books can be developed further and how the data can be of use to others. One aspect of my talk is the cooperation between librarians, academics and IT professionals, and how unrelated sources can be linked together to bring out new knowledge and interpret history.

Project website: https://bokaskra.landsbokasafn.is/geography


7–9 March 2018, Helsinki
