The World According to the Popes: A Geographical Study of the Papal Documents, 2005–2017
This paper seeks to explore what an atlas of the popes would be like. Can one study places in texts to map out latent meanings of the Vatican’s political and religious ambitions, and to anticipate evolving trends? Could spatial analysis be a key to better understanding a closed institution such as the papacy?
The Holy See is often associated with conservative stability. The papacy has, after all, managed to prevail while states and supranational organizations have come and gone. At the same time, the Vatican has shown a remarkable capacity to adapt to scientific findings as well as to a changing worldview. This complexity is also reflected in the geopolitical strategies of the Catholic Church. For centuries the Vatican has treated geography and politics as key means of strengthening the Holy See and securing its position on the international scene. During the twentieth century, for example, the church state expanded its global presence. When John Paul II was elected pope in 1978, the Vatican City had full diplomatic ties with 85 states. In 2005, when Benedict XVI was elected, that number had increased to 176. Moreover, the papacy now has formal diplomatic relations with the European Union, is represented as a permanent observer at various global organizations, including the United Nations, the African Union, and the World Trade Organization, and has even obtained a special membership in the Arab League (Agnew, 2010; Barbato, 2012). In fact, the Holy See has utilized the emergence of an international public sphere and a global stage to increase its soft power significantly (Barbato, 2012).
As the geopolitical conditions and ambitions of the Vatican City change, what happens to its perception of the world, of certain regions, and of places? Do the relationships between cities, countries, and regions constitute fixed historical patterns, or do these geographical structures evolve and change when a new pope is elected? Inspired by Franco Moretti, this study takes as its point of departure the notion that making connections between places and texts “will allow us to see some significant relationships that have so far escaped us” (Moretti, 1998: 3). The analysis is based on all papal documents in English translation from Benedict XVI (2005–2013) and Francis (2013–), retrieved from the Vatican webpage (http://www.vatican.va/holy_father/index.htm).
Methodological Preparations: Scraping Data and Extracting Entities
From a technical point of view, the empirical material used in this study has been prepared in three steps. First, all web page documents in English have been downloaded, and the (proper) text in each document has been extracted and stored. Secondly, the places mentioned in each text document have been identified and extracted using the Stanford Named Entity Recognizer (NER) software. Thirdly, the resulting list of places has been manually reduced by merging name variations of the same place (e.g. “Sweden” and “Kingdom of Sweden”).
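The third step, the manual merging of name variations, can be thought of as applying a hand-made alias table to the entity counts. The following is a minimal sketch of that idea, not the procedure actually used in the study: the names ALIASES and merge_variants are hypothetical, and the only alias taken from the text is “Kingdom of Sweden”.

```python
from collections import Counter

# Hand-made alias table: variant name -> canonical name.
# Only this one pair is mentioned in the text; a real table would be longer.
ALIASES = {"Kingdom of Sweden": "Sweden"}


def merge_variants(counts, aliases=ALIASES):
    """Sum the counts of all name variants under their canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged


# Illustrative counts, not figures from the corpus.
raw_counts = {"Sweden": 4, "Kingdom of Sweden": 1, "Rome": 10}
merged_counts = merge_variants(raw_counts)
```

Names that have no entry in the table are passed through unchanged, so the table only needs to list the variants.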
The Vatican’s communication strategies differ from, say, those of the daily press or the parliamentary parties in that they have a thousand-year perspective, or work from the point of view of eternity (Hägg, 2007). This is reflected on the Vatican’s webpage, which is immensely informative. Text material from all popes since the late nineteenth century is publicly accessible online, ranging from letters, speeches, and bulls to encyclicals, all with a high optical character recognition (OCR) quality. Since the Holy See has always been, according to Göran Hägg, a “mediated one man show”, it makes sense to focus on a corpus of texts written or spoken by the popes in order to study the Vatican’s notion of, basically, everything (Hägg, 2007: 239). The period 2005 to 2016 is pragmatically chosen because of its comprehensive volume of papal documents in English translation. For earlier periods, as Illustration 1 shows, one basically needs to master Latin or Italian. While, for example, the English texts from John Paul II (1978–2005) amount to two million words, the corpus of Benedict XVI (2005–2013) together with the current pope, Francis, sums up to nearly 59 million words, spread over some 5000 documents.
Illustration 1. The table shows the change in English translated text material available at the Vatican webpage.
The text documents were extracted, or “scraped”, from the Vatican web site using scripts written in the Python programming language. The Scrapy library was used to “crawl” the web site, that is, to follow links of interest, starting from each pope’s home page, and to download each web page that contains a document in English. The site traversal (crawling) was governed by a set of rules specifying which links to follow and which target web pages (documents) to download. The links to follow included all links in the left-side navigation menu on the pope’s home page and the “paging” links in each referenced page. These links were easily identified using commonalities in the link URLs, and the web pages with the target text documents (in HTML) were likewise identified by links matching the pattern “.../content/name-of-pope/en/.../documents/”. Finally, the BeautifulSoup Python library was used to extract and cleanse the actual text from the downloaded web pages. (The text was easily identified by a “.documento” CSS class.)
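The two identification rules described above, the URL pattern for target documents and the “documento” class for the text, can be illustrated with a small sketch. It is not the study’s scraper: it uses only the Python standard library in place of Scrapy and BeautifulSoup, the regular expression is one assumed reading of the pattern quoted above, and the names is_document_url, DocumentoText and extract_text as well as the example URL are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed reading of ".../content/name-of-pope/en/.../documents/".
DOCUMENT_URL = re.compile(r"/content/[^/]+/en/(?:[^/]+/)*documents/")


def is_document_url(url):
    """True if the link looks like an English-language papal document."""
    return DOCUMENT_URL.search(url) is not None


class DocumentoText(HTMLParser):
    """Collects text nested inside any element with the 'documento' class.

    Simplification: assumes that every non-void tag is properly closed.
    """

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside the 'documento' element
        self.parts = []  # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "documento" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the whitespace-normalised text of the 'documento' element."""
    parser = DocumentoText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up page and URL, for illustration only.
page = ('<div class="menu">Home</div>'
        '<div class="documento"><p>Dear brothers,</p><p>peace<br>to Rome.</p></div>'
        '<p>footer</p>')
text = extract_text(page)
url = "http://www.vatican.va/content/name-of-pope/en/speeches/2015/documents/example.html"
```

In a crawler, a predicate like is_document_url would decide which followed links are downloaded as documents, while extract_text would be applied to each downloaded page.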
In the next step we ran the Stanford Named Entity Recognizer on the collected text material. This software is developed by the Stanford Natural Language Processing Group and is regarded as one of the most robust implementations of named entity recognition, that is, the task of finding, classifying, and extracting (or labeling) “entities” within a text. Stanford NER uses a statistical modeling method (Conditional Random Fields, CRFs), supports multiple languages, and includes several pre-trained classifier models (new models can also be trained). This study used one of the pre-trained models, the 3-class model (location, person, and organization) trained on data from CoNLL 2003 (Reuters Corpus), MUC 6 and MUC 7 (newswire), ACE (newswire, broadcast news), OntoNotes (various sources including newswire and broadcast news), and Wikipedia. (This training data is the reason why “Hell” was not identified as a place, and why “God” was rarely identified as either a person or a place. However, since the first two parts of the analysis focus on what could be labeled “earthly geography”, this was not considered a problem for the analysis.) Stanford NER tags each identified entity in the input text with the corresponding class label. These tagged entities were then extracted from the entire text corpus and stored in a single spreadsheet file, aggregated on the number of occurrences per entity and document. (The stored columns were document name, document year, type of document, name of pope, entity, entity classifier, and number of occurrences.)
For some of the places identified by Stanford NER it was difficult to assess whether they were in fact persons or organizations; these were nevertheless kept for the analysis. The same holds for abstract geographical entities such as “East”, for very specific ones that are still difficult to locate geographically, like “Beautiful Gate of the Temple”, and for an entity like “Rome-Byzantium-Moscow”, which could be interpreted as a historic political alliance. After all, the interest of this study lies in the general connections between places, not the rare ones, which easily disappear in the larger patterns.
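The aggregation of tagged entities per document can be sketched as follows. The sketch assumes that the tagger output is available as inline XML, with entities wrapped in tags such as <LOCATION>...</LOCATION>; the output format actually used in the study is not stated, and the names TAGGED and count_entities as well as the sample sentence are hypothetical.

```python
import re
from collections import Counter

# Matches an entity wrapped in one of the three class tags (assumed format).
TAGGED = re.compile(r"<(LOCATION|PERSON|ORGANIZATION)>(.+?)</\1>")


def count_entities(tagged_docs):
    """Count occurrences per (document name, entity, class label)."""
    counts = Counter()
    for doc_name, tagged_text in tagged_docs.items():
        for label, entity in TAGGED.findall(tagged_text):
            counts[(doc_name, entity, label)] += 1
    return counts


# Made-up tagged text, for illustration only.
tagged_docs = {
    "doc1": ("From <LOCATION>Rome</LOCATION> to <LOCATION>Assisi</LOCATION> "
             "and back to <LOCATION>Rome</LOCATION> with <PERSON>Francis</PERSON>."),
}
entity_counts = count_entities(tagged_docs)
```

Each key of the resulting counter corresponds to one row of the spreadsheet described above; the remaining columns (document year, type of document, name of pope) would be joined in from the document metadata.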
Papa Analytics
Based on the methodological preparations, the analysis consists of three parts using different methods, of which the first two utilize the identified place entities. First, the study introduces the spatial world of the recent papacy, using simpler methods to trace, for example, which places occur in the texts, their frequencies, their division into geopolitical and sacred places, and which places dominate. It also traces how the geographical density has changed over time, that is, how many places (in total or unique) are mentioned per document or per 1000 words.
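The density measure amounts to simple arithmetic, which the following sketch makes explicit. The function name place_density and the example figures are hypothetical and do not come from the corpus.

```python
def place_density(place_mentions, word_count):
    """Total and unique place mentions per 1000 words of one document."""
    total = len(place_mentions)
    unique = len(set(place_mentions))
    return {
        "total_per_1000_words": 1000 * total / word_count,
        "unique_per_1000_words": 1000 * unique / word_count,
    }


# A made-up document of 2000 words mentioning four places, three of them distinct.
density = place_density(["Rome", "Rome", "Assisi", "Europe"], word_count=2000)
```

Averaging these values over all documents of a given year would give one possible time series of geographical density.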
Secondly, the analysis studies the clusters of “co-occurring” places, based on places mentioned in the same document. Since most individual papal texts are dedicated to a certain topic, one can assume that places in a document have something in common. The term frequency-inverse document frequency (tf-idf) weighting is used as a measure of how important a place is in a specific document, and this weight is used in the co-occurrence computation. This unfolds the latent geographical network, as it is articulated by the papacy, with centers and peripheries, and both sacred and geopolitical aspects.
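One way to turn tf-idf weights into a co-occurrence network is sketched below. The exact formulation used in the study is not specified here, so the choices in this sketch are assumptions: raw counts as term frequency, the natural logarithm of N/df as inverse document frequency, and an edge weight equal to the sum over documents of the product of the two places’ weights. The function names and the toy counts are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_weights(docs):
    """docs maps document name -> {place: count}; returns tf-idf per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each place occurs.
    df = Counter(place for counts in docs.values() for place in counts)
    return {
        doc: {place: tf * math.log(n_docs / df[place]) for place, tf in counts.items()}
        for doc, counts in docs.items()
    }


def cooccurrence(weights):
    """Edge weight of a pair = sum over documents of the product of weights."""
    edges = defaultdict(float)
    for doc_weights in weights.values():
        for a, b in combinations(sorted(doc_weights), 2):
            edges[(a, b)] += doc_weights[a] * doc_weights[b]
    return dict(edges)


# Made-up place counts for three documents.
docs = {
    "d1": {"Rome": 3, "Assisi": 1},
    "d2": {"Rome": 1, "Jerusalem": 2},
    "d3": {"Jerusalem": 1, "Bethlehem": 1},
}
edges = cooccurrence(tfidf_weights(docs))
```

Under this weighting a place that occurs in every document receives weight zero and therefore contributes nothing to any edge, which keeps ubiquitous places from dominating the network.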
Last but not least, this study tries to map the space of the divine, as it is expressed through Benedict XVI and Pope Francis, using word2vec, a method developed by a team at Google in 2013 to produce word embeddings (Mikolov et al., 2013). Simply put, the algorithm positions the vocabulary of a corpus in a high-dimensional vector space based on the assumption that “words which are similar in meaning occur in similar contexts” (Rubenstein & Goodenough, 1965: 627). This enables the use of basic numerical methods to compute word (dis-)similarities, to find clusters of similar words, or to create scales on how (subsets of) words are related to certain dichotomies. This study investigates dichotomies such as “Heaven” and “Hell”, “Earth” and “Paradise”, or “God” and “Satan”. Hence, the third part of the study also seeks to relate the earthly geography to the religious space as articulated by the papacy.
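A scale between two poles of a dichotomy can be sketched with cosine similarity. The three-dimensional vectors below are invented for illustration and are not word2vec output; the score, the difference between a word’s cosine similarity to each pole, is one assumed way of building such a scale, and the names cosine, dichotomy_score and toy_vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dichotomy_score(word, pole_a, pole_b, vectors):
    """Positive if the word lies closer to pole_a, negative if closer to pole_b."""
    return cosine(vectors[word], vectors[pole_a]) - cosine(vectors[word], vectors[pole_b])


# Invented toy vectors; real embeddings would have a few hundred dimensions.
toy_vectors = {
    "heaven": [1.0, 0.0, 0.2],
    "hell": [-1.0, 0.0, 0.2],
    "paradise": [0.9, 0.1, 0.2],
    "rome": [0.0, 1.0, 0.0],
}
paradise_score = dichotomy_score("paradise", "heaven", "hell", toy_vectors)
rome_score = dichotomy_score("rome", "heaven", "hell", toy_vectors)
```

With trained embeddings, the same score could be computed for every place name in the corpus, which would position the earthly geography along a religious axis.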
References
Agnew, J. (2010). Deus Vult: The Geopolitics of the Catholic Church. Geopolitics, 15(1), 39–61.
Barbato, M. (2012). Papal Diplomacy: The Holy See in World Politics. IPSA XXII World Conference of Political Science, (2003), 1–29.
Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363–370.
Florian, R., Ittycheriah, A., Jing, H., & Zhang, T. (2003). Named Entity Recognition through Classifier Combination. Proceedings of CoNLL-2003. Edmonton, Canada.
Hägg, G. (2007). Påvarna: två tusen år av makt och helighet. Stockholm: Wahlström & Widstrand.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space, 1–12.
Moretti, F. (1998). Atlas of the European novel: 1800–1900. New York: Verso.
Rodriquez, K. J., Bryant, M., Blanke, T., & Luszczynska, M. (2012). Comparison of Named Entity Recognition tools for raw OCR text. Proceedings of KONVENS 2012 (LThist 2012 Workshop), 2012, 410–414.
Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.