Digital Humanities in the Nordic Countries 3rd Conference
8:00am - 9:00am | Breakfast | Lobby, Porthania
9:00am - 9:15am | Introduction to the Digital & Critical Friday | Think Corner
9:15am - 10:30am | Plenary 3: Caroline Bassett, ‘In that we travel there’ – but is that enough?: DH and Technological Utopianism. Session Chair: Johanna Sumiala. Also watchable remotely from PII, PIV and P674. | Think Corner
10:30am - 11:00am | Coffee break | Lobby, Porthania
11:00am - 12:00pm | F-PII-1: Creating and Evaluating Data. Session Chair: Koenraad De Smedt | PII
11:00am - 11:15am
Short Paper (10+5min) [publication ready] Digitisation and Digital Library Presentation System – A Resource-Conscientious Approach National Library of Finland, Finland The National Library of Finland (NLF) has done long-term work to digitise our unique collections and make them available. The digitisation policy defines what is to be digitised; it aims not only to target rare and unique materials but also to create a large corpus of certain material types. However, as digitisation resources are scarce, digitisation is planned annually and prioritised, with the library juggling individual researchers' needs against its own legal preservation and availability goals. The digital presentation system at digi.nationallibrary.fi plays a key role: because it sits next to the digitisation process, it enables fast operation and a streamlined flow of material via a digital chain from production to the end users. In this paper, we describe our digitisation process and the cost-effective improvements recently applied at the NLF. In addition, we evaluate how we could improve and enrich our digital presentation system and its existing material by utilising results and experience from existing research efforts. We also briefly examine positive examples from other national libraries and identify universal features and local differences.

11:15am - 11:30am
Short Paper (10+5min) [publication ready] Digitization of the collections at Ømålsordbogen – the Dictionary of Danish Insular Dialects: challenges and opportunities University of Copenhagen Ømålsordbogen (the Dictionary of Danish Insular Dialects, henceforth DID) is a historical dictionary giving thorough descriptions of the dialects, i.e. the spoken vernacular of peasants and fishermen, on the Danish isles of Zealand, Funen and the surrounding islands. It covers the period from 1750 to 1950, the core period being 1850 to 1920. Publishing began in 1992 and the latest volume (11, kurv-lindorm) appeared in 2013, but the project was initiated in 1909 and data collection dates back to the 1920s and 1930s. The project is currently undergoing an extensive process of digitization: old, outdated editing tools have been replaced with modern ones (database, XML, Unicode), and the old, printed volumes have been extracted to XML as well and are now searchable as a single XML file. Furthermore, the underlying physical data collections are being digitized. In the following we give a brief account of the digitization process, and we discuss a number of questions and dilemmas that this process gives rise to. The collections underlying the DID project comprise a variety of subcollections characterized by a large heterogeneity in terms of form as well as content. The information on the paper slips is usually densified, often idiosyncratic, and normally complicated to decode, even for other specialists. The digitization process naturally points towards web publication of the collections, either alone or in combination with the edited data, but it also gives rise to a number of questions. Since the current digitization process is very basic, adding only very few metadata fields (1–2 or 3), we point to the obvious fact that web publication of the collections presupposes the addition of further, carefully selected metadata, taking different user needs and qualifications into account. We also discuss the relationship between edited and non-edited data from a publication perspective. Some of the paper slips are very difficult to decipher due to handwriting or idiosyncratic densification, and we point out that web publication in a raw, i.e. non-edited or non-annotated, form might be more misleading than helpful for a number of users.

11:30am - 11:45am
Short Paper (10+5min) [abstract] Cultural heritage collections as research data 1University of Oxford; 2University of Western Australia This presentation will focus on the re-use of data relating to collections in libraries, museums and archives to address research questions in the humanities. Cultural heritage materials held in institutional collections are crucial sources of evidence for many disciplines, ranging from history and literature to anthropology and art. They are also the subjects of research in their own right – encompassing their form, their history, and their content, as well as their places in broader assemblages like collections and ownership networks. They can be studied for their unique and individual qualities, as Neil MacGregor demonstrated in his A History of the World in 100 Objects, but also as components within a much larger quantitative framework. Large-scale research into the history and characteristics of cultural heritage materials is heavily dependent on the availability of collections data in appropriate formats and sufficient quantities. Unfortunately, this kind of research has been seriously limited, for the most part, by lack of access to suitable curatorial data. In some instances this is simply because collection databases have not been made fully available on the Web – particularly the case with art galleries and some museums. Even where databases are available, however, they often cannot be downloaded in their entirety or through bulk selections of relevant content. Data downloads are frequently limited to small selections of specific records. Collections data are often available only in formats which are difficult to re-use for research purposes. In the case of libraries, the only export formats tend to be proprietary bibliographic schemas such as EndNote or RefCite. Even where APIs are made available, they may be difficult to use or limited in their functionality. CSV or XML downloads are relatively rare. Data licensing regimes may also discourage re-use, either through explicit limitations or through lack of clarity about terms and conditions. Even where researchers are able to download usable data, it is very rare for them to be able to feed back any cleaning or enhancing they may have done. The cultural heritage institutions supplying the data may be unable or unwilling to accept corrections or improvements to their records. They may also be suspicious of researchers developing new digital services which appear to compete with the original database. As a result, there has been a significant disconnect between curatorial databases and researchers, who have struggled to make effective use of what is potentially a very rich source of computationally usable evidence. One important consequence is that re-use of curatorial data by researchers often focuses on the data which are the easiest to obtain. The results are neither particularly representative nor exhaustive, and may weaken the validity of the conclusions drawn from the research. Some recent “collections as data” initiatives (such as collectionsasdata.github.io) have started to explore approaches to best practice for “computationally amenable collections”, with the aim of “encouraging cultural heritage organizations to develop collections and systems that are more amenable to emerging computational methods and tools”. In this presentation, I will suggest some elements of best practice for curatorial institutions in this area. My observations will be based on three projects which are addressing these issues.
The first project is “Collecting the West”, in which Western Australian researchers are working with the British Museum to deploy and evaluate the ResearchSpace software, which is designed to integrate heterogeneous collection data into a cultural heritage knowledge graph. The second project is HuNI – the Humanities Networked Infrastructure – which has been building a “virtual laboratory” for the humanities by reshaping collections data into semantic information networks. The third project – “Reconstructing the Phillipps Collection”, funded by the European Union under its Marie Curie Fellowships scheme – involved combining collections data from a range of digital and physical sources to reconstruct the histories of manuscripts in the largest private collection ever assembled. Curatorial institutions should recognize that there is a growing group of researchers who do not simply want to search or browse a collections database. There is an increasing demand for access to collections data for downloading and re-use, in suitable formats and on non-restrictive licensing terms. In return, researchers will be able to offer enhanced and improved ways of analyzing and visualizing data, as well as correcting and amplifying collection database records on the basis of research results. There are significant potential benefits for both sides of this partnership.
11:00am - 12:00pm | F-PIV-1: Manuscripts, Collections and Geography. Session Chair: Asko Nivala | PIV
11:00am - 11:15am
Distinguished Short Paper (10+5min) [abstract] Big Data and the Afterlives of Medieval and Renaissance Manuscripts 1University of Oxford; 2University of Western Australia; 3University of Pennsylvania; 4Institut de recherche et d'histoire des textes; 5Aalto University; 6University of Helsinki Tens of thousands of European medieval and Renaissance manuscripts have survived until the present day. As the result of changes of ownership over the centuries, they are now spread all over the world, in collections across Europe, North America, Asia and Australasia. They often feature among the treasures of libraries, museums, galleries, and archives, and they are frequently the focus of exhibitions and events in these institutions. They provide crucial evidence for research in many disciplines, including textual and literary studies, history, cultural heritage, and the fine arts. They are also objects of research in their own right, with disciplines such as paleography and codicology examining the production, distribution, and history of manuscripts, together with the people and institutions who created, used, owned, and collected them. Over the last twenty years there has been a proliferation of digital data relating to these manuscripts, not just in the form of catalogues, databases, and vocabularies, but also in digital editions and transcriptions and – especially – in digital images of manuscripts. Overall, however, there is a lack of coherent, interoperable infrastructure for the digital data relating to these manuscripts, and the evidence base remains fragmented and scattered across hundreds, if not thousands, of data sources. The complexity of navigating multiple printed sources to carry out manuscript research has, if anything, been increased by this proliferation of digital sources of data. Large-scale analysis, for both quantitative and qualitative research questions, still requires very time-consuming exploration of numerous disparate sources and resources, including manuscript catalogues and databases of digitized manuscripts, as well as many forms of secondary literature. As a result, most large-scale research questions about medieval and Renaissance manuscripts remain very difficult, if not impossible, to answer. The “Mapping Manuscript Migrations” project, funded by the Trans-Atlantic Platform under its Digging into Data Challenge for 2017-2019, aims to address these needs. It is led by the University of Oxford, in partnership with the University of Pennsylvania, Aalto University in Helsinki, and the Institut de recherche et d’histoire des textes in Paris. The project is building a coherent framework to link manuscript data from various disparate sources, with the aim of enabling searchable and browsable semantic access to aggregated evidence about the history of medieval and Renaissance manuscripts. This framework is being used as the basis for a large-scale analysis of the history and movement of these manuscripts over the centuries. The broad research questions being addressed include: how many manuscripts have survived; where they are now; and which people and institutions have been involved in their history. More specific research focuses on particular collectors and countries. The paper will report on the first six months of this project.
The topics covered will include the new digital platform being developed, the sources of data which are being combined, the data modeling being carried out to link disparate data sources, the research questions which this assemblage of big data is being used to address, and the ways in which this evidence can be presented and visualized.

11:15am - 11:30am
Short Paper (10+5min) [abstract] The World According to the Popes: A Geographical Study of the Papal Documents, 2005–2017 Umeå University, Sweden This paper seeks to explore what an atlas of the popes would be like. Can one study places in texts to map out latent meanings of the Vatican's political and religious ambitions, and to anticipate evolving trends? Could spatial analysis be a key to better understanding a closed institution such as the papacy? The Holy See is often associated with conservative stability. The papacy has, after all, managed to prevail while states and supranational organizations have come and gone. At the same time, the Vatican has shown remarkable capacity to adapt to scientific findings as well as a changing worldview. This complexity also reflects the geopolitical strategies of the Catholic Church. For centuries the Vatican has been conscious of geography and politics as key aspects in strengthening the Holy See and securing its position on the international scene. During the twentieth century, for example, the church state expanded its global presence. When John Paul II was elected pope in 1978, the Vatican City had full diplomatic ties with 85 states. In 2005, when Benedict XVI was elected, that number had increased to 176. Moreover, the papacy now has formal diplomatic relations with the European Union, is represented as a permanent observer to various global organizations including the United Nations, the African Union and the World Trade Organization, and has even obtained a special membership in the Arab League (Agnew, 2010; Barbato, 2012). In fact, the emergence of an international public sphere and a global stage has been utilized by the Holy See, and has significantly increased its soft power (Barbato, 2012). As the geopolitical conditions and ambitions of the Vatican City change, what happens to its perception of the world, of certain regions and places? Does the relationship between cities, countries and regions constitute fixed historical patterns, or are these geographical structures evolving and changing as a new pope is elected? Inspired by Franco Moretti, this study departs from the notion that making connections between places and texts “will allow us to see some significant relationships that have so far escaped us” (Moretti, 1998: 3). The basis of the analysis is all papal documents from Benedict XVI (2005–2013) and Francis (2013–) available in English translation, retrieved from the Vatican webpage (http://www.vatican.va/holy_father/index.htm).

Methodological Preparations: Scraping Data and Extracting Entities

From a technical point of view, the empirical material used in this study has been prepared in three steps. First, all web page documents in English have been downloaded, and the text proper of each document has been extracted and stored. Secondly, the places mentioned in each text document have been identified and extracted using the Stanford Named Entity Recognizer (NER) software. Thirdly, the resulting list of places has been manually reduced by merging name variations of the same place (e.g. “Sweden” and “Kingdom of Sweden”). The Vatican's communication strategies differ from, let’s say, those of the daily press or the parliamentary parties, in the sense that they have a thousand-year perspective, or work from the point of view of eternity (Hägg, 2007). This is reflected on the Vatican’s webpage, which is immensely informative.
Text material from all popes since the late nineteenth century is publicly accessible online, ranging from letters, speeches and bulls to encyclicals, all with a high optical character recognition (OCR) quality. Since the Holy See, according to Göran Hägg, has always been a “mediated one-man show”, it makes sense to focus on a corpus of texts written or spoken by the popes in order to study the Vatican’s notion of, basically, everything (Hägg, 2007: 239). The period 2005 to 2016 is pragmatically chosen because of its comprehensive volume of papal documents in English translation. Before this period, as Illustration 1 shows, you basically need to master Latin or Italian. While, for example, the English texts from John Paul II (1978–2005) amount to two million words, the corpus of Benedict XVI (2005–2013) together with the current pope Francis adds up to nearly 59 million words, spread over some 5,000 documents.

Illustration 1. The table shows the change in English-translated text material available at the Vatican webpage.

The text documents were extracted, or “scraped”, from the Vatican web site using scripts written in the Python programming language. The Scrapy library was used to “crawl” the web site, that is, to follow links of interest, starting from each Pope’s home page, and download each web page that contains a document in English. The site traversal (crawling) was governed by a set of rules specifying which links to follow and which target web pages (documents) to download. The links (to follow) included all links in the left-side navigation menu on the Pope’s home page, and the “paging” links in each referenced page. These links were easily identified using commonalities in the link URLs, and the web pages with the target text documents (in HTML) were likewise identified by links matching the pattern “.../content/name-of-pope/en/.../documents/”. The BeautifulSoup Python library was finally used to extract and cleanse the actual text from the downloaded web pages. (The text was easily identified by a “.documento” CSS class.) In the next step we ran the Stanford Named Entity Recognizer on the collected text material. This software is developed by the Stanford Natural Language Processing Group and is regarded as one of the most robust implementations of named entity recognition, that is, the task of finding, classifying and extracting (or labeling) “entities” within a text. Stanford NER uses a statistical modeling method (Conditional Random Fields, CRFs), has multiple language support, and includes several pre-trained classifier models (new models can also be trained). This study used one of the pre-trained models, the 3-class model (location, person and organization) trained on data from CoNLL 2003 (Reuters Corpus), MUC 6 and MUC 7 (newswire), ACE (newswire, broadcast news), OntoNotes (various sources including newswire and broadcast news) and Wikipedia. (This is the reason why “Hell” was not identified as a place, and why “God” was rarely a person, or a place. However, since the first two parts of the analysis focus on what could be labeled “earthly geography”, this was not considered a problem for the analysis.) Stanford NER tags each identified entity in the input text with the corresponding classifier. These tagged entities were then extracted from the entire text corpus and stored in a single spreadsheet file, aggregated by the number of occurrences per entity and document.
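As an illustrative sketch, not the project's actual code, the extraction and tagging steps described above might look as follows in Python, here using requests and BeautifulSoup in place of the full Scrapy spider, and NLTK's wrapper around the Stanford NER tagger; the URL, file paths and helper names are hypothetical:

# Rough reimplementation of the pipeline: requests + BeautifulSoup stand in
# for the Scrapy spider, NLTK wraps the Stanford NER tagger.
from collections import Counter

import requests
from bs4 import BeautifulSoup
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

def extract_text(url):
    """Fetch one document page and keep only the '.documento' element,
    the CSS class mentioned in the abstract."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    node = soup.select_one(".documento")
    return node.get_text(" ", strip=True) if node else ""

# Pre-trained 3-class model (LOCATION, PERSON, ORGANIZATION); both paths
# depend on the local Stanford NER installation.
tagger = StanfordNERTagger(
    "classifiers/english.all.3class.distsim.crf.ser.gz",
    "stanford-ner.jar",
)

def count_entities(text, wanted="LOCATION"):
    """Tag tokens and aggregate occurrences per entity, as in the
    spreadsheet described above. (Multi-word names such as 'United
    Nations' would additionally need adjacent same-label tokens merged.)"""
    tokens = word_tokenize(text)
    return Counter(tok for tok, label in tagger.tag(tokens) if label == wanted)

# Hypothetical document URL following the ".../content/name-of-pope/en/.../
# documents/" pattern quoted above.
url = "http://w2.vatican.va/content/francesco/en/letters/documents/example.html"
print(count_entities(extract_text(url)).most_common(10))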
(The stored columns were document name, document year, type of document, name of pope, entity, entity classifier, and number of occurrences.) Even though it was sometimes difficult to assess whether the places identified by Stanford NER were in fact persons or organizations, they were still kept for the analysis. Furthermore, abstract geographical entities such as “East”, very specific but geographically hard-to-identify ones like “Beautiful Gate of the Temple”, and entities like “Rome-Byzantium-Moscow”, which could be interpreted as a historic political alliance, were all kept for the analysis. After all, the interest of this study lies in the general connections between places, not the rare ones, which easily disappear in the larger patterns.

Papa Analytics

Based on the methodological preparations, the analysis consists of three parts, using different methods, of which the first two utilize the identified place entities. First, the study introduces the spatial world of the recent papacy, using simpler methods to trace, for example, what places occur in the texts, their frequencies, their divisions, whether geopolitical or sacred, which places dominate, and so on. Furthermore, it traces how the geographical density has changed over time, that is, how many places (in total or unique ones) are mentioned per document or per 1,000 words. Secondly, the analysis studies the clusters of “co-occurring” places, based on places mentioned in the same document. Since most individual papal texts are dedicated to a certain topic, one can assume that places in a document have something in common. Term frequency-inverse document frequency (tf-idf) weighting is used as a measure of how important a place is in a specific document, and this weight is used in the co-occurrence computation. This unfolds the latent geographical network as it is articulated by the papacy, with centers and peripheries, and both sacred and geopolitical aspects. Last but not least, this study tries to map the space of the divine, as expressed through Benedict XVI and Pope Francis, using word2vec, a method developed by a team at Google in 2013 to produce word embeddings (Mikolov et al., 2013). Simply put, the algorithm positions the vocabulary of a corpus in a high-dimensional vector space based on the assumption that “words which are similar in meaning occur in similar contexts” (Rubenstein & Goodenough, 1965: 627). This enables the use of basic numerical methods to compute word (dis-)similarities, to find clusters of similar words, or to create scales of how (subsets of) words relate to certain dichotomies. This study investigates dichotomies such as “Heaven” and “Hell”, “Earth” and “Paradise”, or “God” and “Satan”. Hence, the third part of the study also seeks to relate the earthly geography to the religious space articulated by the papacy.
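As a hedged sketch of the two place-based computations described above, using scikit-learn for the tf-idf weighted co-occurrence and gensim (version 4 API) for the word2vec dichotomy axis; the toy corpora stand in for the roughly 5,000 extracted documents, and all names are illustrative:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

# Part 2: tf-idf weighted co-occurrence of places. Each "document" here is
# the list of place entities extracted from one papal text.
place_docs = [
    "Rome Jerusalem Assisi",
    "Rome Lampedusa",
    "Jerusalem Bethlehem Rome",
]
vec = TfidfVectorizer()
weights = vec.fit_transform(place_docs)   # documents x places, tf-idf weighted
cooc = (weights.T @ weights).toarray()    # places x places co-occurrence
places = vec.get_feature_names_out()

# Part 3: a word2vec "dichotomy" axis such as Heaven vs. Hell. In practice
# the sentences would be all tokenised sentences of the corpus; min_count=1
# only so this tiny placeholder corpus runs at all.
sentences = [
    ["heaven", "and", "earth"],
    ["hell", "and", "satan"],
    ["paradise", "on", "earth"],
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)
axis = model.wv["heaven"] - model.wv["hell"]
axis /= np.linalg.norm(axis)

def dichotomy_score(word):
    """Project a word onto the axis: > 0 leans 'Heaven', < 0 leans 'Hell'."""
    v = model.wv[word]
    return float(v @ axis / np.linalg.norm(v))

print(dichotomy_score("paradise"))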
References

Agnew, J. (2010). Deus Vult: The Geopolitics of the Catholic Church. Geopolitics, 15(1), 39–61.
Barbato, M. (2012). Papal Diplomacy: The Holy See in World Politics. IPSA XXII World Conference of Political Science, 1–29.
Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 363–370.
Florian, R., Ittycheriah, A., Jing, H., & Zhang, T. (2003). Named Entity Recognition through Classifier Combination. Proceedings of CoNLL-2003, Edmonton, Canada.
Hägg, G. (2007). Påvarna: två tusen år av makt och helighet [‘The Popes: Two Thousand Years of Power and Holiness’]. Stockholm: Wahlström & Widstrand.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space, 1–12.
Moretti, F. (1998). Atlas of the European Novel 1800–1900. New York: Verso.
Rodriquez, K. J., Bryant, M., Blanke, T., & Luszczynska, M. (2012). Comparison of Named Entity Recognition tools for raw OCR text. Proceedings of KONVENS 2012 (LThist 2012 Workshop), 410–414.
Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.

11:30am - 11:45am
Short Paper (10+5min) [abstract] Ownership and geography of books in mid-nineteenth century Iceland National and University Library of Iceland In October 1865, the national librarian (the only employee of the National Library of Iceland, est. 1818) obtained permission from the bishop of Iceland to send a written request to all provosts around the country to conduct a detailed survey in their parishes of ownership of old Icelandic books printed before 1816. The title page of each book on every farm was to be copied in full detail, with line breaks and ornaments, number of printed pages, place of publication, etc. The aim of this five-year project was to compile data for a detailed national bibliography and a list of Icelandic authors, and to build up a good collection of books in the library. Many of the written reports have survived and are now in the library archive. In my paper, I will talk about these unused sources on the ownership of books on every farm in Iceland, how Icelandic book history can now be interpreted in a new and different way, and, most importantly, how we are using these sources together with other data to show how ownership of books in the nineteenth century varied, for example, across different parts of the country: which books, authors or titles were more popular than others, how many copies have survived, whether books related to the Icelandic Enlightenment had any success, whether books of certain genres had a better chance of survival than others, and so on. This is done by using several authority files that have been made in the library for other projects and are in TEI P5 XML: firstly, a detailed historical bibliography of Icelandic books from 1534 to 1844, and secondly a list of all farms in Iceland with GPS coordinates. I will also elaborate on how this project on the ownership and geography of books can be developed further and how the data can be of use to others. One aspect of my talk is the cooperation between librarians, academics and IT professionals, and how unrelated sources can be linked together to bring out new knowledge and interpret history. Project website: https://bokaskra.landsbokasafn.is/geography
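A minimal sketch of the kind of linking described in the talk, joining survey-based ownership records to the farm authority file with GPS coordinates; the file and column names are assumptions for illustration, not the library's actual data model:

import pandas as pd

# Hypothetical tabular exports from the TEI P5 authority files described above.
ownership = pd.read_csv("ownership_records.csv")  # farm_id, book_id, title, year
farms = pd.read_csv("farms.csv")                  # farm_id, farm_name, lat, lon

# Copies recorded per farm in the 1865 survey, ready to plot on a map of Iceland.
copies_per_farm = (
    ownership.merge(farms, on="farm_id")
    .groupby(["farm_id", "farm_name", "lat", "lon"])
    .size()
    .reset_index(name="copies")
)
print(copies_per_farm.sort_values("copies", ascending=False).head())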
11:45am - 12:00pm
Distinguished Short Paper (10+5min) [publication ready] Icelandic Scribes: Results of a 2-Year Project University of Copenhagen This paper contributes to the conference theme of History and introduces an online catalogue that recreates an early modern library: the main digital output of the author's individual research project “Icelandic Scribes” (2016–2018 at the University of Copenhagen). The project has investigated the patronage of manuscripts by the Icelander Magnús Jónsson í Vigur (1637–1702), his network of scribes and their working practices, and the significance of the library of hand-written books that he accumulated during his lifetime in the region of Iceland called the Westfjords. The online catalogue is meant to be a digital resource that reunites this library virtually, gives detailed descriptions of the manuscripts, and highlights the collection's rich store of texts and the individuals behind their creation. The paper also explores some of the challenges of integrating new data produced by this and other small projects like it with existing online resources in the field of Old Norse-Icelandic studies. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 654825.
11:00am - 12:00pm | F-P674-1: Teaching and Learning the Digital. Session Chair: Maija Paavolainen | P674
11:00am - 11:15am
Short Paper (10+5min) [publication ready] Creative Coding at the arts and crafts school Robotti (Käsityökoulu Robotti) Aalto University, School of Arts, Design and Architecture The increasing use of digital technologies presents a new set of challenges, not only from economic and societal viewpoints but also in education and culture. On the other hand, rather than a challenge, the digitalization of our environment can also be seen as a new material and a new medium for art and art education. This article suggests that both a better understanding of digital structures and a greater capacity for self-expression through digital technology are possible using creative coding as a teaching method. The article focuses on Käsityökoulu Robotti (www.kasityokoulurobotti.fi), a type of hackerspace that offers children teaching about art and technology. Käsityökoulu Robotti is situated within the contexts of art education, the maker movement, critical technology education, and media art. Art education is essential to Käsityökoulu Robotti in a bilateral sense: to discover in what ways art can be used to create a clearer understanding of technology and, at the same time, to teach children how to use new technological tools as a way to greater self-expression. These questions are indeed intertwined, as digital technology, like code, can be a substantial way to express oneself in ways that otherwise could not be expressed. Further, artistic approaches such as creative coding can generate more tangible knowledge of digital technology. A deeper understanding of digital technology is also critical when dealing with the ever-increasing digitalization of our society, as it helps us to understand the digital structures that underlie our continually expanding digital world. This article examines how creative coding works as a teaching method in Käsityökoulu Robotti to promote both artistic expression and a critical understanding of technology. Further still, creative coding is a tool for bridging the gap between the maker movement, critical thinking and art practices, and for bringing each into sharper focus. This discussion is the outcome of an ethnographic research project at Käsityökoulu Robotti.

11:15am - 11:30am
Distinguished Short Paper (10+5min) [abstract] A long way? Introducing digitized historic newspapers in school, a case study from Finland University of Helsinki During 2016/17 two Finnish newspapers, from their first issue to their last, were made available to schools in eastern Finland through the digital collections of the National Library of Finland (http://digi.kansalliskirjasto.fi). This paper presents the case study of one upper-secondary class making use of these materials. Before having access to these newspapers, the teachers in the school in question had little awareness of what this digital library contained. The initial research questions of this paper are whether digitised historic newspapers can be used by school communities, and what practices they enable. Subsequently, the paper explores how these practices relate to teachers' habits and to the wider concept of literacy, that is, the knowledge and skills students can acquire using these materials. To examine the significance of historic newspapers in the context of their use today, I rely on the concept of ‘practice’ defined by cultural theorist Andreas Reckwitz as the “use of things that ‘mould’ activities, understandings and knowledge”. To properly assess practice, I approached this research through ethnographic methods, constructing the inquiry with participants in the research: teachers, students and the people involved in facilitating the materials. During 2016, I conducted eight in-depth interviews with teachers about their habits, organized a focus group with a further 15 teachers to brainstorm activities using historic newspapers, and collaborated closely with one language and literature teacher, who implemented the materials in her class right away. Observing her students work and hearing their presentations, motivations, and opinions about the materials showed how students explored the historical background of their existing personal, school-related and even professional interests. In addition to the students' projects, I also collected their newspaper clippings and logs of their searches in the digital library. These digital research assets revealed how the digital library that contains the historic newspapers influenced the students' freedom to choose a topic to investigate and their capacity to ‘go deep’ in their research. The findings of this case study build upon, and extend, previous research about how digitized historical sources contribute to upper-secondary education. The way students used historical newspapers revealed similarities with activities involving contemporary newspapers, as described by the teachers who participated in this study. Additionally, both the historicity and the form of presentation of newspapers in a digital library confer unique attributes upon these materials: they allow students to explore the historical background of their research interests, discover change across time, verbalize their research ideas in a concrete manner, and train their skills in distant and close reading to manage large amounts of digital content. In addition to these positive attributes that connect with learning goals set by teachers, students also tested the limits of these materials.
The lack of metadata for articles or images, the absence of colour in materials that originally had it, and the need for students to be mindful of how language has changed since the publication of the newspapers are constraints that distinguish digital libraries from resources, such as web browsers and news sites, that are more familiar to students. Being aware of these positive and negative affordances, common to digital libraries containing historic newspapers and other historical sources, can support teachers in providing their students with effective guidelines for using these kinds of materials. This use case demonstrates that digitized historical sources in education can do more than simply enable students to “follow the steps of contemporary historians”, as research has previously established. These materials could also occupy a place between history and media education. Media education in school, regardless of the technological underpinnings of a single medium, which change rapidly in this digital age, aims at enabling students to reflect on the processes of media consumption and production. The contribution of digitized historical newspapers to this subject is acquainting students with processes of media preservation and heritage. However, it could still be a long way until teachers adopt these aspects in their plans. It is necessary to acknowledge the trajectory of, and the agents involved in, the work of introducing newspapers in education since the 1960s. This task consisted not only of facilitating access to newspapers, but also of developing teaching plans and advocating for a common understanding and presence of media education in schools. In addition to uncovering an aspect of digital cultural heritage that is relevant for the school community today, another aim of this paper is to raise awareness among the cultural heritage community, especially national libraries, of the diversity in the uses and users of their collections, especially at a time when the large-scale digitization of special collections is opening up access to materials traditionally regarded as being for academic research.

Selected bibliography:

Buckingham, D. (2003). Media Education: Literacy, Learning, and Contemporary Culture. Polity Press.
Gooding, P. (2016). Historic Newspapers in the Digital Age: ‘Search All About It!’ Routledge.
Lévesque, S. (2006). Discovering the Past: Engaging Canadian Students in Digital History. Canadian Social Studies, 40(1).
Martens, H. (2010). Evaluating Media Literacy Education: Concepts, Theories and Future Directions. Journal of Media Literacy Education, 2(1).
Nygren, T. (2015). Students Writing History Using Traditional and Digital Archives. Human IT, 12(3), 78–116.
Reckwitz, A. (2002). Toward a Theory of Social Practices: A Development in Culturalist Theorizing. European Journal of Social Theory, 5(2), 243–263.

11:30am - 11:45am
Short Paper (10+5min) [abstract] “See me! Not my gender, race, or social class”: Combating stereotyping and prejudice by mixing digitally manipulated experience with classroom debriefing 1Department of Language Studies, Umeå University, Sweden; 2School of Humanities, Education and Social Sciences, Örebro University, Sweden; 3Humlab, Umeå University, Sweden

INTRODUCTION

Not only does stereotyping based on various social categories, such as age, social class, ethnicity, sexuality, regional affiliation, and gender, serve to simplify how we perceive and process information about individuals (Talbot et al. 2003: 468), it also builds up expectations about how we act. If we recognise social identity as an ongoing construct, something that is renegotiated in every meeting between humans (Crawford 1995), it is reasonable to assume that stereotypic expectations will affect the choices we make when interacting with another individual. Thus, stereotyping may form the basis for the negotiation of social identity on the micro level. For example, research has shown that white American respondents react with hostile facial expressions or tone of voice when confronted with African American faces, which is likely to elicit the same behaviour in response; but, as Bargh et al. point out (1996: 242), “because one is not aware of one's own role in provoking it, one may attribute it to the stereotyped group member (and, hence, the group)”. Language is a key element in this process. An awareness of such phenomena, and of how we may unknowingly be affected by them, is, we would argue, essential for all professions where human interaction is in focus (psychologists, teachers, social workers, health workers, etc.). RAVE (Raising Awareness through Virtual Experiencing), funded by the Swedish Research Council, aims to explore and develop innovative pedagogical methods for raising subjects' awareness of their own linguistic stereotyping, biases and prejudices, and to systematically explore ways of testing the efficiency of these methods. The main approach is the use of digital matched-guise testing techniques, with the ultimate goal of creating an online, packaged and battle-tested method available for public use. We are confident that there is a place for this, in our view, timely product. There can be little doubt that the zeitgeist of the 21st century's first two decades has swung the pendulum in a direction where it has become apparent that the role of the Humanities should be central. In times when unscrupulous politicians take every chance to draw on prejudice and stereotypical assumptions about Others, be they related to gender, ethnicity or sexuality, it is the role of the Humanities to hold up a mirror and let us see ourselves for what we are. This is precisely the aim of the RAVE project. In line with this thinking, open access to our materials and methods is of primary importance. Our ambition is not only to provide tested sample cases for open-access use, but also to provide clear directives on how these have been produced so that new cases, based on our methods, can be created. This includes clear guidelines as to what important criteria need to be taken into account when doing so, so that our methodology is disseminated openly and in such a fashion that it becomes adaptable to new contexts.

METHOD

The RAVE method at its core relies on a treatment session where two groups of test subjects (i.e. students) are each exposed to one of two different versions of the same scripted dialogue.
The two versions differ only with respect to the perceived gender of the characters, whereas scripted properties remain constant. In one version, for example, one participant, “Terry”, may sound like a man, while in the other recording this character has been manipulated for pitch and timbre to sound like a woman. After the exposure, the subjects are presented with a survey where they are asked to respond to questions related to the linguistic behaviour and character traits of one of the interlocutors. The responses of the two sub-groups are then compared and followed up in a debriefing session, where issues such as stereotypical effects are discussed. The two property-bent versions are based on a single recording, and the switch of the property (for instance, gender) is done using the digital methods described below. The reason for this procedure is to minimize the number of uncontrolled variables that could affect the outcome of the experiment. It is a very difficult, if not impossible, task to transform the identity-related aspects of a voice recording, such as gender or accent, while maintaining a “perfect” and natural voice: a voice that is opposite in the specific aspect but equivalent in all other aspects, without changing other properties in the process or introducing artificial artifacts. Accordingly, the RAVE method does not strive for perfection, but focuses on achieving a perceived credibility of the scripted dialogue. However, the base recording is produced at high quality to provide the best possible conditions for the digital manipulation. For instance, the dialogue between the two speakers is recorded on separate tracks so as to keep the voices isolated. The digital manipulation is done with the Praat software (Boersma & Weenink, 2013). Formants, pitch range and pitch median are manipulated for gender switching using standard offsets and are then adapted to the individual characteristics of the voices. Several versions of the manipulated dialogues are produced and evaluated by a test group via an online survey. Based on the survey results, the version with the highest quality is selected. This manipulated dialogue needs further framing to reach a sufficient level of credibility. The way the dialogue is framed for the specific target context, how it is packaged and introduced, is of critical importance. Various techniques, for instance the use of audiovisual cues, are used to distract the test subject from the “artificial feeling”, as well as to reinforce the desired target property. We add various kinds of distractions, both auditory and visual, which lessen the listeners' focus on the current speaker, such as background voices simulating the dialogue taking place in a cafe, traffic noise, or scrambling techniques simulating, for instance, a low-quality phone or Skype call. On this account, the RAVE method includes a procedure to evaluate the overall (perceived) quality and credibility of a specific case setup. This evaluation is implemented by exposing a number of pre-test subjects to the packaged dialogue (in a set-up comparable to the target context). After the exposure, the pre-test subjects respond to a survey designed to measure the combined impression of aspects such as the scripted dialogue, the selected narrators, the voices, the overall set-up, the contextual framing, etc. The produced dialogues and accompanying response surveys are turned into a single online package using the program Storyline.
The single entry point to the package makes the process of collecting anonymous participant responses more fail-safe and easier to carry out. The whole package is produced for a “bring your own device” set-up, where the participants use their own smartphones, tablets or laptops to take part in the experiment. These choices, an online single point of entry adapted to various kinds of devices, were made to facilitate experiment participation and the recording of results. The results from the experiment are then collected by the teacher and discussed with the students at an ensuing debriefing seminar.

FINDINGS

At this stage, we have conducted experiments using the RAVE method with different groups of respondents, ranging from teacher trainees, psychology students and students of sociology to active teachers and the public at large, in Sweden and elsewhere. Since the experiments have been carried out in other cultural contexts (in the Seychelles, in particular), we have obtained results that enable cross-cultural comparisons. All trials conducted addressing gender stereotyping have supported our hypothesis that linguistic stereotyping acts as a filter. In trials conducted with teacher trainees in Sweden (n = 61), we could show that respondents who listened to the male guise overestimated stereotypically masculine conversational features such as how often the speaker interrupted, how much floor space ‘he’ occupied, and how often ‘he’ contradicted his counterpart. On the other hand, features such as signalling interest and being sympathetic were overestimated by the respondents listening to the female guise. Results from the Seychelles have strengthened our hypothesis. Surveys investigating linguistic features associated with gender showed that the respondents' (n = 46) linguistic gender stereotyping was quite different from that of Swedish respondents. For example, the results from the Seychelles trials showed that floor space and the number of interruptions made were overestimated by the respondents listening to the female guise, quite unlike the Swedish respondents, but still in line with our hypothesis. Trials with psychology students (n = 101) had similar results. In experiments where students were asked to rate a case character's (‘Kim’) personality traits and social behaviour, our findings show that the male version of Kim was deemed more unfriendly and a bit careless compared to the female version of Kim, who was regarded as more friendly and careful. Again, this shows that respondents overestimate aspects that confirm their stereotypic preconceptions.

PEDAGOGY

The underlying pedagogical idea of the set-up is to confront students and other participants with their own stereotypical assumptions. In our experience, discussing stereotypes with psychology and teacher-training students does not give rise to the degree of self-reflection we would like. This is what we wanted to remedy. With the method described here, where the dialogues are identical except for the manipulation of pitch and timbre, perceived differences in personality and social behaviour can only be explained as residing in the beholder. A debriefing seminar after the exposure gave the students an opportunity to reflect on the results of the experiment. They were divided into mixed groups where half the students had listened and responded to the male guise, and the other half to the female guise.
Since any difference between the groups was the result of the participants' ratings, their own reactions to the conversations, there was something very concrete and urgent to discuss. Thus, the experiment affected engagement positively. Clearly, the concrete and experiential nature of this method made the students analyze the topic, their own answers, the reasons for these and, ultimately, themselves in greater detail and depth in order to understand the results of the experiment and relate them to earlier research findings. Judging from these impressions, the method is clearly very effective. Answers from a survey with psychology students (n = 101) after the debriefing corroborate this impression. In response to the question “What was your general experience of the experiment that you have just partaken in? Did you learn anything new?”, a clear majority of the students, 76%, responded positively. Moreover, close to half of these answers explicitly expressed self-reflective learning. Of the remaining comments, 15% were neutral and 9% expressed critical feedback. Examples of responses expressing self-reflection include: “… It gave me food for thought. Even though I believed myself to be relatively free of prejudice I can't help but wonder if I make assumptions about personalities merely from the tone of someone's voice.” And: “I learned some of my own preconceptions and prejudices that I didn't know I had.” An example of a positive comment with no self-reflective element is: “Female and male stereotypes were stronger than I expected, even if only influenced by the voice”. The number of negative comments was small. They generally took the position that the results were expected so there was nothing to discuss, or that the student had figured out the set-up from the beginning. A few negative comments revealed that the political dimension of the subject of gender could influence responses. These students would probably react in the same way to a traditional seminar. We haven't been able to reach everyone … yet ...
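As a minimal sketch of the pitch and formant manipulation described under METHOD, assuming the parselmouth Python interface to Praat (the abstract names only Praat itself); all numeric settings are illustrative starting points to be tuned per voice, as the authors note:

import parselmouth
from parselmouth.praat import call

# Hypothetical input: one speaker's isolated track from the base recording.
sound = parselmouth.Sound("terry_male_track.wav")

# Praat's built-in "Change gender" shifts formants and resets the pitch
# median and range in one resynthesis step. Arguments: pitch floor (Hz),
# pitch ceiling (Hz), formant shift ratio, new pitch median (Hz),
# pitch range factor, duration factor. These values are an illustrative
# male-to-female starting point only.
female_guise = call(sound, "Change gender", 75, 600, 1.2, 180, 1.0, 1.0)
female_guise.save("terry_female_track.wav", "WAV")

Bundling the formant shift and pitch re-specification into one command matches the abstract's description of manipulating formants, pitch range and pitch median together before adapting them to the individual voice.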
11:45am - 12:00pm
Short Paper (10+5min) [abstract] Digital archives and the learning processes of performance art University of Helsinki In this presentation, the process of learning performance art is articulated through the contextual change that digital archives have caused since the early 1990s. It is part of my postdoctoral research, artistic research on the conjunctions between divergent gestures of thought and performance, conducted in the research project How to Do Things with Performance?, funded by the Academy of Finland. Since performance art is a form of ‘live art’, it would be easy to assume that its learning processes are mostly based on physical practice and repetition. However, in my view, performance art is a significant line of flight from the conceptual art of the 1960s and 70s, alongside video art. The pedagogy of performance art has therefore been tightly connected with the development of media, from the collective use of Portapak video cameras to the recent development of VR-attributed performances, choreographic archive methods by figures such as William Forsythe, and digital journals of artistic research such as Ruukku and the Journal for Artistic Research (JAR). This presentation will speculate on the transformation of performance art practices now that a vast amount of historical archive material has become accessible to artists, regardless of the physical location of a student or an artist. At the same time, social media affects the peer groups of artists. My point of view is not based on statistics, but on observations I have gathered from teaching performance art and from supervising MA- and PhD-level research projects. The argument is that learning in performative practices is not based on talent but is general and generic, with access to networks and digital archives serving as a tool for a social form of organization, or for speculation on what performance art is. Finally, my argument is that digital virtuality does not coincide with the concept of the virtual. Here my argument leans on the philosophical thought on actualization and the virtual of Henri Bergson, Gilles Deleuze and Alexander R. Galloway. Access to digital archives in the learning processes is rather based on the premise that artistic practices are already, explicitly, actualizations of the virtual; digitalization is one modality of this process. The learning process of performance art is not done through resemblance, but by doing with someone or something else, developed in heterogeneity with digital virtualities.
11:00am - 12:00pm | F-TC-1: Data, Activism and Transgression. Session Chair: Marianne Ping Huang | Think Corner
11:00am - 11:30am
Long Paper (20+10min) [abstract] Shaping data futures: Towards non-data-centric data activism 1Consumer Society Research Centre, University of Helsinki, Finland; 2HIIT, Aalto University The social science debate that attends to the exploitative forces of the quantification of aspects of life previously experienced in qualitative form, recognising the ubiquitous forms of datafied power and domination, is by now an established perspective from which to question datafication and algorithmic control (Ruckenstein and Schüll, 2017). Drawing on critical political economy and neo-Foucauldian analyses, researchers have explored the effects of datafication (Mayer-Schönberger and Cukier, 2013; Van Dijck, 2014) on the economy, public life, and self-understanding. Studies alert us to threats to privacy posed by “dataveillance” (Raley, 2013; Van Dijck, 2014), forms of surveillance distributed across multiple interested parties, including government agencies, insurance payers, operators, data aggregators, analytics companies, and individuals who provide the information either knowingly or unintentionally when going online, using self-tracking devices, loyalty programs, and credit cards. These “data traces” add to the data accumulated in databases, and personal data – any data related to a person or resulting from actions by a person – is utilized for business and societal purposes in an increasingly systematic manner (Van Dijck and Poell, 2016; Zuboff, 2015). In this paper, we take an “activist stance”, aiming to contribute to the current criticism of datafication with the more participatory and collaborative approach offered by “data activism” (Baack, 2015; Milan and van der Velden, 2016), and by the civic and political engagement spurred by datafication. The various data-driven initiatives currently under development suggest that the problematic aspects of datafication, including the tension between data openness and data ownership (Neff, 2013), the asymmetries of data usage and distribution (Wilbanks and Topol, 2016; Kish and Topol, 2015), and the inadequacy of existing informed consent and privacy protections (Sharon, 2016), are by now not only well recognized but are generating new forms of civic and political engagement and activism. This calls for more debate on what these new forms of data activism are and how scholars in the humanities and social science communities can assess them. Relying on approaches developed within the field of Techno-Anthropology (Børsen and Botin, 2013; Ruckenstein and Pantzar, 2015), which seeks to translate and mediate knowledge concerning complex technoscientific projects and aims, we positioned ourselves as “outside insiders” with regard to a data-centric initiative called MyData. In 2014, we became observers of and participants in MyData, which promotes the understanding that people benefit when they can control data gathering and analysis by public organizations and businesses, and become more active data citizens and consumers. The high-level MyData vision, described in ‘the MyData white paper’ written primarily by researchers at the Helsinki Institute for Information Technology and the Tampere University of Technology (Poikola et al., 2015), outlines an alternative future that transforms the ’organisation-centric system‘ into ’a human-centric system‘ that treats personal data as a resource that the individual can access, control, benefit from, and learn from.
The paper discusses “our” data activism and the activism of technology developers, promoting and relying on two different kinds of “social imaginaries” (Taylor, 2004). By doing so, we open a perspective on data activism that highlights the ideological and political underpinnings of contested social imaginaries and aims. Current data-driven initiatives tend to proceed with a social imaginary that treats data arrangements as solutions, or corrective measures addressing unsatisfactory developments. They advance the logic of an innovation culture, relying on the development of new technology structures and computationally intensive tools. This means that the data-driven initiatives rely on an engineering attitude that does not question the power of technological innovation for creating better societal solutions or, more broadly, the role of datafication in societal development. The main focus is on the correct positioning of technology: undesirable or harmful developments need to be reversed, or redirected towards ethically fairer and more responsible practices. Since we do not possess impressive technology skills, or proficiency in legal and regulatory matters, which would have aligned us with the innovation-driven data activism, our position in the technology-driven data activism scene is structurally fairly weak. Our data activism is informed by a sensitivity to questions of cultural change and the critical stance representative of social scientific inquiry, questioning the optimistic and future-oriented social imaginary of technology developers. As will be discussed in our presentation, this means that our data activism is incompatible with that of the technology developers in a profound sense, which explains why our activist role was repeatedly reduced to viewing a stream of diagrams on PowerPoint slides depicting databases and data flows. In terms of designing future data transfers and data flows, our social imaginary remained oddly irrelevant, intensifying the feeling that we were observing a moving target and that our task was simply to keep up, while the engineers were busy doing the real work of activists: developing approaches that give users more control over their personal data, such as the Kantara Initiative's User-Managed Access (UMA) protocol, experimenting with blockchain technologies for digital identities such as Sovrin, and learning about “Vendor Relationship Management” systems (see Belli et al., 2017). From the outsider position, we started to craft a narrative about the MyData initiative that aligns with our social imaginary. We wanted to push the conversation further, beyond the usual technological, legal and policy frameworks, and to suggest that with its techno-optimism the current MyData work might actually weaken data activism and public support for it. We turned to literary and scholarly sources with the aim of opening a critical, but hopefully also productive, conversation about MyData in order to offer ideas on how to promote socially more robust data activism. A seminal text that shares the aims of the MyData initiative is Autonomous Technology – Technics-out-of-Control as a Theme in Political Thought (1978) by Langdon Winner. Winner perceives the relationship between human and technology in terms of Kantian autonomy: via analysis of the interrelations of independence and dependence. The core ideas of the MyData vision resonate in particular with the way Winner (1978) considers “reverse adaptation”, wherein the human adapts to the power of the system and not the other way around.
In this paper, we first describe the MyData vision as it has been presented by the activists, and situate it in the framework of technology critique and current critiques of digital culture and economy. Here, we demonstrate that the outside position can, in fact, serve as a resource for re-articulating data activism. After this, we detail some further developments in the MyData scene and the possibilities for dialogue and collaboration that have opened up during our data activism journey. We end the discussion by noting that truly promoting societally beneficial data arrangements requires work to circumvent the individualistic and data-centric biases of initiatives such as MyData. We promote non-data-centric data activism that meshes critical thinking into the mundane realities of everyday practices and calls for historically informed and collectively oriented alternatives and action. Overall, our goal is to demonstrate that with a focus on ordinary people, professionals and communities of practice, ethnographic methods and practice-based analysis can deepen understandings of datafication by revealing how data and its technologies are taken up, valued, enacted, and sometimes repurposed in ways that either do not comply with imposed data regimes, or mobilize data in inventive ways (Nafus & Sherman, 2014). By learning about everyday data worlds and actual material data practices, we can strengthen the understanding of how data technologies could become part of promoting and enacting more responsible data futures. Paradoxically, in order to arrive at an understanding of how data initiatives support societally beneficial developments, non-data-centric data activism is called for. By aiming at non-data-centric data activism, we can continue to argue against triumphant data stories and technological solutionism in ways that are critical, but do not deny the possible value of digital data in future making. We will not try to protect ourselves against data forces but act imaginatively with and within them to develop new concepts, frameworks and collaborations in order to better steer them.
References
Baack, S. (2015). Datafication and empowerment: How the open data movement re-articulates notions of democracy, participation, and journalism. Big Data & Society, October.
Belli, L., Schwartz, M., & Louzada, L. (2017). Selling your soul while negotiating the conditions: From notice and consent to data control by design. Health and Technology, 1–15.
Børsen, T., & Botin, L. (Eds.) (2013). What Is Techno-Anthropology? Aalborg: Aalborg University Press.
Kish, L. J., & Topol, E. J. (2015). Unpatients: Why patients should own their medical data. Nature Biotechnology, 33(9), 921–924.
Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Boston: Houghton Mifflin Harcourt.
McQuillan, D. (2016). Algorithmic paranoia and the convivial alternative. Big Data & Society, 3(2).
McStay, A. (2013). Privacy and Philosophy: New Media and Affective Protocol. New York: Peter Lang.
Milan, S., & van der Velden, L. (2016). The alternative epistemologies of data activism. Digital Culture & Society, 2(2), 57–74.
Nafus, D., & Sherman, J. (2014). This one does not go up to 11: The Quantified Self movement as an alternative big data practice. International Journal of Communication, 8, 1784–1794.
Poikola, A., Kuikkaniemi, K., & Kuittinen, O. (2014). My Data – Johdatus ihmiskeskeiseen henkilötiedon hyödyntämiseen [‘My Data – Introduction to Human-Centred Utilisation of Personal Data’]. Helsinki: Finnish Ministry of Transport and Communications.
Poikola, A., Kuikkaniemi, K., & Honko, H. (2015). MyData – A Nordic Model for Human-Centered Personal Data Management and Processing. Helsinki: Finnish Ministry of Transport and Communications.
Raley, R. (2013). Dataveillance and countervailance. In L. Gitelman (Ed.), “Raw Data” Is an Oxymoron. Cambridge, MA: MIT Press.
Ruckenstein, M., & Pantzar, M. (2015). Datafied life: Techno-anthropology as a site for exploration and experimentation. Techné: Research in Philosophy & Technology, 19(2), 191–210.
Ruckenstein, M., & Schüll, N. D. (2017). The datafication of health. Annual Review of Anthropology, 46.
Sharon, T. (2016). Self-tracking for health and the Quantified Self: Re-articulating autonomy, solidarity, and authenticity in an age of personalized healthcare. Philosophy & Technology, 1–29.
Taylor, C. (2004). Modern Social Imaginaries. Durham, NC: Duke University Press.
Van Dijck, J. (2014). Datafication, dataism and dataveillance: Big data between scientific paradigm and ideology. Surveillance & Society, 12(2), 197–208.
Van Dijck, J., & Poell, T. (2016). Understanding the promises and premises of online health platforms. Big Data & Society, 3(1), 1–11.
Wilbanks, J. T., & Topol, E. J. (2016). Stop the privatization of health data. Nature, 535, 345–348.
Winner, L. (1978). Autonomous Technology: Technics-out-of-Control as a Theme in Political Thought. Cambridge, MA: The MIT Press.
Zuboff, S. (2015). Big Other: Surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30, 75–89.
11:30am - 11:45am
Short Paper (10+5min) [publication ready] Digitalisation of Consumption and Digital Humanities – Development Trajectories and Challenges for the Future University of Helsinki, Ruralia Institute Digitalisation transforms practically all areas of modern life: everything that can be digitalised, will be. Everyday routines and consumption practices, in particular, are under continual change, and new digital products and services are introduced at an accelerating pace. The purpose of this article is twofold: first, to explore the influence of digitalisation on consumption, and second, to canvass the reasons for these digitalisation-driven transformations and their possible future progressions. The transformations are explored through recent consumer studies, and the future development is based on interpretations of digitalisation. Our article shows that the digitalisation of consumption has resulted in new forms of e-commerce, changing consumer roles and digital virtual consumption. The reasons for these changes and the expected near-future progressions are based on assumptions drawn from data-driven, platform-based and disruption-generated visions. The challenges of combining consumption research and the digital humanities approach are discussed in the concluding section of the article. 11:45am - 12:00pm
Short Paper (10+5min) [abstract] It’s your data, but my algorithms Aalto University, School of Arts, Design and Architecture, The world is increasingly digital, but the understanding of how the digital affects everyday life is still often confused. Digitalisation is sometimes optimistically thought of as a rescue from hardships, be they economic or even educational. On the other hand, digitalisation is seen negatively as something one simply cannot avoid. Digital technologies have replaced many previous tools used in work as well as in leisure. Furthermore, digital technologies introduce an agency of their own into human processes, as noted by David Berry. By manipulating data through algorithms and communicating not only with humans but with other devices as well, digital technology presents new kinds of challenges for society and the individual. These digital systems and data flows get their instructions from the code that runs on them. The underlying code is itself neither objective nor value-free; it carries the biases and objectives of programmers, software companies and larger cultural viewpoints. As such, digital technology affects the ways we structure and comprehend, or are even able to comprehend, the world around us. This article looks at the surrounding digitality through an artistic research project. By using code not as a functional tool but, in a postmodern way, as a material for expression, the research focuses on how code as art can express the digital condition that might otherwise be difficult to put into words or comprehend in everyday life. The art project consists of a drawing robot controlled by an EEG headband that the visitor can wear. The headband allows the visitor to control the robot through the EEG readings taken by the headband. As such, the visitor might get a feeling of being able to control the robot, but at the same time the robot interprets the data through its algorithms and thus controls the visitor’s data. The aim of this research project is to give perspectives on the everydayness of digitality. It questions how we comprehend the digital in everyday life and asks how we should embody digitality in the future. The benefits of artistic research lie in the way it can broaden the conceptions of how we know and, as such, deepen one’s understanding of the complexities of the world. Furthermore, artistic research can expand meaning towards alternative interpretations of the research subjects. As such, this research project aims at once to deepen the discussion of digitalisation and to broaden it to alternative understandings. Alternative ways of seeing a phenomenon, like digitality, are essential to the ways the future is developed. The proposed research consists of both a theoretical text and the interactive artwork, which would be present at the conference. |
12:00pm - 12:45pm | Lunch + poster setup Think Corner |
12:45pm - 2:30pm | Poster Slam (lunch continues), Poster Exhibition & Coffee Session Chair: Annika Rockenberger |
Think Corner | |
|
Poster [abstract]
Sharing letters and art as digital cultural heritage, co-operation and basic research Svenska litteratursällskapet i Finland, Albert Edelfelts brev (edelfelt.fi) is a web publication developed at the Society of Swedish Literature in Finland. In co-operation with the Finnish National Gallery, we publish letters of the Finnish artist Albert Edelfelt (1854–1905) combined with pictures of his artworks. Albert Edelfelts brev received the 2016 State Award for dissemination of information. The co-operation between institutions and basic research on the material have enabled a unique reconstruction of Edelfelt’s artistry and his time, for the service of researchers and other users. I will present how we have done it and how we plan to develop the website further. The website Albert Edelfelts brev was launched in September 2014, with a sample of Edelfelt’s letters and paintings. Our intention is to publish all the letters Albert Edelfelt wrote to his mother Alexandra (1833–1901). The collection consists of 1 310 letters that range over 30 years and cover most of Edelfelt’s adult life. The letters are in the care of the Society of Swedish Literature in Finland. We also have at our disposal close to 7 000 pictures of Edelfelt’s paintings and sketches in the care of the Finnish National Gallery. In the context of digital humanities, the volume of the material at hand is manageable. However, for researchers who think that they might have use for the material, but are unsure of exactly where or what to look for, it might be labour-intensive to go through all the letters and pictures. We have combined professional expertise and basic research on the material with digital solutions to make it as easy as possible to access what the content can offer. As editor of the web publication, I spend a considerable part of my work on basic research: identifying people and pinpointing paintings and places that Edelfelt mentions in his letters. By linking the content of a letter to artworks, persons, places and subjects/reference words, users can easily navigate the material. Each letter, artwork and person has a page of its own. Places and subjects are also searchable and listed. The letters are available as facsimile pictures of the handwritten pages. Each letter has a permanent web resource identifier (URN:NBN). In order to make it easier for users to decide whether a letter is of interest, we have tagged subjects using reference words from ALLÄRS (a common thesaurus in Swedish). We have also written abstracts of the content, divided them into separate “events” and tagged mentioned artworks, people and places to these events. Each artwork by Edelfelt has a page of its own. Here, users find a picture of the artwork (if available) and earlier sketches of the artwork (if available). By looking at the pictures, they can see how the working process of the painting developed. Users can also follow the process through what Edelfelt writes in his letters. All the events from the letter abstracts that are tagged to a specific artwork are listed in chronological order on the artwork page. Persons tagged in the letter abstracts also have pages of their own. On a person page, users find basic facts and links to other webpages with information about the person. Any events from the letter abstracts mentioning the person are listed as well. In other words, through a one-click solution users can find an overview of everything Edelfelt’s letters have to say about a specific person.
Tagging persons to events has also made it possible to build graphs of a person’s social network, based on how many times other persons are tagged to the same events as the specific person. There is a link to these graphs on every person page. Apart from researchers who have a direct interest in the material, we have also wanted to open up the cultural heritage to a broader public and group of users. Each month the editorial staff writes a blog post on SLS-bloggen (http://www.sls.fi/sv/blogg). Albert Edelfelts brev also has a profile on Facebook (https://www.facebook.com/albertedelfeltsbrev/) where we post excerpts of letters on the same date as Edelfelt wrote the original letter. By doing so we hope to give the public an insight into the life of Edelfelt and the material, and involve them in the progress of the project. The web publication is open access. The mix of different sources and the co-operation with other heritage institutions has led to a mix of licences governing how users can copy and redistribute the published material. The Finnish National Gallery (FNG) owns the copyright on its pictures in the publication, and users have to get permission from the FNG to copy and redistribute that material. The artwork pages contain descriptions of the paintings written by the art historian Bertel Hintze, who published a catalogue of Edelfelt’s art in 1942. These texts are licensed under a Creative Commons Attribution-NoDerivs licence (CC BY-ND 4.0). Edelfelt’s letters, as well as the texts and metadata produced by the editorial staff at the Society of Swedish Literature in Finland, carry a Creative Commons CC0 1.0 Universal licence. Data with a Creative Commons licence is also freely available as open data through a REST API (http://edelfelt.sls.fi/apiinfo/). In the future, we would like to find a common practice for the user rights; if possible, such that all the material would have the same licence. We intend to invite other institutions holding artworks by Edelfelt to co-operate, offering the same kind of partnership as the web publication has with the Finnish National Gallery. Thus, we are striving towards as complete a site as possible for the artworks of Edelfelt. Albert Edelfelt is of national interest, and his letters, which he mostly wrote during his stays abroad, contain information of international interest. Therefore, we plan to offer the metadata and at least some of the source material in Finnish and English translations. So far, the letters are only available as facsimiles. The development of transcription programs for handwritten texts has made it probable that we could in the future include transcriptions of the letters in the web publication. Linguists especially have an interest in searchable letter transcriptions for their research, and the transcriptions would also be helpful for users who might have problems reading the handwritten text.
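As a minimal illustration of how the person-to-event tags described above can yield such social-network graphs (a hypothetical sketch, not the project’s actual implementation), co-occurrence counts over events translate directly into weighted edges for a graph tool:

```python
from collections import Counter
from itertools import combinations

# Hypothetical event -> set of tagged person IDs; in the edition these
# would come from the database records behind the person pages.
events = {
    "1876-04-02_letter_event1": {"p001", "p017", "p102"},
    "1876-04-09_letter_event1": {"p001", "p017"},
}

edges = Counter()
for persons in events.values():
    # Every pair of persons tagged to the same event gains edge weight 1.
    for a, b in combinations(sorted(persons), 2):
        edges[(a, b)] += 1

# Weighted edge list, e.g. for export to a network visualization tool.
print(edges.most_common())
```

Poster [abstract]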
Metadata Analysis and Text Reuse Detection: Reassessing public discourse in Finland through newspapers and journals 1771–1917 1University of Turku; 2University of Helsinki During the period 1771–1917 newspapers developed as a mass medium in the Grand Duchy of Finland. This happened within two different imperial configurations (Sweden until 1809 and Russia 1809–1917) and in two main languages – Swedish and Finnish. The Computational History and the Transformation of Public Discourse in Finland, 1640–1910 (COMHIS) project studies the transformation of public discourse in Europe and in Finland via an innovative combination of original data, state-of-the-art quantitative methods that have not previously been applied in this context, and an open source collaboration model. In this study the project combines the statistical analysis of newspaper metadata with the analysis of text reuse within the papers to trace the expansion of, and exchange between, Finnish newspapers published in the long nineteenth century. The analysis is based on the metadata and content of digitized Finnish newspapers published by the National Library of Finland. The dataset includes the full text of all newspapers and most periodicals published in Finland between 1771 and 1920. The analysis of metadata builds on data harmonization and enrichment, extracting information on columns, type sets, publication frequencies and circulation records from the full-text files or outside sources. Our analysis of text reuse is based on a modified version of the Basic Local Alignment Search Tool (BLAST) algorithm, which detects similar sequences and was initially developed for fast alignment of biomolecular sequences, such as DNA chains. We have further modified the algorithm in order to identify text reuse patterns. BLAST is robust to deviations in the text content, and as such is able to effectively circumvent errors or differences arising from optical character recognition (OCR). By relating metadata on publication places, language, number of issues, number of words, size of papers, and publishers, and comparing these to the existing scholarship on newspaper history and censorship, the study provides a more accurate bird’s-eye view of newspaper publishing in Finland after 1771. By pinpointing key moments in the development of journalism, the study suggests that while the discussions in public were inherently bilingual, the technological and journalistic developments advanced at different speeds in Swedish- and Finnish-language forums. It further assesses the development of the press in comparison with book production and periodicals, pointing towards a specialization of newspapers as a medium in the period after 1860. Of special interest is that the growth and specialization of the newspaper medium owed much to newspapers being established all over the country and thus becoming forums for local debates. The existence of a medium encompassing the whole country was crucial to the birth of a national imaginary. Yet the national public sphere was not without regional intellectual asymmetries. This study traces these asymmetries by analysing text reuse in the whole newspaper corpus. It shows which papers and which cities functioned as “senders” and “receivers” in the public discourse of this period. It is furthermore essential that newspapers and periodicals had several functions throughout the period, and the role of the public sphere cannot be taken for granted.
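As a toy illustration of the seeding-and-extension idea behind such alignment-based reuse detection (a simplified sketch, not the project’s modified BLAST), overlapping character n-grams act as anchor “words” that survive scattered OCR errors, and candidate pairs are then extended into local alignments:

```python
import difflib

def seeds(text: str, k: int = 4) -> set:
    """Overlapping character k-grams, analogous to BLAST's seed words.
    Articles sharing many seeds become candidate pairs even when OCR
    has corrupted some characters."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def longest_shared_passage(a: str, b: str, min_len: int = 30):
    """Extend a candidate pair into a shared passage; difflib is a crude
    stand-in here for BLAST-style extension and scoring."""
    m = difflib.SequenceMatcher(None, a, b).find_longest_match(
        0, len(a), 0, len(b))
    return a[m.a:m.a + m.size] if m.size >= min_len else None
```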
The analysis of text reuse further paints a picture of virality in newspaper publishing that was indicative of modern journalistic practices, but it also reveals the rapidly expanding capacity of the press. These findings can be further contrasted with other items commonly associated with the birth of modern journalism, such as publication frequency, page sizes and the typesetting of the papers. All algorithms, software, and the text reuse database will be made openly available online, and can be located through the project’s repositories (https://comhis.github.io/ and https://github.com/avjves/textreuse-blast). The results of the text reuse detection carried out with BLAST are stored in a database and will also be made available for exploration by other researchers. Poster [abstract]
Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840–1914 University of Turku, Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840–1914 (OcEx) is an international and interdisciplinary project, funded through the Digging into Data – Transatlantic Platform, that studies the global spread of news in nineteenth-century newspapers. The project combines digitized newspapers from Europe, the US, Mexico, Australia, New Zealand, and the British and Dutch colonies of that time all over the world. It examines patterns of information flow, the spread of text reuse, and global conceptual changes across national, cultural and linguistic boundaries in nineteenth-century newspapers. The project links the different newspaper corpora – scattered across different national libraries and collections, described with various kinds of metadata and printed in several languages – into one whole. The project proposes to present a poster at the Nordic Digital Humanities Conference 2018. The project started in June 2017, and the aim of the poster is to present its current status. The research group members come from Finland, the US, the Netherlands, Germany, Mexico, and the UK. OcEx’s participating institutions are Loughborough University, Northeastern University, North Carolina State University, Universität Stuttgart, Universidad Nacional Autónoma de México, University College London, University of Nebraska-Lincoln, University of Turku, and Utrecht University. The project’s 90 million newspaper pages come from Australia’s Trove Newspapers, the British Newspaper Archive, Chronicling America (US), Europeana Newspapers, Hemeroteca Nacional Digital de México, the National Library of Finland, the National Library of the Netherlands (KB), the National Library of Wales, New Zealand’s PapersPast, and a strategic collaboration with Cengage Publishing, one of the leading commercial custodians of digitized newspapers.
Objectives
Our team will hone computational tools – some developed in prior research by project partners, some novel – into a suite of openly available tools, data, and analyses that trace a broad range of language-related phenomena (including text reuse, translational shifts, and discursive changes). Analysing such parameters enables us to characterize “reception cultures,” “dissemination cultures,” and “reference cultures” in terms of asymmetrical flow patterns, or to analyse the relationships between reporting targeted at immigrant communities and their surrounding host countries. OcEx will leverage existing relationships and agreements between its teams and data providers to connect disparate digital newspaper collections, opening new questions about historical globalism and modeling consortial approaches to transnational newspaper research. OcEx will take up challenging questions of historical information flow, including:
1. Which stories spread between nations and how quickly?
2. Which texts were translated and resonated across languages?
3. How did textual copying (reprinting) operate internationally compared to conceptual copying (idea spread)?
4. How did the migration of texts facilitate the circulation of knowledge, ideas, and concepts, and how were these ideas transformed as they moved from one Atlantic context to another?
5. How did geopolitical realities (e.g. economic integration, technology, migration, geopolitical power) influence the directionality of these transnational exchanges?
6. How does reporting in immigrant and ethnic communities differ from reporting in surrounding host countries?
7. Does the national organization of digitized newspaper archives artificially foreclose globally-oriented research questions and outcomes?
Methodology
OcEx will develop a semantically interoperable knowledge structure, or ontology, for expressing thematic and textual connections among historical newspaper archives. Even with standards in place, digitization projects pursue differing approaches that pose challenges to integration or particular levels of analysis. In most, for instance, generic identification of items within newspapers has not been pursued. In order to build an ontology, this project will build on knowledge acquired by participating academic partners, such as the project TimeCapsule at Utrecht University, as well as analytical software that has been tested and used by team members, such as viral text analysis. OcEx does not aim to create a totalizing research infrastructure but rather to expose the conditions under which researchers can work across collections, helping to guide similar future projects seeking to bridge national collections. This ontology will be established through comparative investigations of phenomena illustrating textual links: reprinting and topic dissemination. We have divided the tasks into six work packages:
WP1: Management
➢ create an international network of researchers to discuss issues of using and accessing newspaper repository data and combine expertise toward better development and management of such data;
➢ assemble a project advisory board, consisting of representatives of public and private data custodians and other critical stakeholders.
WP2: Assessment of Data and Metadata
➢ investigate and develop classifier models of the visual features of newspaper content and genres;
➢ create a corpus of annotations on clusters/passages that records relationships among textual versions.
WP3: Creating a Networked Ontology for Research
➢ create an ontology of genres, forms, and elements of texts to support that annotation;
➢ select and develop best practices based on available technology (semantic web standard RDF, linked data, SKOS, XML markup standards such as TEI).
WP4: Textual Migration and Viral Texts
➢ analyze text reuse across archives using statistical language models to detect clusters of reprinted passages;
➢ perform analyses of aggregate information flows within and across countries, regions, and publications;
➢ develop adaptive visualization methods for the results.
WP5: Conceptual Migration and Translation Shifts
➢ perform scalable multilingual topic model inference across corpora to discern translations, shared topics, topic shifts, and concept drift within and across languages, using distributional analysis and (hierarchical) polylingual topic models;
➢ analyze the migration and translation of ideas over regional and linguistic borders;
➢ develop adaptive visualization methods for the results.
WP6: Tools of Delivery/Dissemination
➢ validate test results in scholarly contexts/test sessions at academic institutions;
➢ conduct analysis of the sensitivity of results to the availability of corpora in different languages and levels of access;
➢ share findings (data structures/availability/compatibility, user experiences) with institutional partners;
➢ package code, annotated data (where possible), and the ontology for public release.
Poster [abstract]
ArchiMob: A multidialectal corpus of Swiss German oral history interviews 1University of Helsinki, Department of Digital Humanities; 2University of Zurich, CorpusLab, URPP Language and Space Although dialect usage is prevalent in the German-speaking part of Switzerland, digital resources for dialectological and computational linguistic research are difficult to obtain. In this paper, we present a freely available corpus of spontaneous speech in various Swiss German dialects. It consists of transcriptions of video interviews with contemporary witnesses of the Second World War period in Switzerland. These recordings were produced by an association of Swiss historians called Archimob about 20 years ago. More than 500 informants, stemming from all linguistic regions of Switzerland (German, French and Italian) and representing both genders, different social backgrounds, and different political views, were interviewed. Each interview is 1 to 2 hours long. In collaboration with the University of Zurich, we have selected, processed and analyzed a subset of 43 interviews in different Swiss German dialects. The goal of this contribution is twofold. First, we describe how the documents were transcribed, segmented and aligned with the audio source, and how we make the data available on specifically adapted corpus query engines. We also provide an additional normalization layer in order to reduce the different types of variation (dialectal, speaker-specific and transcriber-specific) present in the transcriptions. We formalize normalization as a machine translation task, obtaining up to 90% accuracy (Scherrer & Ljubešić 2016). Second, we show through some examples how the ArchiMob resource can shed new light on research questions from digital humanities in general and dialectology and history in particular:
• Thanks to the normalization layer, dialect differences can be identified and compared with existing dialectological knowledge.
• Using language modelling, another technique borrowed from language technology, we can compute distances between texts (see the sketch after this list). These distance measures allow us to identify the dialect of unknown utterances (Zampieri et al. 2017), localize transcriber effects and obtain a generic picture of the Swiss German dialect landscape.
• Moving beyond the purely formal analysis of the transcriptions for dialectological purposes, we can apply methods such as collocation analysis to investigate the content of the interviews. By identifying the key concepts and events referred to in the interviews, we can assess how the different informants perceive and describe the same time period.
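As a rough sketch of such language-modelling distances (a toy model, not the authors’ implementation), character trigram statistics from a dialect sample can score how well an unknown utterance fits that dialect; the dialect whose model assigns the lowest cross-entropy wins:

```python
from collections import Counter
import math

def char_ngrams(text: str, n: int = 3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cross_entropy(sample: str, utterance: str, n: int = 3) -> float:
    """Average negative log-probability of the utterance's trigrams under
    add-one-smoothed counts from the dialect sample; lower = closer."""
    counts = Counter(char_ngrams(sample, n))
    total, vocab = sum(counts.values()), len(counts) + 1
    grams = char_ngrams(utterance, n)
    return -sum(math.log((counts[g] + 1) / (total + vocab))
                for g in grams) / max(len(grams), 1)
```

Poster [abstract]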
Serious gaming to support stakeholder participation and analysis in Nordic climate adaptation research 1Linköping University; 2Helsinki University
Introduction
While climate change adaptation research in the Nordic context has advanced significantly in recent years, we still lack a thorough discussion of maladaptation, i.e. the unintended negative outcomes that result from implemented adaptation measures. In order to identify and assess examples of maladaptation in the agricultural sector, we developed a novel methodology integrating visualization, participatory methods and serious gaming. This enables research and policy analysis of trade-offs between mitigation and adaptation options, as well as between alternative adaptation options, together with stakeholders in the agricultural sector. Stakeholders from the agricultural sector in Sweden and Finland have been engaged in the exploration of potential maladaptive outcomes of climate adaptation measures by means of a serious game on maladaptation in Nordic agriculture, and have discussed their relevance and related trade-offs.
The Game
The Maladaptation Game is designed as a single-player game. It is web-based and allows a moderator to collect the settings and results for each player involved in a session, store these for analysis, and display the results on a ‘moderator screen’. The game is designed for agricultural stakeholders in the Nordic countries, and requires some prior understanding of the challenges that climate change can impose on Nordic agriculture as well as the scope and function of adaptation measures to address these challenges. The gameplay consists of four challenges, each involving multiple steps. At the start of the game, the player is equipped with a limited number of coins, which decrease with each measure that is selected. As such, the player has to consider the implications in terms of risk and potential negative effects of a selected measure, as well as the costs of each of these measures. The player faces four different climate-related challenges – increased precipitation, drought, increased occurrence of pests and weeds, and a prolonged growing season – that are all relevant to Nordic agriculture. The player selects one challenge at a time. Each challenge has to be addressed, and once a challenge has been concluded, the player cannot return and revise the selection. When entering a challenge (e.g. precipitation), possible adaptation measures that can be taken to address this challenge in an agricultural context are displayed as illustrated cards on the game interface. Each card can be turned over to reveal more information, i.e. a descriptive text and the related costs. The player can explore all cards before selecting one. The selected adaptation measure then leads to a potential maladaptive outcome, which is again displayed as an illustrated card with an explanatory text on the back. The player has to decide whether to reject or accept this potential negative outcome. If the maladaptive outcome is rejected, the player returns to the previous view, where all adaptation measures for the current challenge are displayed, and can select another measure and again decide whether to accept or reject the potential negative outcome presented for it. In order to complete a challenge, one adaptation measure with its related negative outcome has to be accepted.
After completing a challenge, the player returns to the entry page, where, in addition to the overview of all challenges, a small scoreboard summarizes the selection made and displays the updated amount of coins as well as a score of maladaptation points. These points represent the negative maladaptation score for the selected measures and are a measure that the player does not know prior to making the decision. The game continues until selections have been made for all four challenges. At the end of the game, the player has an updated scoreboard with three main elements: the summary of the selections made for each challenge, the remaining number of coins, and the total sum of the negative maladaptation score. The scoreboards of all players involved in a session then appear on the moderator screen. This setup allows the individual player to compare his or her pathways and results with those of other players. The key feature of the game is hence the stimulation of discussions and reflections concerning adaptation measures and their potential negative outcomes, both with regard to adding knowledge about adaptation measures and their impact, and with regard to the threshold at which an outcome is considered maladaptive, i.e. what trade-offs are made within agricultural climate adaptation.
Preliminary conclusions from the visualization-supported gaming workshops
During autumn 2016, eight gaming workshops were held in Sweden and Finland. These workshops were designed as visualization-supported focus groups, allowing for some general reflections, but also individual interaction with the web-based game. Stakeholders included farmers, agricultural extension officers, and representatives of branch organizations as well as agricultural authorities at the national and regional level. Focus group discussions were recorded and transcribed in order to analyze the empirical results with a focus on agricultural adaptation and potential maladaptive outcomes. Preliminary conclusions from these workshops point towards several issues relating both to the content and to the functionality of the game. While, as a general conclusion, the stakeholders were able to get acquainted with the game quickly and interact without larger difficulties, a few individual participants were negative toward the general idea of engaging with a game to discuss these issues. The level of interactivity that the game allows, where players can test and explore before making a decision, enabled reflections and discussions also during the gameplay. Stakeholders frequently tested and returned to some of the possible choices before deciding on their final setting. Since the game demands the acceptance of a potential negative outcome, several stakeholders described their impression of the game as a choice between ‘plague or cholera’, i.e. between two evils. In terms of empirical results, the workshops generated a large number of issues regarding the definition of maladaptive outcomes and their thresholds in relation to contextual aspects, such as temporal and spatial scales, as well as reflections regarding the relevance and applicability of the proposed adaptation measures and negative outcomes. Poster [abstract]
Challenges in textual criticism and editorial transparency Svenska litteratursällskapet i Finland, Henry Parlands Skrifter (HPS) is a digital critical edition of the works and correspondence of the modernist author Henry Parland (1908–1930). The poster presents the strategies chosen for communicating the results of the process of textual criticism in a digital environment. How can we make the foundations for editorial decisions transparent and easily accessible to a reader? Textual criticism is, by one of several definitions, “the scientific study of a text with the intention of producing a reliable edition” (Nationalencyklopedin, “textkritik”; our translation). When possible, the texts of the HPS edition are based on original prints whose publication was initiated by the author during his lifetime. However, rendering a reliable text largely requires a return to original manuscripts, as only a fraction of Parland’s works were published before the author’s death at the age of 22 in 1930. Posthumous publications often lack reliability due to the editorial practices, and sometimes primarily aesthetic solutions to textual problems, of later editors. The main structure of the Parland digital edition is related to Zacharias Topelius Skrifter (topelius.sls.fi) and similar editions (e.g. grundtvigsværker.dk). However, the Parland edition has foregone the system of a – theoretically – unlimited number of columns in favour of only two fields for text: a field for the reading text, which holds a central position on the webpage, and a smaller, optional field containing, in different tabs, editorial commentary, facsimiles and transcriptions of manuscripts and original prints. The benefit of this approach is easier navigation. If readers wish to view several fields at once, they may do so by using several browser windows, which is explained in the user’s guide. The texts of the edition are transcribed in XML and encoded following the TEI (Text Encoding Initiative) Guidelines P5. Manuscripts, or original prints, and edited reading texts are rendered in different files (see further below). All manuscripts and original prints used in the edition are presented as high-resolution facsimiles. The reader thus has access to the different versions of the text in full, as a complement to the editorial commentary. Parland’s manuscripts often contain several layers of changes (additions, deletions, substitutions): those made by the author himself during the initial process of writing or during a later revision, and those made by posthumous editors selecting and preparing manuscripts for publication. The editor is thus required to analyse the manuscripts in order to include only changes made by the author in the text of the edition. The posthumous changes are included in the transcriptions of the manuscripts and encoded using the same TEI elements as the author’s changes, with the addition of attributes indicating the other hand and pen (@hand and @medium). In the digital edition these changes, as well as other posthumous markings and notes, are displayed in a separate colour. A tooltip displays the identity of the other hand. One of the benefits of this solution is transparency towards the reader through visualization of the editor’s interpretation of all sections of the manuscript. The use of standard TEI elements and attributes also facilitates possible use of the XML documents for purposes outside the edition.
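As an illustration of this layered encoding (the fragment and the hand sigla are invented for the example, not taken from the edition), posthumous changes marked with another @hand can be stripped programmatically to approximate the author’s version:

```python
from lxml import etree

# Invented TEI fragment: an authorial deletion/addition (#HP) and a
# posthumous editorial addition (#ed1950) marked with @hand and @medium.
xml = b"""<p xmlns="http://www.tei-c.org/ns/1.0">
  En dikt <del hand="#HP" medium="ink">om natten</del>
  <add hand="#HP" medium="ink">om dagen</add>
  <add hand="#ed1950" medium="pencil">utan titel</add>
</p>"""

root = etree.fromstring(xml)
# Remove every change not in the author's hand. (A fuller pipeline would
# also unwrap posthumous deletions, restoring the text they struck out.)
for el in root.xpath(".//*[@hand and @hand != '#HP']"):
    el.getparent().remove(el)
print(etree.tostring(root, pretty_print=True).decode())
```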
For the Parland project, there were also practical benefits concerning technical solutions and workflow in using mark-up that had already, though to a somewhat smaller extent, been used by the Zacharias Topelius edition. The downside of using the same elements for both authorial and posthumous changes is that the XML file will not very easily lend itself to a visualization of the author’s version. Although this would surely not be impossible with an appropriately designed stylesheet, we have deemed it more practical to keep manuscripts and edited reading texts in separate files. All posthumous interventions and associated mark-up are removed from the edited text, which has the added practical benefit of making the XML document more easily readable to a human editor. However, the information value of the separate files is more limited than that of a single file would be. The file with the edited text still contains the complete author’s version, according to the critical analysis of the editor. Editorial changes to the author’s text are grouped together with the original wording in the TEI element choice, and the changes are visualized in the digital edition. The changed section is highlighted and the original wording displayed in a tooltip. Thus, the combination of facsimile, transcription and edited text in the digital edition visualizes the editor’s source(s), interpretation and changes to the text.
Sources
Nationalencyklopedin, “textkritik”. http://www.ne.se/uppslagsverk/encyklopedi/lång/textkritik (accessed 2017-10-19). Poster [publication ready]
Digitizing the Icelandic-Danish Blöndal Dictionary The Árni Magnússon Institute for Icelandic Studies, Iceland, The Icelandic-Danish dictionary compiled by Sigfús Blöndal in the early 20th century is being digitized. It is the largest dictionary ever published in Icelandic, containing in total more than 150,000 entries. The digitization work started with a pilot project in 2016, resulting in a comprehensive plan for how to carry out the task. The paper describes the ongoing work and the methods and tools applied, as well as the aim and rationale of the project. We opted for OCR rather than the double-keying that has become common for similar projects. First results suggest the outcome is satisfactory, since the final version will be proofread in any case. The entries are annotated with XML entities, using a workbench built for the project. We apply automatic annotation for the most consistent entities, but other annotation is carried out manually. The data is then exported into a relational database, proofread and finally published. The publication date is set for spring 2020.
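A minimal sketch of the kind of automatic annotation that is feasible for highly consistent entities (the entry and label set here are invented, not the project’s actual scheme):

```python
import re

# Grammatical labels follow the headword in a predictable position, so a
# pattern can wrap the most consistent entities in XML before the rest of
# the entry is annotated manually in the workbench.
LABEL = r"(?:m|f|n|adj|adv|vb)\."
entry = "hestur m. Hest."  # invented OCR'd entry line

tagged = re.sub(rf"^(\S+)\s+({LABEL})",
                r"<headword>\1</headword> <pos>\2</pos>",
                entry)
print(tagged)  # <headword>hestur</headword> <pos>m.</pos> Hest.
```

Poster [abstract]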
Network visualization for historical corpus linguistics: externally-defined variables as node attributes University of Oslo, In my poster presentation, I will explore whether and how network visualization can benefit philological and historical-linguistic research. This will be done by examining the usability of network visualization for the study of early medieval Latin scribes’ language competences. Thus, the scope is mainly methodological, but the proposed methodological choices will be illustrated by applying them to a real data set. Four linguistic variables extracted corpus-linguistically from a treebank will be examined: spelling correctness, classical Latin prepositions, genitive plural form, and the <ae> diphthong. All four are continuous, which is typical of linguistic variables. The variables represent different domains of language competence of scribes who, by that time, learnt written Latin practically as a second language. More linguistic features will be included in the analysis if my ongoing project proceeds as planned. The primary objective of the study is thus to find out whether the network visualization approach has demonstrable advantages over ordinary cross-tabulations as far as support for philological and historical-linguistic argumentation is concerned. The main means of visualization will be the gradient colour palette in Gephi, a widely used open-source network analysis and visualization software package. As an inevitable part of the described enterprise, it is necessary to clarify the scientific premises for using a network environment to display externally-defined values of linguistic variables. It is obvious that in order to be utilized for research purposes, network visualization must be as objective and replicable as possible. By way of definition, I emphasize that the proposed study will not deal with linguistic networks proper, i.e. networks which are directly induced or synthesized from a linguistic data set and represent abstract relations between linguistic units. Consequently, no network metric will be calculated, even though that might be interesting as such. What will be visualized are the distributions of linguistic variables that do not arise from the network itself, but are derived externally from a medium-sized treebank by exploiting its lemmatic, morphological, and, hopefully, also syntactic annotation layers. These linguistic variables will be visualized as attributes of the nodes in the trimodal “social” network which consists of the documents, persons, and places that underlie the treebank. These documents, persons, and places are encoded as metadata in the treebank. The nodes are connected to each other by unweighted edges. The number of document nodes is 1,040, scribe nodes 220, and writing place nodes 84. In most cases, the definition of the 220 scribe nodes is straightforward, given that the scribes scrupulously signed what they wrote, with the exception of eight documents. The place nodes are more challenging. Although 78% of the documents were written in the city of Lucca, the disambiguation and re-grouping of small localities of which little is known was time-consuming, and the results were not always fully satisfying. The nodes will be set on the map background by utilizing Gephi’s Geo Layout and Force Atlas 2 algorithms. The linguistic features that will be visualized reflect the language change that took place in late Latin and early medieval Latin, roughly the 3rd to 9th centuries AD.
The features are operationalized as variables which quantify the variation of those features in the treebank. This quantification is based on the numerical output of a plethora of corpus-linguistic queries which extract from the treebank all constructions or forms that meet the relevant criteria. The variables indicate the relative frequency of the examined features in each document, scribe, and writing place. For the scribes and writing places, the percentages are calculated by counting the occurrences within all the documents written by that scribe or in that place, respectively. The resulting linguistic variables are continuous, hence the practicality of the gradient colouring. In order to ground the colouring in the statistical dispersion of the variable values and to preserve maximal visual effect, I customize the Gephi default red-yellow-blue palette so that the maximal yellow, which stands for the middle of the colour scale, marks the mean of the distribution of each variable. Likewise, the thresholds of the maximal red and maximal blue are set equally far from the mean; I chose that distance to be two standard deviations. In this way, only around 2.5% of the nodes with the lowest and highest values at each end of the distribution are maximally saturated with red or blue, while the rest, around 95%, of the nodes feature a gradient colour, with the maximal yellow in between. Following this rule, I will illustrate the variables both separately and as a sum variable. The images will be available in the poster. The sum variable will be calculated by aggregating the standardized simple variables. The preliminary conclusions include the observation that network visualization, as such, is not a sufficient basis for philological or historical-linguistic argumentation, but if used along with a statistical approach, it can support argumentation by drawing attention to unexpected patterns and – on the other hand – to irregularities. However, it is the geographical layout of the graphs that provides most of the added value compared to traditional approaches: it helps in perceiving patterns that would otherwise go unnoticed. The treebank on which the analyses are based is the Late Latin Charter Treebank (version 2, LLCT2), which consists of 1,040 early medieval Latin documentary texts (c. 480,000 words). The documents were written in historical Tuscia (Tuscany), Italy, between AD 714 and 897, and are mainly sale or purchase contracts or donations, accompanied by a few judgements as well as lists and memoranda. LLCT2 is still under construction, and only its first half is so far provided with the syntactically annotated layer, making that half a treebank proper (i.e. LLCT, version 1). The lemmatization and morphological annotation style are based on the Ancient Greek and Latin Dependency Treebank (AGLDT) style, which can be deduced from the Guidelines for the Syntactic Annotation of Latin Treebanks. Korkiakangas & Passarotti (2011) define a number of additions and modifications to these general guidelines, which were designed for Classical Latin. For a more detailed description of LLCT2 and the underlying text editions, see Korkiakangas (in press). Documents are privileged material for examining the spoken/written interface of early medieval Latin, in which the distance between the spoken and written codes had become considerable by Late Antiquity. The LLCT2 documents have precise dating and location metadata, and they survive as originals.
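A compact sketch of this colour mapping (assuming a simple linear red–yellow–blue interpolation rather than Gephi’s exact palette):

```python
import numpy as np

def gradient_colors(values, k: float = 2.0):
    """Map values to RGB on a red-yellow-blue gradient centred on the
    mean, saturating at +/- k standard deviations, so that roughly 95%
    of nodes receive a gradient colour and only the tails are maximal."""
    v = np.asarray(values, dtype=float)
    z = np.clip((v - v.mean()) / v.std(), -k, k) / k        # in [-1, 1]
    red, yellow, blue = (np.array([1., 0., 0.]),
                         np.array([1., 1., 0.]),
                         np.array([0., 0., 1.]))
    t = z[:, None]
    return np.where(t < 0,
                    red + (t + 1) * (yellow - red),    # red -> yellow
                    yellow + t * (blue - yellow))      # yellow -> blue

# The sum variable: aggregate the standardized simple variables.
def sum_variable(columns):
    return sum((c - c.mean()) / c.std()
               for c in (np.asarray(col, dtype=float) for col in columns))
```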
Bibliography
Adams J. N. Social Variation and the Latin Language. Cambridge University Press (Cambridge), 2013.
Araújo T. and Banisch S. Multidimensional analysis of linguistic networks. In: Mehler A., Lücking A., Banisch S., Blanchard P. and Job B. (eds) Towards a Theoretical Framework for Analyzing Complex Linguistic Networks. Springer (Berlin, Heidelberg), 2016, 107–131.
Bamman D., Passarotti M., Crane G. and Raynaud S. Guidelines for the Syntactic Annotation of Latin Treebanks (v. 1.3), 2007. http://nlp.perseus.tufts.edu/syntax/treebank/ldt/1.5/docs/guidelines.pdf
Barzel B. and Barabási A.-L. Universality in network dynamics. Nature Physics. 2013;9:673–681.
Bergs A. Social Networks and Historical Sociolinguistics: Studies in Morphosyntactic Variation in the Paston Letters. Walter de Gruyter (Berlin), 2005.
Ferrer i Cancho R. Network theory. In: Hogan P. C. (ed.) The Cambridge Encyclopedia of the Language Sciences. Cambridge University Press (Cambridge), 2010, 555–557.
Korkiakangas T. (in press) Spelling variation in historical text corpora: The case of early medieval documentary Latin. Digital Scholarship in the Humanities.
Korkiakangas T. and Lassila M. Abbreviations, fragmentary words, formulaic language: treebanking medieval charter material. In: Mambrini F., Sporleder C. and Passarotti M. (eds) Proceedings of the Third Workshop on Annotation of Corpora for Research in the Humanities (ACRH-3), Sofia, December 13, 2013. Bulgarian Academy of Sciences (Sofia), 2013, 61–72.
Korkiakangas T. and Passarotti M. Challenges in annotating medieval Latin charters. Journal of Language Technology and Computational Linguistics. 2011;26(2):103–114.
Poster [abstract]
Approaching a digital scholarly edition through metadata Svenska litteratursällskapet i Finland r.f. This poster presents a flowchart with an overview of the database structure in the digital critical edition Zacharias Topelius Skrifter (ZTS). It shows how the entity relations open up a possibility for the user to approach the edition from angles other than the texts, using informative metadata through indexing systems. Through this data, a historian can easily capture, for example, events, meetings between people or editions of books, as they are presented in Zacharias Topelius’ (1818–1898) texts. Presented here are both already available features and features in progress. ZTS comprises eight digital volumes hitherto, the first published in 2010. This includes the equivalent of about 8 500 pages of text by Topelius, 600 pages of introduction by the editors and 13 000 annotations. The published volumes cover poetry, short stories, correspondence, children’s textbooks, historical-geographical works and university lectures on history and geography. The edition is freely accessible at topelius.sls.fi. Genres still to be published include children’s books, novels, journalism, academica, diaries and religious texts.
DATABASE STRUCTURE
The ZTS database structure consists of six connected databases: people, places, bibliography, manuscripts, letters and a chronology. So far, the people database contains about 10 000 unique persons, with the possibility of linking them to a family or group level (250 records). It has separate chapters for mythological persons (500 records) and fictive characters (250 records). The geographic database has 6 000 registered places. The bibliographic database has 6 000 editions covering 3 500 different works, and the manuscript database has 1 400 texts on 350 physical manuscripts. The letter database has 4 000 registered letters to and from Topelius, divided among 2 000 correspondences. The chronology of Topelius’ life has 7 000 marked events. The indexing of objects started in 2005, using the FileMaker system. New records are continuously added, and the work of finding more ways to use, link and present the data is in constant progress. Users can freely access the information in database records that link to the published volumes. The bibliographic database is the most complex one. Its structure follows the Functional Requirements for Bibliographic Records (FRBR) model, which means that we distinguish between the abstract work and the published manifestations (editions) of that work. The FRBR model focuses on the content relationship and continuum between the levels; anything regarded as a separate work starts as a new abstract record, from which its own editions are created. Within ZTS, the abstract level has a practical significance in cases where it is impossible to determine to which exact edition Topelius is referring. Also taken into consideration is that, for example, articles and short stories can have their own independent editions as well as being included in other editions (e.g. a magazine, an anthology). This requires two different manifestation levels subordinate to the abstract level: the regular editions, and the texts included in other editions; records of the latter type must always link to records of the former. The manuscript database has a content relationship to the bibliographic database through the abstract entity of a work.
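A minimal sketch of this FRBR-style split between the abstract work and its manifestations (hypothetical field names and titles, not the actual FileMaker schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Work:
    """Abstract level: the work as such, independent of any edition."""
    title: str
    manifestations: List["Manifestation"] = field(default_factory=list)

@dataclass
class Manifestation:
    """A concrete edition, or a text included in another edition."""
    work: Work
    year: Optional[int] = None
    contained_in: Optional["Manifestation"] = None  # e.g. a story in a magazine

# A short story can appear inside an anthology; both records point back
# to their own abstract works, and the included text links to its host.
story = Work("Example story")                       # invented titles
anthology = Manifestation(Work("Example anthology"), 1885)
in_anthology = Manifestation(story, 1885, contained_in=anthology)
story.manifestations.append(in_anthology)
```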
A manuscript text can be regarded as an independent edition of a work in this context (a manuscript that was never published can easily have a future edition added in the bibliographic database). The manuscript text itself might share physical paper with another manuscript text. Therefore, the description of the physical manuscript is created on a separate level in the manuscript database, to which the manuscript text is connected. The letter database follows the FRBR model: an upper level presents the whole correspondence between Topelius and another person, and a subordinate level describes each physical letter within the correspondence. It is possible to attach additional corresponding persons to occasional letters. The people database connects to the letter database and the bibliographic database, creating a one-to-many relationship. Any writer or author has to be in the people database in order to have their information inserted into these two databases. Within the people database there is also a family or group level, where family members can be grouped; unlike in the letter database, however, this is not a superordinate level. The geographic database follows a one-level structure. Places in letters and manuscripts can be linked from the geographic database. The chronology database contains manually added key events from Topelius’ life, as well as short diary entries made by him in various calendars during his life. It also has automatically gathered records from other databases, based on marked dates when Topelius’ works were published or when he wrote a letter or a manuscript. The dates of birth and/or death of family members and close friends can be linked from the people database.
POSSIBILITIES FOR THE USER
Approaching a digital scholarly edition of over 8 500 pages can be a heavy task, and many will likely use the edition more as an object to study than as texts to read. For a user not familiar with the content of the different volumes, but still looking for specific information, advanced searches and indexing systems offer a faster path into the relevant text passages. The information in the ZTS database records provides a picture of Finland in the 19th century as it appears in Topelius’ works and life. A future feature for users is access to this data through an API (Application Programming Interface). This will create opportunities for users to take advantage of the data in any way they want: to create a 19th-century bookshelf, an app for the most popular 19th-century names, or a map of popular student hangouts in 1830s Helsinki. Through the indexes formed by the linked data from the texts, the user can find all the occurrences of a person, a place or a book in the whole edition. One record can build a set of ontological relations, and the user can follow a theme while moving between texts. A search for a person will provide the user with information about where Topelius mentions this person, whether in a letter, in his diaries or in a textbook for schoolchildren, and whether he possibly meets or interacts with the person. Furthermore, the user can see if this person was the author, publisher or perhaps translator of a book mentioned by Topelius in his texts, or if the editors of ZTS have used the book as a source for editorial comments. The user will also be able to get a list of letters the person wrote to or received from Topelius.
The geographic index can help the user create a geographic ontology with an overview of Topelius’ whereabouts through the annotated mentions of places in Topelius’ diaries, letters and manuscripts. The chronology creates a base for a timeline that will not only give the user key events from Topelius’ life, but also link to the other database records. Encoded dates in the XML files (letters, diaries, lectures, manuscripts etc.) can lead the user directly to the relevant text passages. The relation between the bibliographic database and the manuscript database creates a complete bibliography of everything Topelius wrote, including all known manuscripts and editions that relate to a specific work. So far, there are 900 registered independent works by Topelius in the bibliographic database; these works are realised in 300 published editions (manifestations) and 2 900 text versions included in those manifestations or in other independent manifestations. The manuscript database consists of 1 400 manuscript texts. The FRBR model offers different ways of structuring the layout of a bibliography according to the user’s needs, either through the titles of the abstract works with subordinate manifestations, or directly through the separate manifestations. The bibliography can be limited to show only editions published during Topelius’ lifetime, or to include later editions as well. Furthermore, the bibliography points the user to the published texts and manuscripts of a specific work in the ZTS edition and to text passages where the author himself discusses the work in question. The level of detail in the records is high. For example, we register different name forms and spellings (Warschau vs Warszawa). Such information is included in the index search function, which spares the end user problems in finding information. Topelius often uses many different forms and abbreviations, and an advanced search in the texts would seldom give a comprehensive result in these cases. The letter database includes reference words describing the contents of the correspondences. Thus, the possibilities for searching the material are expanded beyond the wording of the original texts.
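To make the FRBR layering described above concrete, here is a minimal sketch in Python dataclasses; the class and field names are illustrative assumptions, not the edition's actual FileMaker schema:

```python
# A minimal sketch (assumed names, not the ZTS schema): abstract works
# on top, with regular editions and texts-included-in-other-editions as
# two manifestation levels, the latter always linking to the former.
from dataclasses import dataclass

@dataclass
class Work:                 # abstract level
    title: str

@dataclass
class Edition:              # regular manifestation of a work
    work: Work
    year: int

@dataclass
class IncludedText:         # a text printed inside another edition
    work: Work
    container: Edition      # must link to a regular edition

# A hypothetical short story appearing inside an anthology edition
story = Work("Example short story")
anthology = Edition(Work("Example anthology"), 1880)
included = IncludedText(story, container=anthology)
```
Poster [publication ready]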
A Tool for Exploring Large Amounts of Found Audio Data KTH Royal Institute of Technology, We demonstrate a method and a set of open source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration will present first versions of a varied set of functionalities and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently. Poster [publication ready]
The PARTHENOS Infrastructure PIN SCrl, PARTHENOS is built around two ERICs from the Humanities and Arts sector, DARIAH and CLARIN, along with ARIADNE, EHRI, CENDARI, CHARISMA and IPERION-CH, and will deliver guidelines, standards, methods, pooled services and tools to be used by its partners and the whole research community. Four broad research communities are addressed: History; Linguistic Studies; Archaeology, Heritage and Applied Disciplines; and the Social Sciences. By identifying common needs, PARTHENOS will support cross-disciplinary research and provide innovative solutions. By applying the FAIR data principles to structure the work on common policies and standards, the project has produced tools to assist researchers in finding and applying the appropriate ones for their areas of interest. A virtual research environment will enable the discovery and use of data and tools, and further support is provided through a set of online training modules. Poster [abstract]
Using rolling.classify on the Sagas of Icelanders: Collaborative Authorship in Bjarnar saga Hítdælakappa Russian State Academy of Science, Institute of Slavonic Studies This poster will present the results of an application of the rolling.classify function in Stylo (R) to a source with unknown authorship and an extremely poor textual history – Bjarnar saga Hítdælakappa, one of the medieval Sagas of Icelanders. This case study sets aside the authorship attribution goal usual for Stylo and concentrates on the composition of the main witness of Bjarnar saga, ms. AM 551 d α, 4to (17th c.), which was the source for most of the existing copies of Bjarnar saga. It aims not only to find and visualise new arguments for the working hypothesis about the composition of AM 551 d α, 4to, but also to touch upon the main questions that arise for a student of philology daring to use Stylo on Old Icelandic saga material, i.e. what Stylo tells us, what it does not, and how one can use it while exploring the history of a text that exists only in one source. It has been noticed that Bjarnar saga shows signs of a stylistic change between the first 10 chapters and the rest of the saga – the characters suddenly change their behaviour (Sigurður Nordal 1938, lxxix; Andersson 1967, 137-140), and the narrative becomes less coherent and, as it seems, acquires a new logic of construction (Finlay 1990-1993, 165-171). A more detailed narrative analysis of the saga showed that there is a difference in the usage of certain narrative techniques between the first and the second parts, for example in the narrator’s handling of point of view and the extent of their intervention in the saga text (Glebova 2017, 45-57). Thus, the question is: what is the relationship between the first 10 chapters and the rest of Bjarnar saga? Is the change entirely compositional and motivated by the narrative strategy of the medieval compiler, or is it actually the result of a compilation of two texts by two different authors? As often happens with sagas, the problem is aggravated by the poor preservation of Bjarnar saga. There is not much to compare and work with; most of the saga witnesses are copies of one 17th-c. manuscript, AM 551 d α, 4to (Boer 1893, xii-xiv; Sigurður Nordal 1938, xcv-xcvii; Simon 1966 (I), 19-149). This manuscript also has its flaws, as it contains two lacunae, one at the very beginning of the saga (ch. 1-5,5 in ÍF III) and another in the middle (between ch. 14-15 in ÍF III). The second lacuna is unreconstructable, while the first is usually filled with a fragment from the saga’s short redaction, preserved in copies of a 15th-c. kings’ saga compilation, the Separate Saga of St Olaf in Bœjarbók (Finlay 2000, xlvi), which actually ends right at the 10th chapter of the longer version. It seems that the text of the shorter version is a variant of the longer one (Glebova 2017, 13-17), and it contains a reference indicating that there was more to the story but that it was shortened; the precise relationship between the short and long redactions, however, is impossible to reconstruct due to the lacuna in AM 551 d α, 4to. The existence of the short version with this particular length and content is very important to the study of the composition of Bjarnar saga in AM 551 d α, 4to, as it raises the possibility that the first 10 chapters of AM 551 d α, 4to existed separately at some point in the textual history of Bjarnar saga, or at least that these chapters were seen by the medieval compilers as something solid and complete.
This would be the last word of traditional philology on this case – the state of the sources does not allow us to say more. Thus, is there anything else that could shed some light on the question of whether these chapters existed separately or whether they were written by the same hand? In this study it was decided to try the sequential stylometric analysis available in the Stylo package for R (Eder, Kestemont, Rybicki 2013) as the function rolling.classify (Eder 2015). As we are interested in different parts of the same text, rolling stylometry seems preferable to cluster analysis, which takes the whole text as an entity and compares it to the reference corpus; with rolling stylometry, by contrast, the text is divided into smaller segments, which allows a deeper investigation of the stylistic variation within the text itself (Rybicki, Eder, Hoover 2016, 126). For the analysis, a corpus was made from the two parts of Bjarnar saga and several other Old Icelandic sagas; the whole corpus was taken from sagadb.org in Modern Icelandic normalised orthography. Several tests were conducted, first with one of the parts as a test set and then with the other, with sample sizes from 5000 words down to 2000. The preliminary results show that there is a stylistic division in the saga, as the style of the first part is not present in the second one and vice versa. This would be an additional argument for the idea that the first 10 chapters existed separately and were added by the Bjarnar saga compiler during the saga's construction. One could argue that the division is not authorial but generic, as the first part is set in Norway and deals much with St. Olaf; a change of genre could result in a change of style. However, Stylo counts the most frequent words, which are not particularly genre-specific (like og, að, etc.); thus, collaborative authorship could still have taken place. This would be an important result in the context of the overall composition of the longer version of Bjarnar saga, as its structure shows traces of very careful planning as well as mirror composition (Glebova 2017, 18-33): could it be that the structure of one of the parts (perhaps the first one) influenced the other? Whatever the case, while sewing together the existing material, the medieval compiler made an effort to create a solid text, and this effort is worth studying with more attention.
Bibliography: Andersson, Theodor M. (1967). The Icelandic Family Saga: An Analytic Reading. Cambridge, MA. Boer, Richard C. (1893). Bjarnar saga Hítdælakappa. Halle. Eder, M. (2015). “Rolling Stylometry.” Digital Scholarship in the Humanities 31(3): 457–469. Eder, M., Kestemont, M., Rybicki, J. (2013). “Stylometry with R: A Suite of Tools.” Digital Humanities 2013: Conference Abstracts. University of Nebraska–Lincoln: 487–489. Finlay, A. (1990-1993). “Nið, Adultery and Feud in Bjarnar saga Hítdælakappa.” Saga-Book of the Viking Society 23: 158-178. Finlay, A. (2000). The Saga of Bjorn, Champion of the Men of Hitardale. Enfield Lock. Glebova, D. (2017). A Case of An Odd Saga: Structure in Bjarnar saga Hítdælakappa. MA thesis, University of Iceland, Reykjavík (http://hdl.handle.net/1946/27130). Rybicki, J., Eder, M., Hoover, David L. (2016). “Computational Stylistics and Text Analysis.” In Doing Digital Humanities: Practice, Training, Research, edited by Constance Crompton, Richard J. Lane, Ray Siemens. London, New York: 123-144. Sigurður Nordal, and Guðni Jónsson (eds.)
“Bjarnar saga Hítdælakappa.” In Borgfirðinga sögur, Íslenzk fornrit 3, 111-211. Reykjavík, 1938. Simon, John LeC. (1966). A Critical Edition of Bjarnar saga Hítdælakappa. Vols. 1-2. Unpublished PhD thesis, University of London.
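To illustrate the rolling approach in a language-agnostic way, here is a minimal Python sketch of the idea (the study itself used Stylo's rolling.classify in R; the file names, window size, step and distance measure below are assumptions for illustration, not the study's settings):

```python
# A minimal sketch of rolling classification: slide a window over a
# text and assign each window to whichever reference style is closest
# in a most-frequent-words feature space.
from collections import Counter

def tokens(path):
    with open(path, encoding="utf-8") as f:
        return f.read().lower().split()

def relfreq(toks, vocab):
    counts = Counter(toks)
    n = len(toks) or 1
    return [counts[w] / n for w in vocab]

def distance(a, b):
    # Manhattan distance over relative frequencies: a crude stand-in
    # for the classifiers rolling.classify offers (Delta, SVM, NSC)
    return sum(abs(x - y) for x, y in zip(a, b))

# Two reference "styles" and a text to scan (hypothetical file names)
part1 = tokens("bjarnar_ch01_10.txt")
part2 = tokens("bjarnar_ch11_end.txt")
test = tokens("test_saga.txt")

# Feature space: the 100 most frequent words of the combined corpus
vocab = [w for w, _ in Counter(part1 + part2 + test).most_common(100)]
c1, c2 = relfreq(part1, vocab), relfreq(part2, vocab)

# Slide a 2000-word window over the test text in 500-word steps
for start in range(0, max(1, len(test) - 2000), 500):
    window = relfreq(test[start:start + 2000], vocab)
    label = "part 1" if distance(window, c1) < distance(window, c2) else "part 2"
    print(f"words {start}-{start + 2000}: closer to {label}")
```
Poster [abstract]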
The Bank of Finnish Terminology in Arts and Sciences – a new form of academic collaboration and publishing University of Helsinki, This presentation concerns the multidisciplinary research infrastructure project “Bank of Finnish Terminology in Arts and Sciences” (BFT) as an innovative form of academic collaboration and publishing. The BFT, which was launched in 2012, aims to build a permanent and continuously updated terminological database for all fields of research in Finland. Content for the BFT is created by niche-sourcing, where participation is limited to a particular group of experts in the participating subject fields. The project maintains a wiki-based website which offers an open and collaborative platform for terminological work and a discussion forum available to all registered users. The BFT thus opens up not only the results but the whole academic procedure, in which knowledge is constantly produced, evaluated, discussed and updated in an ongoing process. The BFT also provides an inclusive arena for all interested people – students, journalists, translators and enthusiasts – to participate in the discussions relating to concepts and terms in Finnish research. Based on the knowledge and experience accumulated during the BFT project, we will reflect on the benefits, challenges, and future prospects of this innovative and globally unique approach. Furthermore, we will consider the possibilities and opportunities opening up especially in terms of digital humanities. Poster [publication ready]
The Swedish Language Bank 2018: Research Resources for Text, Speech, & Society 1University of Gothenburg; 2KTH Royal Institute of Technology; 3The Institute for Language and Folklore We present an expanded version of the Swedish research resource the Swedish Language Bank. The Language Bank, which has supported national and international research for over four decades, will now add two branches, one focusing on speech and one on societal aspects of language, to its existing organization, which targets text. Poster [abstract]
Handwritten Text Recognition and 19th Century Court Records National Archives Finland, This paper will demonstrate how the READ project is developing new technologies that allow computers to automatically process and search handwritten historical documents. These technologies are brought together in the Transkribus platform, which can be downloaded free of charge at https://transkribus.eu/Transkribus/. Transkribus enables scholars with no in-depth technological knowledge to freely access and exploit algorithms which can automatically process handwritten text. Although there is already a rather sound workflow in place, the platform needs human input in order to ensure the quality of the recognition. The technology must be trained by being shown examples of images of documents and their accurate transcriptions. This helps it to understand the patterns which make up characters and words. This training data is used to create a Handwritten Text Recognition model which is specific to a particular collection of documents. The more training data there is, the more accurate the Handwritten Text Recognition can become. Once a Handwritten Text Recognition model has been created, it can be applied to other pages from the same collection of documents. The machine analyses the image of the handwriting and then produces textual information about the words and their position on the page, providing best guesses and alternative suggestions for each word, with measures of confidence. This process allows Transkribus to provide automatic transcription and full-text search of a document collection at high levels of accuracy. For the quality of the text recognition, the amount of training material is paramount. Current tests suggest that models for a specific style of handwriting can reach a Character Error Rate of less than 5%. Transcripts with a Character Error Rate of 10% or below can generally be understood by humans and used for adequate keyword searches. A low Character Error Rate also makes it relatively quick and easy for human transcribers to correct the output of the Handwritten Text Recognition engine. These corrections can then be fed back into the model in order to make it more accurate. These levels also compare favorably with Optical Character Recognition, where 95-98% accuracy for early prints is possible. Of even more interest is the fact that a well-trained model is able to sustain a certain amount of variation in handwriting. Therefore, it can be expected that, with a large amount of training material, it will be possible to recognize the writing of an entire epoch (e.g. eighteenth-century English writing), in addition to that of specific writers. The case study of this paper is the Finnish court records from the 19th century. The notification records, which contain cases concerning guardianships, titles and marriage settlements, form an enormous collection of over 600 000 pages. Although the material is in digital form, its usability is still poor due to the lack of indices or finding aids. With the help of Handwritten Text Recognition, the National Archives has the chance to provide the material in computer-readable form, which allows users to search and use the records in a whole new way.
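As an illustration of the evaluation measure cited above, here is a minimal Python sketch of the Character Error Rate: the Levenshtein edit distance between the recognised text and a ground-truth transcription, divided by the length of the ground truth (the example strings are invented):

```python
# A minimal sketch of Character Error Rate (CER) as used for HTR output.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(recognised: str, ground_truth: str) -> float:
    return levenshtein(recognised, ground_truth) / max(1, len(ground_truth))

# A CER of 0.10 or below is the threshold the paper cites for adequate
# keyword search; one substitution in a short line gives a small CER:
print(cer("karäjät pidettiin", "käräjät pidettiin"))
```
Poster [publication ready]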
An approach to unsupervised ontology term tagging of dependency-parsed text using a Self-Organizing Map (SOM) University of Helsinki Tagging ontology-based terms in existing text content is a task that often requires human effort. Each ontology may have its own structure and schema for describing terms, making automation non-trivial. I suggest a machine learning estimation technique for term tagging which can learn semantic tagging from a set of sample ontologies with given textual examples, and expand its use to analyzing a large text corpus by comparing the syntactic features found in the text. The tagging technique is based on dependency-parsed text input and an unsupervised machine learning model, the Self-Organizing Map (SOM).
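A minimal sketch of the idea in Python, assuming the third-party minisom package; the hashed bag-of-dependency-relations features and the toy labels are stand-ins for illustration, not the author's actual feature design:

```python
# Map feature vectors of terms onto a SOM, remember which label "won"
# each cell during training, then tag unseen terms by their cell.
import numpy as np
from minisom import MiniSom

def features(dep_relations, dim=64):
    # Hash dependency-relation strings (e.g. "nsubj:treats->drug")
    # into a fixed-size, normalised vector
    v = np.zeros(dim)
    for rel in dep_relations:
        v[hash(rel) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Labelled examples drawn from sample ontologies (hypothetical data)
train = [(features(["nsubj:treats->drug", "obj:treats->disease"]), "medication"),
         (features(["nsubj:flows->river", "nmod:through->valley"]), "waterbody")]

som = MiniSom(5, 5, 64, sigma=1.0, learning_rate=0.5, random_seed=42)
som.train_random(np.array([x for x, _ in train]), 500)

cell_labels = {som.winner(x): label for x, label in train}
unseen = features(["nsubj:cures->drug"])
print(cell_labels.get(som.winner(unseen), "unknown"))
```
Poster [abstract]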
Comparing Topic Model Stability Between Finnish, Swedish and French University of Helsinki
1 Abstract
In recent years, topic modelling has gained increasing attention in the humanities. Unfortunately, little has been done to determine whether the output produced by this range of probabilistic algorithms reveals signal or merely produces noise, or how well it performs on languages other than English. In this paper, we set out to compare topic models of parallel corpora in Finnish, Swedish, and French, and propose a method to determine how well topic modelling algorithms perform on those languages.
2 Context
Topic modelling (TM) is a well-known (following the work of (4; 5)) yet badly understood range of algorithms within the humanities. While a variety of studies within the humanities make use of topic models to answer historical questions (see (2) for a thorough survey), there is no tried and true method that ascertains that the probabilistic algorithm reveals signal and is not merely responding to noise. The rule of thumb is generally that if the results are interesting and confirm a prior intuition of a domain expert, they are considered correct -- in the sense that they are a valid entry point into a humongous dataset, and that the proper work of historical research is then to be carried out manually on a subset selected by the algorithm. As pointed out in previous work (7; 3), this, combined with the fact that many humanistic corpora are on the small side, means that "the threshold for the utility of topic modelling across DH projects is as yet highly unclear." Similarly, topic instability "may lead to research being based on incorrect foundational assumptions regarding the presence or clustering of conceptual fields on a body of work or source material" (3). Whilst topic modelling techniques are considered language-independent, i.e. "use[] no manually constructed dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphologies, or the like" (6), they encode key assumptions about the statistical properties of language. These assumptions are often developed with English in mind and generalised to other languages without much consideration. We maintain that these algorithms are not language-independent, but language-agnostic at best, and that accounting for discrepancies in how different languages are processed by the same algorithms is necessary basic research for more applied, context-oriented research -- especially for the historical development of public discourses in multilingual societies, or phenomena where structures of discourse flow over language borders. Indeed, some languages rely heavily on compounding -- the creation of a word through the combination of two or more stems -- in word formation, while others use determiners to combine simple words. If one considers a white space as the delimitation between words (as is usually done with languages making use of the Latin alphabet), the first tendency results in a richer vocabulary than the second, hence influencing TM algorithms that follow the bag-of-words approach. Similarly, differences in grammar -- for example, French adjectives must agree in gender and number with the noun they modify, something that does not exist in other languages like English -- reinforce those discrepancies.
Nonetheless, most of this happens in the fuzzy and non-standard preprocessing stage of topic modelling, and the argument could be made that the language neutrality of TM algorithms rests more on their being underspecified with regard to how to pre-process the language. In this paper, we propose to compare topic models on a custom-made parallel corpus in Finnish, Swedish, and French. By selecting those languages, we get a glimpse of how a selection of different languages is processed by TM algorithms. While concentrating on languages spoken in Europe and languages of interest to our collaborative network of linguists, historians and computer scientists, we are still able to examine two crucial variables: one of genetic and one of cultural relatedness. French and Swedish belong to Indo-European (the Romance and Germanic branches, respectively), while Finnish is a Finno-Ugrian language. Finnish and Swedish, on the other hand, share a long history of close language contact and cultural convergence. Because of this, Finnish contains a large number of Swedish loan words and, perceivably, similar conceptual systems.
3 Methodology
To explore our hypothesis, we use a parallel corpus of born-digital textual data in Finnish, Swedish, and French. Once the corpus is constituted, it becomes possible to apply LDA (1) and HDA (9) -- LDA is parametrised by humans, whereas HDA will attempt to automatically determine the best configuration possible. The resulting models for each language are stored, the corpora reduced in size, LDA is re-applied, the models are stored, the corpora re-reduced, etc. Topic models are compared manually between languages at each stage, and programmatically between stages, using the Jaccard index (8), for all languages. The same workflow is then applied to the lemmatised versions of the above-mentioned corpora, and the results compared.
Bibliography
[1] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
[2] Brauer, R., Fridlund, M.: Historicizing topic models: a distant reading of topic modeling texts within historical studies. In: International Conference on Cultural Research in the Context of "Digital Humanities", St. Petersburg: Russian State Herzen University (2013)
[3] Hengchen, S., O'Connor, A., Munnelly, G., Edmond, J.: Comparing topic model stability across language and size. In: Proceedings of the Japanese Association for Digital Humanities Conference 2016 (2016)
[4] Jockers, M.L.: Macroanalysis: Digital Methods and Literary History. University of Illinois Press (2013)
[5] Jockers, M.L., Mimno, D.: Significant themes in 19th-century literature. Poetics 41(6), 750–769 (2013)
[6] Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2-3), 259–284 (1998)
[7] Munnelly, G., O'Connor, A., Edmond, J., Lawless, S.: Finding meaning in the chaos (2015)
[8] Real, R., Vargas, J.M.: The probabilistic basis of Jaccard's index of similarity. Systematic Biology 45(3), 380–385 (1996)
[9] Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476), 1566–1581 (2006)
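As a rough illustration of the proposed stability check, here is a minimal Python sketch assuming scikit-learn: fit LDA on a corpus and on a reduced version of it, then compare each topic's top words with the Jaccard index (the toy corpus stands in for the paper's parallel corpus):

```python
# Fit LDA twice (full vs reduced corpus) and measure topic overlap
# with the Jaccard index over each topic's top words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = (["riksdag debatt lag budget", "eduskunta laki keskustelu budjetti",
         "parlement loi debat budget", "skola larare elev bok"] * 25)

vec = CountVectorizer().fit(docs)
vocab = vec.get_feature_names_out()
X = vec.transform(docs)

def top_words(model, k=5):
    # top-k words of each topic as sets, for Jaccard comparison
    return [frozenset(vocab[i] for i in comp.argsort()[-k:])
            for comp in model.components_]

def jaccard(a, b):
    return len(a & b) / len(a | b)

lda_full = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
lda_half = LatentDirichletAllocation(n_components=3, random_state=0).fit(X[:50])

# For each full-corpus topic, report its best match in the reduced run
for topic in top_words(lda_full):
    print(sorted(topic), max(jaccard(topic, other) for other in top_words(lda_half)))
```
Poster [abstract]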
ARKWORK: Archaeological practices and knowledge in the digital environment 1Uppsala University,; 2University of Helsinki; 3University of Toronto; 4Vilnius University; 5University of Glasgow; 6University of Venice; 7Umeå University; 8University of Copenhagen; 9Independent researcher Archaeology and material cultural heritage have often enjoyed a particular status as a form of heritage that has captured the public imagination. As researchers from many backgrounds have discussed, it has become the locus for the expression and negotiation of European, local, regional, national and intra-national cultural identities, for public policy regarding the preservation and management of cultural resources, and for societal value in the context of education, tourism, leisure and well-being. The material presence of objects and structures in European cities and landscapes, the range of archaeological collections in museums around the world, the monumentality of the major archaeological sites, and the popular and non-professional interest in the material past are only a few of the reasons why archaeology has become a linchpin in the discussions on how emerging digital technologies and digitization can be leveraged for societal benefit. However, at a time when nations and the European community are making considerable investments in creating technologies, infrastructures and standards for digitization, preservation and dissemination of archaeological knowledge, critical understanding of the means and practices of knowledge production in and about archaeology from complementary disciplinary perspectives and across European countries remains fragmentary, and in urgent need of concertation. In contrast to the rapid development of digital infrastructures and tools for archaeological work, relatively little is known about how digital information, tools and infrastructures are used by archaeologists and other users and producers of archaeological information, such as archaeological and museum volunteers, avocational hobbyists, and others. Digital technologies (infrastructures, methods and resources) are reconfiguring aspects of archaeology across and beyond the lifecycle (i.e., also "in the wild"), from archaeological data capture in fieldwork to scholarly publication and community access/entanglement. Both archaeologists and researchers in other fields, from disciplines such as museum studies, ethnology, anthropology, information studies and science and technology studies, have conducted research on the topic, but so far their efforts have tended to be somewhat fragmented and anecdotal. This is surprising, as the need for a better understanding of archaeological practices and knowledge work has been identified for many years as a major impediment to realizing the potential of infrastructural and tools-related developments in archaeology. The shifts in archaeological practice, and in how digital technology is used for archaeological purposes, call for a radically transdisciplinary (if not interdisciplinary) approach that brings together perspectives from reflexive, theoretically and methodologically aware archaeology, information research, and sociological, anthropological and organizational studies of practice.
This poster presents the COST Action “Archaeological practices and knowledge work in the digital environment” (http://www.cost.eu/COST_Actions/ca/CA15201 - ARKWORK), an EU-funded network which brings together researchers, practitioners and research projects studying archaeological practices, knowledge production and use, and the social impact and industrial potential of archaeological knowledge, in order to present and highlight the ongoing work on the topic around Europe. ARKWORK (https://www.arkwork.eu/) consists of four Working Groups (WGs), with the common objective of discussing and practising the possibilities for applying the understanding of archaeological knowledge production to tackle ongoing societal challenges and to develop appropriate management and leadership structures for archaeological heritage. The individual WGs have the following specific but complementary themes and objectives:
WG1 - Archaeological fieldwork. Objectives: To bring together and develop the international transdisciplinary state of the art of current multidisciplinary research on archaeological fieldwork: how archaeologists conduct fieldwork and document their work and findings in different countries and contexts, and how this knowledge can be used to contribute to developing fieldwork practices and the use and usability of archaeological documentation by the different stakeholder groups in society.
WG2 - Knowledge production and archaeological collections. Objectives: To integrate and push forward the current state of the art in understanding and facilitating the use and curation of (museum) collections and repositories of archaeological data for knowledge production in society.
WG3 - Archaeological knowledge production and global communities. Objectives: To bring together and develop the current state of the art on global communities (including indigenous communities, amateurs, the neo-paganism movement, geographical and ideological identity networks, etc.) as producers and users in archaeological knowledge production, e.g. in terms of highlighting community needs, approaches to the communication of archaeological heritage, crowdsourcing and volunteer participation.
WG4 - Archaeological scholarship. Objectives: To integrate and push forward the current state of the art in the study of archaeological scholarship, including academic, professional and citizen-science-based scientific and scholarly work.
In our poster we outline each of the working groups and provide a clear overview of the purposes and aspirations of the COST Action network ARKWORK. Poster [publication ready]
Research and development efforts on the digitized historical newspaper and journal collection of The National Library of Finland University of Helsinki, Finland, The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12 million pages, mainly in Finnish and Swedish. Of these, about 5.1 million pages are freely available on the website digi.kansalliskirjasto.fi (Digi). The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1920. The last ten years, 1911–1920, were opened in February 2017. The digitized collection of the NLF is part of a globally expanding network of library-produced historical data that offers researchers and lay persons insight into the past. In 2012 it was estimated that there were about 129 million pages and 24 000 titles of digitized newspapers in Europe [1]. A very conservative estimate of the worldwide number of titles is 45 000 [2]. The actual amount of available data is probably already much larger, as national libraries have been working steadily on digitization in Europe, North America and the rest of the world. This paper presents work that has been carried out at the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data. Poster [abstract]
Medieval Publishing from c. 1000 to 1500 Helsinki University Medieval Publishing from c. 1000 to 1500 (MedPub) is a five-year project funded by the European Research Council, based at Helsinki University and running from 2017 to 2022. The project seeks to define the medieval act of publishing, focusing on Latin authors active during the period from c. 1000 to 1500. Part of the project is to establish a database of networks of publishing. The proposed paper will discuss the main aspects of the projected database and the process of data-gathering. MedPub’s research hypothesis is that publication strategies were not a constant but were liable to change, and that different social, literary, institutional and technical milieux fostered different approaches to publishing. As we have already proved this proposition, the project is now advancing toward the next step, the ultimate aim of which is to complement the perception of societal and cultural changes that took place during the period from c. 1000 to 1500. For the purposes of that undertaking, we define ‘publishing’ as a social act involving at least two parties, an author and an audience, not necessarily always brought together. The former prepares a literary work and then makes it available to the latter. In practice, medieval publishing was often a more complex process. It could engage more parties than these two, such as commentators, dedicatees and commissioners. The social status of these networks ranged from mediocre to grand. They could consist of otherwise unknown monks, or they could include popes and emperors. We propose that the composition of such literary networks was broadly reactive to large-scale societal and cultural changes. If so, networks of publishing can serve as a vantage point for the observation of continuity and change in medieval societies. We shall collect and analyse an abundance of data on publishing networks in order to trace how their composition in various contexts may reflect the wider world. It is that last-mentioned aspect that is the subject of this proposal. It is a central fact for this undertaking that medieval works very often include information on dedication, commission and commendation; and that, more often than not, this evidence is uncomplicated to collect, because the statements in question tend to be short and uniform and normally appear in the prefaces and dedicatory letters with which medieval authors often opened their works. What is more, such accounts manifestly indicate a bond between two or more parties. By virtue of these features, the evidence in question can be collected in the quantities needed for large-scale statistical analysis and processed electronically. The function and form of medieval references to dedication and commission, furthermore, remained largely constant. Eleventh-century dedications resemble those from, say, the fourteenth century. By virtue of such uniformity, the data on dedications and commissions may well constitute a unique pool of evidence of social interaction in the Middle Ages, for these data can be employed as statistical evidence in various regional, chronological, social and institutional contexts, something that is very rare in medieval studies. The proposed paper will introduce the categories of information the database is to embrace and put forward for discussion the modus operandi of how the data on dedications and commissions will be harvested. Poster [abstract]
Making a bibliography using metadata National Library of Norway, Norway, In this presentation we discuss how one might create a bibliography using metadata taken from libraries in conjunction with other sources. Since metadata such as topic keywords and Dewey Decimal Classification is digitally available, our focus is on metadata, although we also look at book contents where possible. Poster [abstract]
Network Analysis, Network Modeling, and Historical Big Data: The New Networks of Japanese Americans in World War II University of Helsinki Network analysis has become a promising methodology for studying a wide variety of systems, including historical populations. It brings new dimensions to questions that social scientists and historians might traditionally ask, and allows for new questions that were previously impractical or impossible to answer using traditional methods. The increasing availability of digitized archival material and big data is making it ever more appealing. When coupled with custom algorithms and interactive visualization tools, network analysis can produce remarkable new insights. In my ongoing doctoral research, I am employing network analysis and modeling to study the Japanese American incarceration in World War II (internment). Incarceration and the government-led dispersal of Japanese Americans disrupted the lives of some 110,000 people, including over 70,000 US citizens of Japanese ancestry, for the duration of the war and beyond. Many lost their former homes and enterprises and had to start their lives over after the war. Incarceration also had a very concrete impact on the communities: about 50% of those interned did not return to their old homes. This paper explores the changes that took place in the Japanese American community of the Heart Mountain Relocation Center in Wyoming. I will especially investigate the political networks and power relations of the incarceration community. My aim is twofold: on the one hand, to discuss the changes in networks caused by incarceration and dispersal, and on the other, to address some opportunities and challenges presented by the method for the study of history. Poster [abstract]
SuALT: Collaborative Research Infrastructure for Archaeological Finds and Public Engagement through Linked Open Data 1University of Helsinki, Department of Philosophy, History, Culture and Art Studies; 2Aalto University, Semantic Computing Research Group (SeCo); 3University of Helsinki, HELDIG – Helsinki Centre for Digital Humanities; 4National Board of Antiquities, Library, Archives and Archaeological Collections The Finnish Archaeological Finds Recording Linked Database (Suomen arkeologisten löytöjen linkitetty tietokanta – SuALT) is a concept for a digital web service catering for discoveries of archaeological material made by the public; especially, but not exclusively, metal detectorists. SuALT, a consortium project funded by the Academy of Finland that commenced in September 2017, has key outputs at every stage of its development. Ultimately it provides a sustainable output in the form of Linked Data, continuing to facilitate new public engagements with cultural heritage, and research opportunities, long after the project has ended. While prohibited in some countries, metal detecting is legal in Finland, provided certain rules are followed, such as prompt reporting of finds to the appropriate authorities and the avoidance of legally protected sites. Despite misgivings by some about the value of researching metal-detected finds, others have demonstrated their potential, for example by uncovering previously unknown artefact typologies. Engaging non-professionals with cultural heritage also contributes to the democratization of archaeology and empowers citizens. In Finland, metal detecting has grown rapidly in recent years. In 2011 the Archaeological Collections registered 31 stray finds, single or in assemblages. In 2014, over 2 700 objects were registered; in 2015, nearly 3 000; and in 2016, over 2 500. When finds are reported correctly, their research value is significant. The Finnish Antiquities Act §16 obligates the finder of an object for which the owner is not known, and which can be expected to be at least 100 years old, to submit or report the object and associated information to the National Board of Antiquities (Museovirasto – NBA), the agency responsible for cultural heritage management in Finland. There is also a risk, as finders get older and pass away, that their discoveries and collections will remain unrecorded and that all associated information will be lost permanently. In the current state of the art, while archaeologists increasingly use finds information and other data, utilization is still limited. Data can be hard to find, and the available open data remains fragmented. SuALT will speed up the process of recording finds data. Because much of this data will come from outside formal archaeological excavations, it may shed light on sites and features not usually picked up through ‘traditional’ fieldwork approaches, such as previously unknown conflict sites. The interdisciplinary approach and the inclusion of user research promote collaboration among the infrastructure’s producers, processors and consumers. By linking in with European projects, SuALT not only enables national and regional studies, but also contributes to international and transnational studies. This is significant for studies of different archaeological periods, for which the material culture usually transcends contemporary national boundaries.
Ethical aspects are brought to the fore by the debates around engagement with metal detectorists and other artefact hunters by cultural heritage professionals and researchers, and we address head-on the wider questions around data sharing and knowledge ownership, and of working with human subjects. This includes the issues, as identified by colleagues working on similar projects elsewhere, around the concerns of metal detectorists and other finders about sharing findspot information. Finally, the usability of datasets has to be addressed, considering for example controlled vocabularies to ease object type categorization, interoperability with other datasets, and the mechanics of verification and publication processes. The project is unique in responding to the archaeological conditions in Finland, and in providing solutions to its users’ needs within the context of Finnish society and cultural heritage legislation. While it focuses primarily on the metal detecting community, its results and the software tools developed are applicable more generally to other fields of citizen science in cultural heritage, and even beyond. For example, in many areas of collecting (e.g. coins, stamps, guns, or art), much cultural heritage knowledge as well as many collections are accumulated and maintained by skilful amateurs and private collectors. Fostering collaboration, and integrating and linking these resources with those in national memory organizations, would be beneficial to all parties involved, and points to future applications of the model developed by SuALT. Furthermore, there is scope to integrate SuALT into wider digital humanities networks such as DARIAH (http://www.dariah.eu). Framing SuALT’s development as a consortium enables us to ask important questions even at the development stage, with the benefit of expertise from diverse disciplines and research environments. The benefits of SuALT, aside from the huge potential for regional, national and transnational research projects and international collaboration, are that it offers long-term savings on costs, shares expertise and provides greater sustainability than previously possible. We will explore the feasibility of publishing the finds data through international aggregation portals, such as Europeana (http://www.europeana.eu) for cultural heritage content, as well as working closely with colleagues in countries that already have established national finds databases. The technical implementation also respects the enterprise architecture of Finnish public government. Existing open source solutions are further developed and integrated, for example the GIS platform Oskari.org (http://oskari.org) for geodata, developed by the National Land Survey, together with the Linked Data based Finnish Ontology Service of Historical Places and Maps (http://hipla.fi). SuALT’s data is also disseminated through Finna (http://www.finna.fi), a leading service for searching cultural information in Finland. SuALT consists of three subprojects: subproject I, “User Needs and Public Cultural Heritage Interactions”, hosted by the University of Helsinki; subproject II, “National Linked Open Data Service of Archaeological Finds in Finland”, hosted by Aalto University; and subproject III, “Ensuring Sustainability of SuALT”, hosted by the NBA.
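To make the Linked Data approach concrete, here is a minimal sketch assuming the rdflib library; the namespace, properties and values are illustrative assumptions, not SuALT's actual data model:

```python
# Express a single metal-detector find as RDF triples and serialise
# it as Turtle. Everything below is a hypothetical example record.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

SUALT = Namespace("http://example.org/sualt/")   # hypothetical namespace
g = Graph()

find = URIRef(SUALT["find/00001"])
g.add((find, RDF.type, SUALT.Find))
g.add((find, RDFS.label, Literal("Copper brooch fragment")))
g.add((find, SUALT.foundAt, URIRef(SUALT["place/example-parish"])))
g.add((find, SUALT.period, Literal("Iron Age")))

print(g.serialize(format="turtle"))
```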
The primary aim of SuALT is to produce an open Linked Data service which is used by data producers (namely metal detectorists and other finders of archaeological material), by data researchers (such as archaeologists, museum curators and the wider public), and by cultural heritage managers (NBA). More specifically, the aims are:
a. To discover and analyse the needs of potential users of the resource, and to factor these findings into its development;
b. To develop metadata models and related ontologies for the data that take into account the specific needs of this particular infrastructure, informed by existing models;
c. To develop the Linked Data model in a way that makes it semantically interoperable with existing cultural heritage databases within Finland;
d. To develop the Linked Data model in a way that makes it semantically interoperable with comparable ‘finds databases’ elsewhere in Europe; and
e. To test the data resulting from SuALT through exploratory research of the datasets for archaeological research purposes and for cultural heritage and collection management work.
The project corresponds closely with the strategic plans of the NBA and responds to the growth of metal detecting in Finland. Internationally, it corresponds with the development of comparable schemes in other European countries and regions, such as Flanders (Metaaldetectie en Archeologie – MEDEA, initiated in 2014), and Denmark and the Netherlands (Digitale Metaldetektorfund or DIgital MEtal detector finds – DIME, and Portable Antiquities in the Netherlands – PAN, both initiated in 2016). It takes inspiration from the Portable Antiquities Scheme (PAS) Finds Database (https://finds.org.uk/database) in England and Wales. These all aspire to the ultimate goal of a pan-European research infrastructure, and will work together to seek a larger international collaborative research grant in the future. A contribution of our work in relation to the other European projects is to employ the Linked Data paradigm, which facilitates better interoperability with related datasets, additional data enrichment based on well-defined semantics and reasoning, and therefore better means for analysing and using the finds data in research and applications. The expected scientific impacts are that the process of developing SuALT, including critically analysing comparable resources, user group research, and creating innovative solutions, will in itself produce a rich body of interdisciplinary academic output. This will be disseminated in peer-reviewed journals and at selected conferences across several disciplinary boundaries, including Computer Science, Archaeology and Cultural Heritage Studies. It also links in, at a crucial moment in the development of digital heritage management, with parallel resources elsewhere in Europe. This means not only that a coordinated and international approach can be taken in development, but that it is extremely timely, taking advantage of the opportunity to benefit from the experiences and perspectives of colleagues pursuing similar resources. SuALT ensures that Finnish cultural heritage management is at the forefront of digital heritage. The project also carries out a small-scale ‘test’ project using the database as it forms, and in this way contributes to the field of artefact studies. The contribution to future knowledge sits at a number of levels.
There are technical challenges in creating the linked database in a way that complements and is interoperable with existing national and international infrastructures. Solving these challenges generates contributions to the understanding of digital data management and services. The process of consulting users represents an important case study in the formative evaluation of particular interest groups with regard to digital heritage and citizen science, as well as shedding further light on different perceptions and uses of cultural heritage. SuALT relates to the emerging trend of publishing open science data, facilitating the analysis and reuse of the data, exemplified by e.g. DataONE (http://www.dataone.org) and the Open Science Data Cloud (http://www.opensciencedatacloud.org). We hypothesise that SuALT will result in a sustainable digital data resource that responds to different user needs and supports high-quality archaeological research drawing on data from Finland. SuALT also enables integration with comparative data from abroad. Outputs throughout the development process represent important contributions to research into digital heritage applications and semantic computing, addressing the needs of the scientific community. The selected Linked Data methodology is suitable for archaeology and cultural heritage management due to the need to combine and connect heterogeneous data collections in the field (e.g. museum collections, finds databases abroad) and other datasets, such as vocabularies of places, persons and time periods, benefiting cultural heritage professionals. Publishing the finds database as open data using standardised metadata formats facilitates the data’s re-use, fostering new research by the scientific community but also the development of novel applications for professionals and citizens. Taking a strategic approach to the challenge of creating this resource, and treating it as a research project rather than developing an ad hoc resource, ensures that the project’s legacy is a significant and long-term contribution to the digital curation of public-generated archaeological data. As its key societal impact, SuALT provides a vital interface for non-professionals to contribute to and benefit from Finland’s archaeological record, and to integrate this with comparable datasets from abroad. The project enhances cooperation between non-professionals and cultural heritage managers. Careful user research ensures that SuALT offers means of engagement and access to data and other information that are usable and meaningful to a wide range of users, from metal detectorists and amateur historians through to professional curators, cultural heritage managers and academic researchers, domestically and abroad. SuALT’s results are not limited to metal detecting but have a wider impact: the same key challenges of engaging amateur collectors to collaborate with memory organization experts in citizen science are encountered in virtually all fields of collecting and maintaining tangible and intangible cultural heritage. The process of developing SuALT provides an unprecedented opportunity to research the use of digital platforms to engage the public with archaeological heritage in Finland. Inspired by successful initiatives such as PAS and MEDEA, the potential for individuals to self-record their finds also echoes the emerging use of crowdsourcing for public archaeology initiatives.
Thus, SuALT offers a significant opportunity to contribute to further understanding digital cultural heritage and its uses, including its role within society. It is likely that the coordination of SuALT with digital finds recording initiatives in other countries will lead to a transnational platform for finds recording, giving Finland an opportunity to be at the forefront of digital heritage-based citizen science research and development. Poster [abstract]
Identifying poetry based on library catalogue metadata University of Helsinki, Changes in printing reflect historical turning points: what has been printed, when, where and by whom are all derivatives of contemporary events and situations. An excessive need for war propaganda brings more pamphlets out of the printing presses; university towns produce dissertations, from which scientific development can be deduced; and strict oppression and censorship might allow only religious publications by government-approved publishers. The history of printing has been extensively studied and numerous monographs exist. However, most of the research has consisted of qualitative studies based on close reading, requiring profound knowledge of the subject matter yet still being unable to verify the extent of new innovations. For example, close reading of library catalogues does not reveal, at least easily, the timeline of Luther’s publications, or what portion of books were actually octavo-sized and when the increase in this format occurred. One source for these kinds of studies is national library metadata catalogues, which contain information about physical book size, page counts, publishers, publication places and so forth. These catalogues have been researched using quantitative analysis. The advantage of national library catalogues is that they are often more or less complete, having records of practically everything published in a certain country or linguistic area in a certain time period. The computational approach to them has enabled researchers to connect historical turning points to their effect on printing, and the impact of a new concept has been measured against the number of re-publications, or the spread, of a book introducing a new idea. What is more, linking library metadata to the full text of the books has made it possible to analyze changes in word usage in massive corpora, while still limiting the analysis to relevant books. In all these cases, computational methods work better the more complete the corpus is. However, library catalogues often lack annotations for one reason or another: annotation resources might have been cut at a certain point in time, the annotation rules may have varied between different libraries in cases where catalogues have been amalgamated, or the rules could simply have changed. One area that is particularly important for subcorpus research is genre. The genre field, when annotated for each of the metadata records, could be used to restrict the corpus to contain every one of the books that are needed and nothing more. From this subset there is the possibility of drawing timelines or graphs based on bibliographic metadata, or, where full texts exist, the language or contents of a complete corpus could be analysed. Despite the significance of the genre information, this particular annotation is often lacking. In the English Short Title Catalogue (ESTC) the genre information exists for approximately one fourth of the records. This should be enough for training a machine learning model to deduce the genre information, rather than relying solely on the annotations of librarians. The metadata field containing genre information in the ESTC can contain more than one value. In most cases this means having a category and its subcategories as different values, but not always. Because of the complex definition of genre in the ESTC, this paper focuses on one genre only: poetry.
Besides being a relatively common genre, poetry is also of interest to literary researchers. Having a nearly complete subset of English poetry would allow for large-scale quantitative poetry analysis. The downside of library metadata catalogues is that they contain merely the metadata, not the complete unabridged texts, which would be beneficial for machine learning modeling. I tackled this shortcoming by creating several models, each packed with similar features within its feature set. The main ingredient for these feature sets was a concatenation of the main title and the subtitle from the library metadata. From these concatenations I created one feature set containing easily calculable features known from the earliest stylometric research, such as word counts and sentence lengths. Another set I collected with a bag-of-words method, taking the frequencies of the most common words from a subset of poetry book titles. I also built one set for part-of-speech (POS) tags and another for POS trigrams. Some feature sets were extracted from the other metadata fields. Physical book size, page count, topic, and whether the same author had already published a poetry book proved valuable in the classification. From these feature sets I handpicked the best-performing features into one superset. The resulting model performed very well: despite the compactness of the metadata, the poetry books could be tracked with a precision over 90% and a recall over 86%. I then made another run with the superset to seek out the poetry books which did not have the genre field annotated in the catalogue. Combining the results of the run with close reading revealed over 14,000 unannotated poetry books. I sampled one hundred each of the predicted poetry and non-poetry books to manually estimate the correctness of the predictions, and found an annotation bias in the catalogue. The bias seems to come from the fact that the genre information has been annotated more frequently for broadside poetry books than for other broadsides. Excluding broadsides from my samples, I got a recall of 94% and a precision of 98%. My research strongly suggests that semi-supervised learning can be applied to library catalogues to fill in missing annotations, but this requires close attention to avoid possible pitfalls.
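As a rough illustration of the classification setup, here is a minimal Python sketch assuming scikit-learn, with title text as the only feature set; the toy records stand in for ESTC metadata, and the split sizes are arbitrary:

```python
# Train a classifier on annotated title strings, then predict genre
# for records whose genre field is missing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Toy (title, is_poetry) records standing in for ESTC metadata
records = [("Poems on several occasions", 1),
           ("An elegy on the much lamented death of N.N.", 1),
           ("A sermon preached at the cathedral church", 0),
           ("The history of the late revolution", 0)] * 50
titles = [title for title, _ in records]
labels = [label for _, label in records]

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(titles)

# Fit on the "annotated" part, evaluate on the held-out part
clf = LogisticRegression(max_iter=1000).fit(X[:150], labels[:150])
predicted = clf.predict(X[150:])
print("precision:", precision_score(labels[150:], predicted))
print("recall:", recall_score(labels[150:], predicted))
```
Poster [publication ready]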
Open Digital Humanities: International Relations in PARTHENOS University of Copenhagen, CLARIN ERIC One of the strong instruments for the promotion of Open Science in Digital Humanities is research infrastructures. PARTHENOS is a European research infrastructure project, basically built upon collaboration between two large research infrastructures in the humanities, CLARIN and DARIAH, plus a number of other initiatives. PARTHENOS aims at strengthening the cohesion of research in the broad sector of Linguistic Studies, Humanities, Cultural Heritage, History, Archaeology and related fields. This is the context in which we should see the efforts related to international liaisons. This effort takes its point of departure in the existing international relations, so the first action was to collect information and analyse it along different dimensions. Secondly, we want to analyse the purpose and aims of international collaboration. There are many ideas about how the international network may be strengthened and exploited, so that higher quality is obtained and more data, tools and services are shared. The main task of the next year will be first to agree on a strategy and then to implement it in collaboration with the rest of the project. By doing so, the PARTHENOS partners will be contributing even more to the European Open Science policies. Poster [abstract]
The New Face of Ethnography: Utilizing Cyberspace as an Alternative Study Site University of California, Merced, American adoption has a familiar mission to find families for children but becomes strange when turned on its head and exposed as an institution that instead finds children for families who are willing to pay any price for a child. Its evolution, from orphan trains to open adoptions, has answered questions about biological associations but has conflated the interconnection of identity with conflicting narratives of community, kinship and self. How do the experiences of the adoption constellation reconceptualize the national image of adoption as a win-win solution to a social problem? My research explores the language utilized in multiple adoption narratives to determine individual and universal feelings that adoptees, birth parents, and adoptive parents experience regarding the transfer of children in the United States and the long-term emotional outcomes for these groups. My unique approach to ethnographic research includes a hybrid digital and humanistic approach using online and offline interactions to gather data. As is the case with all methodologies, online ethnography presents both benefits and problems. On the plus side, online communities break down the walls of networks, creating digitally mediated social spaces. The Internet provides a platform for social interactions where real and virtual worlds shift and conflate. Social interactions in cybernetic environments present another option for social researchers and offer significant advantages for data collection, collaboration, and maintenance of research relationships. For some research subjects, such as members of the adoption constellation, locating target groups presents challenges for domestic adoption researchers. Online groups such as Facebook pages dedicated to specific members of the adoption triad offer a resolution to this challenge, acting as self-sorted focus groups with participants eager to provide their narratives and experiences. Ethnography involves understanding how people experience their lives through observation and non-directed interaction, with the goal of observing participants' behavior and reactions on their own terms; this can be achieved through the presumed anonymity of online interaction. Electronic ethnography provides valuable insights and data; however, on the negative side, the danger of groupthink in Facebook communities can both attract and generate homogeneous experiences regarding adoption issues. I argue that the benefits of online ethnography outweigh the problems and that it can provide important, previously unexpressed views to better analyze topics such as the adoption experience. Social interactions in cybernetic environments offer significant advantages for data collection, collaboration, and maintenance of research relationships, as these environments remain a fluid yet stable alternate social space. Late-Breaking Work
Elias Lönnrot Letters Online Finnish Literature Society The correspondence of Elias Lönnrot (1802–1884, doctor, philologist and creator of the national epic Kalevala) comprises 2,500 letters or drafts written by Lönnrot and 3,500 letters received. Elias Lönnrot Letters Online (http://lonnrot.finlit.fi/omeka/), first published in April 2017, is the culmination of several decades of research, of transcribing and digitizing letters and of writing commentaries. The online edition is designed not only for those interested in the life and work of Lönnrot himself, but more generally for scholars and the general public interested in the work and mentality of the Finnish 19th-century nationalistic academic community, their language practices both in Swedish and in Finnish, and in the study of epistolary culture. The rich, versatile correspondence offers source material for research in biography, folklore studies and literary studies; for general history as well as medical history and the history of ideas; for the study of ego documents and networks; and for corpus linguistics and the history of language. As of January 2018, the edition contains about 2,000 letters and drafts of letters sent by Lönnrot, mostly private letters. The official letters, such as the medical reports submitted by Lönnrot in his office as a physician, will be added during 2018. The final stage will involve finding a suitable way of publishing the approximately 3,500 letters that Lönnrot received. The edition is built on the open-source publishing platform Omeka. Each letter and draft of a letter is published as facsimile images and an XML/TEI5 file, which contains metadata and a transcription. The letters are organised into collections according to recipient, with the exception of, for example, Lönnrot's family letters, which are published in a single collection. An open text search covers the metadata and transcriptions. This is a faceted search powered by Apache Solr, which allows limiting the initial search by collection, date, language, type of document and writing location. In addition, Omeka's own search can be used to find letters based on a handful of metadata fields. The solutions adopted for the Lönnrot edition differ in some respects from the established practices of digital publishing of manuscripts in the humanities. In particular, the TEI encoding of the transcriptions is lighter than in many other scholarly editions. Lönnrot's own markings – underlinings, additions, deletions – and unclear and indecipherable sections in the texts are encoded, but place and personal names are not. This is partially due to the extensive amount of work such detailed encoding would require, partially because the open text search provides quick and easy access to the same information. The guiding principle of Elias Lönnrot Letters is openness of data. All the data contained in the edition is made openly available. Firstly, the XML/TEI5 files are available for download, and researchers and other users are free to modify them for their own purposes. The users can download the XML/TEI5 files of all the letters, or of a smaller section such as an individual collection. The feature is also integrated in the open text search, and can be used both for all the results produced by a search and for a smaller section of the results limited by one or more facets.
Thus, an individual researcher can download the XML files of the letters and study them, for example, with the linguistic tools provided by the Language Bank of Finland. Similarly, the raw data is available for processing and modifying by those researchers who use and develop digital humanities tools and methods to solve research questions. Secondly, the letter transcriptions are made available for download as plain text. Data in this format is needed for qualitative analysis tools like Atlas. In addition, researchers in the humanities do not all need XML files but will benefit from the ability to store relevant data in an easily readable format. Thirdly, users of the edition can export the statistical data contained in the facet listing of each search result for processing and visualization with tools like Excel. Statistical data like this is significant in handling large masses of data, as it can reveal aspects that would remain hidden when examining individual documents. For example, it may be relevant to a researcher in what era and with whom Lönnrot primarily discussed a given theme. The statistical data of the facet search readily reveals such information, while compiling such statistics by manually going through thousands of letters would be an impossibly long process. The easy availability of data in Elias Lönnrot Letters Online will hopefully foster collaboration and enrich research in general. The SKS is already collaborating with FIN-CLARIN and the Language Bank, which have received the XML/TEI5 files. As Lönnrot's letters form an exceptionally large collection of manuscripts written by one hand, a section of the letters together with their transcriptions was given to the international READ project, which is working to develop machine recognition of old handwritten texts. A third collaborating partner is the project "STRATAS – Interfacing structured and unstructured data in sociolinguistic research on language change".
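To make the encoding policy tangible, the following hedged sketch shows what such a lightly encoded XML/TEI5 transcription might look like and how a downloaded file could be processed in Python. The sample markup and tag choices are illustrative assumptions, not the edition's actual schema:

```python
# Parse a (hypothetical) lightly encoded TEI transcription: underlinings,
# additions, deletions and unclear passages are marked, names are not.
import xml.etree.ElementTree as ET

sample = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <p>Dear brother, I thank you <hi rend="underline">heartily</hi> for the
  letter I received <del>yesterday</del> <add>the day before</add>;
  the ending is <unclear>illegible</unclear>.</p>
</text>"""

TEI = "{http://www.tei-c.org/ns/1.0}"
root = ET.fromstring(sample)
for tag in ("hi", "del", "add", "unclear"):
    for el in root.iter(TEI + tag):
        print(tag, "->", (el.text or "").strip())
```

Since place and personal names are left unencoded, the edition's open text search, rather than the markup, is the route to that information.

Late-Breaking Work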
KuKa Digi -project University of Helsinki This poster presents a sample of the Cultural Studies BA program's Digital Leap project, called KuKa Digi. The Digital Leap is a university-wide project that aims to support digitalization in both learning and teaching in the new degree programs at the University of Helsinki. For more information on the University of Helsinki's Digital Leap program, please refer to: http://blogs.helsinki.fi/digiloikka/ . The new Bachelor's Program in Cultural Studies was among the projects selected for the 2018-2019 round of the Digital Leap. The primary goal of the KuKa Digi project is to produce meaningful digital material for both teaching and learning purposes. The KuKa Digi project aims to develop the program's courses, learning environments and materials in a more digital direction. Another goal of the project is to produce an introductory MOOC course on Cultural Studies for university students, as well as students studying for their A-levels who may be planning to apply for the Cultural Studies BA program. Finally, we will write a research article to assess the use of digital environments in teaching and learning processes within the Cultural Studies BA program. The KuKa Digi project encourages students and teachers to co-operatively plan digital learning environments that are also useful in building up students' academic portfolios and enhancing their working-life skills. The core idea of the project is to create a digital platform or database for teachers, researchers and students in the field of Cultural Studies. Academic networking sites do exist; however, they are not without issues. Many of them are either not accessible or not very useful for students who have not yet developed their academic careers very far. In addition, some of these sites are only partially free of charge. The digital platform will act as a place where students, teachers and researchers alike can have the opportunity to network, advertise their expertise and specialization, and come into contact with the media, cultural agencies, companies and much more. The general vision for this platform is that it will be user-friendly and flexible, and act as an "academic LinkedIn". The database will be available in Finnish, Swedish and English. The database will include the current students, teachers and experts who are associated with the program. Furthermore, the platform will include a feature called the digital portfolio. This will be especially useful for our students, as it is intended to be a digital tool with which they can develop their own expertise within the field of Cultural Studies. Finally, the portfolio will act as a digital business card for the students. The project poster presented at the conference illustrates the ideas and concepts for the platform in more detail. For more information on the project and its other goals, please refer to the project blog at: http://blogs.helsinki.fi/kuka-digi/ Late-Breaking Work
Topic modelling and qualitative textual analysis University of Helsinki, The pursuit of big data is transforming qualitative textual analysis – a laborious activity that has conventionally been executed manually by researchers. Access to data of unprecedented scale and scope has created a need to both analyse large data sets efficiently and react to their emergence in a near-real-time manner (Mills, 2017). As a result, research practices are also changing. A growing number of scholars have experimented with using machine learning as the main or complementary method for text analysis. Even if the most audacious assumptions 'on the superior forms of intelligence and erudition' of big data analysis are today critically challenged by qualitative and mixed-method researchers (Mills, 2017: 2), it is imperative for scholars using qualitative methods to consider the role of computational techniques in their research (Janasik, Honkela and Bruun, 2009). Social scientists are especially intrigued by the potential of topic modelling (TM), a machine learning method for big data analysis (Blei, 2012), as a tool for the analysis of textual data. This research contributes to a critical discussion in social science methodologies: how topic modelling can concretely be incorporated into existing processes of qualitative textual analysis and interpretation. Some recent studies have paid attention to the methodological dimensions of TM vis-à-vis textual analysis. However, these developments remain sporadic, exemplifying a need for a systematic account of the conditions under which TM can be useful for social scientists engaged in textual analysis. This paper builds upon the existing discussions and takes a step further by comparing the assumptions, analytical procedures and conventional usage of qualitative textual analysis methods and TM. Our findings show that for content and classification methods, embedding TM into the research design can partially and, arguably, in some cases fully automate the analysis. Discourse and representation methods can be augmented with TM in a sequential mixed-method research design. Summing up, we see avenues for TM both in embedded and sequential mixed-method research designs. This is in line with previous work on mixed-method research that has challenged the traditional assumption of there being a clear division between qualitative and quantitative methods. Scholarly capacity to craft a robust research design depends on researchers' familiarity with specific techniques, their epistemological assumptions, and good knowledge of the phenomena that are being investigated to facilitate the substantial interpretation of the results. We expect this research to help identify and address the critical points, thereby assisting researchers in the development of novel mixed-method designs that unlock the potential of TM in qualitative textual analysis without compromising methodological robustness. Blei, D. M. (2012) 'Probabilistic topic models', Communications of the ACM, 55(4), p. 77. Janasik, N., Honkela, T. and Bruun, H. (2009) 'Text Mining in Qualitative Research', Organizational Research Methods, 12(3), pp. 436–460. Mills, K. A. (2017) 'What are the threats and potentials of big data for qualitative research?', Qualitative Research, p. 146879411743346.
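As a concrete point of reference for readers unfamiliar with TM, the following minimal sketch fits an LDA topic model with gensim; the toy corpus and parameter choices are assumptions made for illustration, not the paper's materials:

```python
# Fit a tiny LDA topic model; in a mixed-method design the resulting
# topics would then be inspected and interpreted qualitatively.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["interview", "narrative", "identity", "discourse"],
    ["survey", "respondent", "attitude", "policy"],
    ["discourse", "representation", "media", "identity"],
    ["policy", "survey", "welfare", "attitude"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```

Late-Breaking Work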
Local Letters to Newspapers - Digital History Project University of Tampere, The Centre of Excellence in the History of Experiences (HEX) Local Letters to Newspapers is a digital history project of the Academy of Finland Centre of Excellence in the History of Experiences HEX (2018–2025), hosted by the University of Tampere. The objective is to make available a new kind of digital research material on Finnish society in the 19th and early 20th centuries. The aim is to introduce a database of readers' letters submitted to the Finnish press that can be studied both qualitatively and quantitatively. The database will allow analyzing the 19th- and 20th-century global reality through a case study of Finnish society. It will enable a wide range of research topics and open a path to various research approaches, especially the study of human experiences. Late-Breaking Work
Lessons Learned from Historical Pandemics. Using crowdsourcing 2.0 and Citizen Science to map the Spanish Flu's spatial and social network. Aarhus City Archives By Søren K. Poder, MA in History, & Astrid Lykke Birkving, MA in Intellectual History, Aarhus City Archives | Redia a/s In 1918 the world was struck by the most devastating disease in recorded history – today known as the Spanish Flu. In less than one year nearly two-thirds of the world's population came down with influenza, of which between forty and one hundred million people died. The Spanish Flu of 1918 did not originate in Spain, but most likely on the North American east coast in February 1918. By the middle of March, the influenza had spread to most of the overcrowded American army camps, from where it was soon carried to the trenches in France and the rest of the world. This part of the story is well known. In contrast, the diffusion of the 1918 pandemic – and of the seasonal epidemics, for that matter – on the regional and local level is still largely obscure. For instance, explanations of why epidemics evidently tend to follow significantly different paths in different urban areas that otherwise seem to share a common social, commercial and cultural profile tend to be theoretical rather than based on evidence, for one sole reason – the lack of adequate data. As part of the continuing scientific interest in historical epidemics, the purpose of this research project is to identify the social, economic and cultural preconditions that most likely determine a given type of locality's ability to spread or halt an epidemic's hierarchical diffusion. Crowdsourcing 2.0 To this end, large amounts of data from a variety of different historical sources have to be collected and linked together. To do this we use traditional crowdsourcing techniques, where volunteers participate in transcribing different historical documents: death certificates, censuses, patient charts etc. But just as importantly, the collected transcriptions form the basis for a text recognition ML model that in time will be able to recognize specific entities in a document – persons, places, diagnoses, dates etc.
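As an illustration of the entity-recognition step, the sketch below uses spaCy's pretrained pipeline as a stand-in for the project's own ML model; the sample sentence and the choice of library are assumptions:

```python
# Recognize the kinds of entities the project wants to link across
# death certificates, censuses and patient charts.
import spacy

nlp = spacy.load("en_core_web_sm")  # a Danish pipeline would be used in practice
doc = nlp("Karen Jensen, aged 34, died of influenza in Aarhus on 12 November 1918.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # persons, places, dates, ...
```

Late-Breaking Work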
Analysing Swedish Parliamentary Voting Data University of Gothenburg, We used publicly available data from voting sessions in the Swedish Parliament to represent each member of parliament (MP) as a vector in a space defined by their voting record between the years 2014 and 2017. We then applied matrix factorization techniques that enabled us to find insightful projections of this data. Namely, it allowed the assessment of the level of clustering of MPs according to their party line while at the same time identifying MPs whose voting record is closer to other parties'. It also provided a data-driven multi-dimensional political compass that allows one to ascertain similarities and differences between MPs and political parties. Currently, the axes of the compass are unlabeled and therefore lack a clear interpretation, but we plan to apply language technology to the parliamentary discussions associated with the voting sessions in order to identify the topics associated with these axes.
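The projection step can be sketched in a few lines; the toy voting matrix and its encoding (+1 yea, -1 nay, 0 abstain) are illustrative assumptions, and PCA stands in for whichever factorization the authors used:

```python
# Each MP is a row vector of votes; project to two dimensions to get a
# data-driven "political compass".
import numpy as np
from sklearn.decomposition import PCA

votes = np.array([      # rows: MPs, columns: voting sessions
    [ 1,  1, -1,  0,  1],
    [ 1,  1, -1,  1,  1],
    [-1, -1,  1,  1, -1],
    [-1,  0,  1,  1, -1],
])
compass = PCA(n_components=2).fit_transform(votes)
print(compass)  # 2-D coordinates per MP; axes still need interpretation
```

Late-Breaking Work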
Automated Cognate Discovery in the Context of Low-Resource Sami Languages University of Helsinki 1 Introduction The goal of our project is to automatically find candidates for etymologically related words, known as cognates, for different Sami languages. At first, we will focus on North Sami, South Sami and Skolt Sami nouns by comparing their inflectional forms with each other. The reason why we look at the inflections is that, in Uralic languages, it is common that there are changes in the word stem when the word is inflected in different cases. When finding cognates, the non-nominative stems might reveal more about a cognate relationship in some cases. For example, the South Sami word for arm, gïete, is closer to the partitive of the Finnish word, kättä, than to the nominative form käsi of the same word. The fact that a great deal of previous work already exists related to etymologies of words in different Sami languages [2, 4, 8] provides us with an interesting test bed for developing our automatic methods. The results can easily be validated against databases such as Álgu [1], which incorporates the results of different studies in Sami etymology in a machine-readable database. With the help of a gold corpus, such as Álgu, we can perfect our method to function well in the case of the three aforementioned Sami languages. Later, we can expand the set of languages used to other Uralic languages such as Erzya and Moksha. This is achievable as we are basing our method on the data and tools developed in the Giellatekno infrastructure [11] for Uralic languages. Giellatekno has a harmonized set of tools and dictionaries for around 20 different Uralic languages, allowing us to bootstrap more languages into our method. 2 Related Work In historical linguistics, cognate sets have traditionally been identified using the comparative method, the manual identification of systematic sound correspondences across words in pairs of languages. Along with the rapid increase in digitally available language data, computational approaches to automate this process have become increasingly attractive. Computationally, automatic cognate identification can be considered a problem of clustering similar strings together, according to pairwise similarity scores given by some distance metric. Another approach to the problem is pairwise classification of word pairs as cognates or non-cognates. Examples of common distance metrics for string comparison include edit distance, longest common subsequence, and the Dice coefficient. The string edit distance is often used as a baseline for word comparison, measuring word similarity simply as the number of character or phoneme insertions, deletions, and substitutions required to make one word equivalent to the other. However, in language change, certain sound correspondences are more likely than others. Several methods rely on such linguistic knowledge by converting sounds into sound classes according to phonetic similarity. For example, [15] consider a pair of words to be cognates when they match in their first two consonant classes. In addition to such heuristics, a common approach to automatic cognate identification is to use edit distance metrics with weightings based on previously identified regular sound correspondences. Such correspondences can also be learned automatically by aligning the characters of a set of initial cognate pairs [3, 7].
In addition to sound correspondences, [14] and [6] also utilise semantic information about word pairs, as cognates tend to have similar, though not necessarily equivalent, meanings. Another method heavily reliant on prior linguistic knowledge is the LexStat method [9], requiring a sound correspondence matrix and semantic alignment. However, in the context of low-resource languages, prior linguistic knowledge such as initial cognate sets, semantic information, or phonetic transcriptions is rarely available. Therefore, cognate identification for low-resource languages calls for unsupervised approaches. For example, [10] address this issue by investigating edit distance metrics based on embedding characters into a vector space, where character similarity depends on the set of characters they co-occur with. In addition, [12] investigate several unsupervised approaches such as hidden Markov models and pointwise mutual information, while also combining these with heuristic methods for improved performance. 3 Corpus The initial plan is to base our method on the nominal XML dictionaries for the three Sami languages available in the Giellatekno infrastructure. Apart from just translations, these dictionaries also contain additional lexical information to a varying degree. The additional information which might benefit our research goals includes cognate relationships, semantic tags, morphological information, derivation and example sentences. For each noun in the noun dictionaries, we produce a list of all its inflections in different grammatical numbers and cases. This is done by using a Python library called UralicNLP [5], specialized in NLP for Uralic languages. UralicNLP uses FSTs (finite-state transducers) from the Giellatekno infrastructure to produce the different morphological forms. We are also considering the possibility of including larger text corpora in these languages as part of our method for finding cognates. However, these languages have notoriously small corpora available, which might render them insufficient for our purposes. 4 Future Work Our research is currently at an early stage. The immediate future task is to start implementing different methods based on previous research to solve the problem. We will first start with edit distance approaches to see what kind of information those can reveal, and move towards a more complex solution from there. A longer-term future plan is to include more languages in the research. We are also interested in collaboration with linguists who could take a more qualitative look at the cognates found by our method. This will nourish interdisciplinary collaboration and exchange of ideas between scholars of different backgrounds. We are also committed to releasing the results produced by our method to a wider audience to use and profit from. This will be done by including the results as part of the XML dictionaries in the Giellatekno infrastructure and also by releasing them in an open-access MediaWiki-based dictionary for Uralic languages [13] developed at the University of Helsinki. References 1. Álgu-tietokanta. Saamelaiskielten etymologinen tietokanta (Nov 2006), http://kaino.kotus.fi/algu/ 2. Aikio, A.: The Saami loanwords in Finnish and Karelian. Ph.D. thesis, University of Oulu, Faculty of Humanities (2009) 3. Ciobanu, A.M., Dinu, L.P.: Automatic detection of cognates using orthographic alignment.
In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). vol. 2, pp. 99–105 (2014) 4. Häkkinen, K.: Suomen kirjakielen saamelaiset lainat. Teoksessa Sámit, sánit, sátnehámit. Riepmočála Pekka Sammallahtii miessemánu 21, 161–182 (2007) 5. Hämäläinen, M.: UralicNLP (Jan 2018), https://doi.org/10.5281/zenodo.1143638, doi: 10.5281/zenodo.1143638 6. Hauer, B., Kondrak, G.: Clustering semantically equivalent words into cognate sets in multilingual lists. In: Proceedings of the 5th International Joint Conference on Natural Language Processing. pp. 865–873 (2011) 7. Kondrak, G.: Identification of cognates and recurrent sound correspondences in word lists. TAL 50(2), 201–235 (2009) 8. Koponen, E.: Lappische Lehnwörter im Finnischen und Karelischen. Lapponica et Uralica. 100 Jahre finnisch-ugrischer Unterricht an der Universität Uppsala. Vorträge am Jubiläumssymposium 20.–23. April 1994, pp. 83–98 (1996) 9. List, J.M., Greenhill, S.J., Gray, R.D.: The potential of automatic word comparison for historical linguistics. PloS one 12(1), e0170046 (2017) 10. McCoy, R.T., Frank, R.: Phonologically informed edit distance algorithms for word alignment with low-resource languages. Proceedings of 11. Moshagen, S.N., Pirinen, T.A., Trosterud, T.: Building an open-source development infrastructure for language technology projects. In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013); May 22-24; 2013; Oslo University; Norway. NEALT Proceedings Series 16. pp. 343–352. No. 85, Linköping University Electronic Press, Linköpings universitet (2013) 12. Rama, T., Wahle, J., Sofroniev, P., Jäger, G.: Fast and unsupervised methods for multilingual cognate clustering. arXiv preprint arXiv:1702.04938 (2017) 13. Rueter, J., Hämäläinen, M.: Synchronized mediawiki based analyzer dictionary development. In: Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages. pp. 1–7 (2017) 14. St Arnaud, A., Beck, D., Kondrak, G.: Identifying cognate sets across dictionaries of related languages. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 2519–2528 (2017) 15. Turchin, P., Peiros, I., Murray, G.M.: Analyzing genetic connections between languages by matching consonant classes. Vestnik RGGU. Seriya "Filologiya. Voprosy yazykovogo rodstva", (5 (48)) (2010)
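A first-pass sketch of the edit-distance baseline discussed in the abstract is given below. The tiny word lists are illustrative stand-ins; the project would compare full inflectional paradigms generated with UralicNLP:

```python
# Compare word forms across two Sami languages and keep the closest
# pairs as cognate candidates (plain Levenshtein distance as baseline).
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

north_sami = ["giehta", "guolli"]   # 'hand/arm', 'fish'
south_sami = ["gïete", "guelie"]    # assumed translations for illustration

for ns in north_sami:
    best = min(south_sami, key=lambda ss: edit_distance(ns, ss))
    print(ns, "~", best, "distance:", edit_distance(ns, best))
```

Late-Breaking Work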
Dissertations from Uppsala University 1602-1855 on the internet Uppsala University, Uppsala University Library At Uppsala University Library, a long-term project is under way which aims at making the dissertations, that is theses, submitted at Uppsala University in 1602-1855 easy to find and read on the Internet. The work includes metadata production, scanning and OCR processing as well as publication of images of the dissertations in full-text searchable pdf files. So far, approximately 3,000 dissertations have been digitized and made accessible on the Internet via the DiVA portal, Uppsala University's repository for research publications. All in all, there are about 12,000 dissertations of about 20 pages each on average to be scanned. This work is done by hand, due to the age of the material. The project aims to be completed in 2020. Why did we prioritize dissertations? Even before the project started, dissertations were valued research material, and the physical dissertations were frequently on loan. Their popularity was primarily due to the fact that, generally, studying university dissertations is a great way to study developments and changes in society. In the same way as doctoral theses do today, the older dissertations reflect what was going on in the country, at the University, and in the intellectual Western world on the whole at a certain period of time. The great mass of them makes them especially suitable for comparative and longitudinal studies, and provides excellent chances for scholars to find material little used or not used at all in previous research. Older Swedish dissertations, including those from today's Finland, are also comparatively easy to find. In contrast to many other European libraries with an even longer history, collectors published bibliographies of Swedish dissertations as far back as 250 years ago. Our dissertations are also organized, bound and physically easily accessible. Last year the cataloguing of the Uppsala dissertations was completed according to modern standards in LIBRIS. That made them searchable by subject and by words in titles, which was not possible before. All this made the digitization process smoother than that of many other kinds of cultural heritage material. The digital publication of the dissertations naturally made access to them even easier for University staff and students as well as lifelong learners in Sweden and abroad. How are the dissertations used today? In actual research today, we see that the material is frequently consulted in all fields of history. Dissertations provide scholars in the fields of history of ideas and history of science with insight into the status of a certain subject matter in Sweden in various periods of time, often in relation to the contemporary discussion on the European continent. The same goes for studies in the history of literature and the history of religion. Many of the dissertations examine subjects that remain part of the public debate today, and are therefore of interest for scholars in the political and social sciences. The languages of the dissertations are studied by scholars of Semitic, Classical and Scandinavian languages, and the dissertations often contain the very first editions and translations of certain ancient manuscripts in Arabic and Runic script. There is also a social dimension of the dissertations worthy of attention, as dedications and gratulatory poems in the dissertations mirror social networks in the educated stratum of Sweden in various periods of time.
Illustrations in the dissertations were often made by local artists or the students themselves, and the great mass of gratulatory poems mirrors the less well-known side of poetry in early modern Sweden. Our users The users of the physical items are primarily university scholars, mostly from our own University, but there is also quite a great deal of interest from abroad, not least from our neighboring country Finland and from the Baltic States, which were for some time within the Swedish realm. Many projects are going on right now, Swedish as well as international, which include our dissertations as research material or have them as their primary source material. As Sweden, as a part of learned Europe, more or less shared the values, objects and methods of the Western academic world as a whole, to study Swedish science and scholarship is to study an important part of Western science and scholarship. As for who uses our digital dissertations, we in fact do not know. The great majority of the dissertations are written in Latin, as in all countries of Europe and North America, Latin was the vehicle for academic discussion in the early modern age. In the first half of the 19th century, Swedish became more common in the Uppsala dissertations. Among the ones digitized and published so far, a great deal are in Swedish. As for the Latin ones, they too are clearly much used. Although knowledge of Latin is quite unusual in Sweden, foreign scholars in the various fields of history often had Latin as part of their curriculum. Obviously, our users know at least enough Latin to recognize whether a passage treats the topic of their interest. They can also identify which documents are important to them and extract the most important information from them. If a document is central, it is possible to hire a translator. But we believe that we also reach out to lifelong learners, or so-called "ordinary people". The older dissertations examine every conceivable subject and they offer pleasant reading even for non-specialists, or people who use the Internet for genealogical research. The full-text publication makes a dissertation show up, perhaps unexpectedly, when a person is looking for a certain topic or a certain word. Whoever the users are, the digital publication of the dissertations has been well received, far beyond expectations. The first three test years, with approximately 2,500 digitized dissertations published, resulted in close to one million visits and over 170,000 downloads, i.e. over 4,700 per month – even though we don't, or perhaps because we don't, either offer or demand advanced technologies for the use of these dissertations. The digital publication and the new possibilities for research The database in which the dissertations are stored and presented is the same database in which researchers, scholars and students of Uppsala University, and other Swedish universities, too, currently register their publications with the option to publish them digitally. This clears a path for new possibilities for researchers to become aware of and study the texts. Most importantly, it enables users to find documents in their field, spanning a period of 400 years, in one search session. A great deal of the medical terms for diseases and body parts, chemical designations, and, of course, juridical and botanical terms are Latin and the same as were used 400 years ago, and can thus be used for localizing text passages on these topics. But the form of the text can be studied, too.
Linguists would find it useful to make quantitative studies of the use of certain words or expressions, or just to find the words of interest for further studies. The usefulness of full-text databases is well known to us all. But as a user one often gets either a well-working search system or a great mass of important texts, and seldom both. This problem is solved here by the interconnection between the publication database DiVA and the Swedish National Research Library System LIBRIS. The combination makes it possible to use an advanced search system with high functionality, thus reducing the Internet problem of too many irrelevant hits. It gives direct access to the digital full text in DiVA, and the option to order the physical book if the scholar needs to see the original at our library. Not least important, there is qualified staff appointed to care for the system's long-term maintenance and updates, as part of their everyday tasks at the University Library. Also, the library is open for discussion with users. The practical work within the project and related issues As part of the digitization project, the images of the text pages are OCR-processed in order to create searchable full-text pdf files. The OCR process gives varying results depending on the age and the language of the text. The OCR processing of dissertations in Swedish and Latin from ca. 1800 onwards results in OCR texts with a high degree of accuracy, that is, between 80 and 90 per cent, whereas older dissertations in Latin and in languages written in other alphabets will contain more inaccuracies. On this point we are not satisfied. Nearly perfect results when it comes to the OCR-read text, or proof-reading, are a basic requirement for the full use and potential of this material. However, in this respect, we are dependent upon the technology which is available on the market, as this provides the best and safest product. These products were not developed for handling printing types of various sorts and sizes from the 17th and 18th centuries, and the development of these techniques, except when it comes to "Fraktur", is slow or non-existent. If you want to pursue further studies of the documents, you can download the documents for free to your own computer. There are free programs on the Internet that help you merge several documents of your choice into one document, in order for you to be able to search through a certain mass of text. If you are searching for something very particular, you could of course also perform a word search in Google. One of our wishes for the future is to make it possible for our users to search in several documents of their specific choice at one time, without them having to download the documents to their computer. So, the most important things for us today within the dissertation project are: 1) better OCR for older texts, and 2) easier ways to search in a large text mass of one's own choice. Future use and collaboration with scholars and researchers The development of digital techniques for the further use of these texts is a future desideratum. We therefore aim to increase our collaboration with researchers who want to explore new methods to make more out of the texts. However, we always have to take into account the special demands from society when it comes to the work we, as an institute of the state, are conducting – in contrast to the work conducted by e.g. Google Books or research projects with temporary funding.
We are expected to produce both images and metadata of a reasonably high quality – a product that the University can 'stand for'. What we produce should have lasting value – and ideally be possible to use for centuries to come. What we produce should be compatible with other existing retrieval systems and library systems. Important, in my opinion, are reliability and citability. A great problem with research on born-digital material is that it constantly changes, with respect to both its contents and where to find it. This puts the fundamental principle of modern science, the possibility to check results, out of the running. This is a challenge for Digital Humanities which, with the current pace of development, surely will be solved in the near future.
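The kind of local workflow hinted at above (download a set of dissertations, then search them as one mass of text) can be sketched as follows; the folder layout is hypothetical and pypdf is simply one library able to read the OCR text layer:

```python
# Extract the embedded OCR text from downloaded dissertation PDFs and
# search across all of them at once.
from pathlib import Path
from pypdf import PdfReader

def load_texts(folder):
    texts = {}
    for pdf in Path(folder).glob("*.pdf"):
        reader = PdfReader(pdf)
        texts[pdf.name] = "\n".join(page.extract_text() or ""
                                    for page in reader.pages)
    return texts

texts = load_texts("dissertations")
query = "febris"  # Latin terms stay usable across 400 years of texts
for name, text in texts.items():
    if query in text.lower():
        print(name)
```

Late-Breaking Work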
Normalizing Early English Letters for Neologism Retrieval University of Helsinki Introduction Our project studies social aspects of innovative vocabulary use in early English letters. In this abstract we describe the current state of our method for detecting neologisms. The problem we are facing at the moment is the fact that our corpus consists of non-normalized text. Therefore, spelling normalization is the first problem we need to solve before we can apply automatic methods to the whole corpus. Corpus We use the CEEC (Corpora of Early English Correspondence) [9] as the corpus for our research. The corpus consists of letters ranging from the 15th century to the 19th century and it represents a wide social spectrum, richly documented in the metadata associated with the corpus, including information on e.g. socioeconomic status, gender, age, domicile and the relationship between the writer and recipient. Finding Neologisms In order to find neologisms, we use the information on the earliest attestation of words recorded in the Oxford English Dictionary (OED) [10]. Each lemma in the OED has information about its attestations, but also variant spelling forms and inflections. We proceed in automatically finding neologism candidates as follows. We get a list of all the individual words in the corpus, and we retrieve their earliest attestation from the OED. If we find a letter where the word has been used before the earliest attestation recorded in the OED, we are dealing with a possible neologism, such as the word "monotonous" in (1), which antedates the first attestation date given in the OED by two years (1774 vs. 1776). (1) How I shall accent & express, after having been so long cramped with the monotonous impotence of a harpsichord! (Thomas Twining to Charles Burney, 1774; TWINING_017) The problem, however, is that our corpus consists of texts written in different time periods, which means that there is a wide range of alternative spellings for words. Therefore, a great part of the corpus cannot be directly mapped to the OED. Normalizing with the Existing Methods Part of the CEEC (from the 16th century onwards) has been normalized with VARD2 [3] in a semi-automated manner; however, the automatic normalization is only applied to sufficiently frequent words, whereas neologisms are often rare words. We take these normalizations and extrapolate them over the whole corpus. We also used MorphAdorner [5] to produce normalizations for the words in the corpus. After this, we compared the newly normalized forms with those in the OED, taking into account the variant forms listed in the OED. NLTK's [4] lemmatizer was used to produce lemmas from the normalized inflected forms to map them to the OED. In doing so, we were able to map 65,848 word forms of the corpus to the OED. However, around 85,362 word forms still remain without a mapping to the OED. Different Approaches For the remaining non-normalized words, we have tried a number of different approaches: hand-written rules, SMT, NMT, and a combination of edit distance, semantics and pronunciation. The simplest of them is running the hand-written VARD2 normalization rules over the whole corpus. These are simple replacement rules that replace a sequence of characters with another one either at the beginning, end or middle of a word. An example of such a rule is replacing "yes" with "ies" at the end of a word. We have also trained a statistical machine translation model (with Moses [7]) and a neural machine translation model (with OpenNMT [6]).
SMT has previously been used for the normalization task, for example in [11]. Both of the models are character-based, treating the known non-normalized to normalized word pairs as two languages for the translation model. The language model used for the SMT model is the British National Corpus (BNC) [1]. One more approach we have tried is to compare the non-normalized words to the ones in the BNC by Levenshtein edit distance [8]. This results in long lists of normalization candidates, which we filter further by semantic similarity: we compare the lists of the two words appearing immediately before and after the non-normalized word and each normalization candidate, picking out the candidates with the largest number of shared contextual words. Finally, we filter this list by the edit distance of Soundex pronunciations. A similar method [2], relying on semantics and edit distance, has been used for normalization in the past. The Open Question The methods described above produce results of varying degrees of success. However, none of them is reliable enough to be trusted above the rest. We are now in a situation in which at least one of the approaches finds the correct normalization most of the time. The next unsolved question is how to pick the correct normalization from the list of alternatives in an accurate way. Once normalization has been solved, we face another problem, which is mapping words to the OED correctly. For example, currently the verb "to moon" is mapped to the noun "mooning" recorded in the OED because it appeared in the present participle form in the corpus. This means that in the future, we have to come up with ways to tackle not only the problem of homonyms, but also the problem of polysemy. A word might have acquired a new meaning in one of our letters, but we cannot detect this word as a neologism candidate, because the word has existed in the language with a different meaning before. References 1. The British National Corpus, version 3 (BNC XML Edition). Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium (2007), http://www.natcorp.ox.ac.uk/ 2. Amoia, M., Martinez, J.M.: Using comparable collections of historical texts for building a diachronic dictionary for spelling normalization. In: Proceedings of the 7th workshop on language technology for cultural heritage, social sciences, and humanities. pp. 84–89 (2013) 3. Baron, A., Rayson, P.: VARD2: a tool for dealing with spelling variation in historical corpora (2008) 4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly Media (2009) 5. Burns, P.R.: MorphAdorner v2: A java library for the morphological adornment of English language texts. Northwestern University, Evanston, IL (2013) 6. Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.M.: OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints 7. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. pp. 177–180. Association for Computational Linguistics (2007) 8. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady. vol. 10, pp. 707–710 (1966) 9.
Nevalainen, T., Raumolin-Brunberg, H., Keränen, J., Nevala, M., Nurmi, A., Palander-Collin, M.: CEEC, Corpus of Early English Correspondence. Department of Modern Languages, University of Helsinki, http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/ 10. OED: OED Online. Oxford University Press, http://www.oed.com/ 11. Pettersson, E., Megyesi, B., Tiedemann, J.: An SMT approach to automatic annotation of historical text. In: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013; May 22-24; 2013; Oslo; Norway. NEALT Proceedings Series 18. pp. 54–69. No. 087, Linköping University Electronic Press (2013)
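The candidate-filtering cascade described above can be outlined as follows. The jellyfish library is used here purely for convenience (it provides both Levenshtein distance and Soundex); the toy lexicon and contexts are assumptions, not the project's data:

```python
# Rank normalization candidates by edit distance, then context overlap,
# then agreement of Soundex codes.
import jellyfish

modern_lexicon = ["monotonous", "mountainous", "monotone"]
contexts = {  # words observed immediately before/after each lexicon word
    "monotonous": {"the", "impotence"},
    "mountainous": {"a", "region"},
    "monotone": {"a", "voice"},
}

def candidates(word, observed_context, max_dist=3):
    # 1) edit-distance shortlist against the modern lexicon
    shortlist = [w for w in modern_lexicon
                 if jellyfish.levenshtein_distance(word, w) <= max_dist]
    # 2) prefer candidates sharing contextual words with the source
    shortlist.sort(key=lambda w: -len(contexts.get(w, set()) & observed_context))
    # 3) stable re-sort: candidates with a matching Soundex code first
    return sorted(shortlist,
                  key=lambda w: jellyfish.soundex(w) != jellyfish.soundex(word))

print(candidates("monotonouse", {"the", "impotence"}))
```

Late-Breaking Work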
Triadic closure amplifies homophily in social networks 1Aalto University, Finland; 2Next Games, Finland Much of the structure in social networks can be explained by two seemingly separate network evolution mechanisms: triadic closure and homophily. While it is typical to analyse these mechanisms separately, empirical studies suggest that their dynamic interplay can be responsible for the striking homophily patterns seen in real social networks. By defining a network model with a tunable amount of homophily and triadic closure, we find that their interplay produces a myriad of effects, such as amplification of latent homophily and memory in social networks (hysteresis). We use empirical network datasets to estimate how much of the observed homophily could actually be an amplification induced by triadic closure, and whether the networks have reached a stable state in terms of their homophily. Beyond their role in characterizing the origins of homophily, our results may be useful in determining the processes by which structural constraints and personal preferences determine the shape and evolution of society.
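A toy rendering of such a model, with tunable homophily and triadic closure, might look like the sketch below; the update rule and parameter values are illustrative guesses, not the authors' model:

```python
# Rewire a random graph with homophilous tie acceptance and a bias
# towards closing triangles, then measure the share of same-group ties.
import random
import networkx as nx

random.seed(0)
n, s, c = 100, 0.7, 0.5     # nodes, homophily bias, closure probability
G = nx.gnm_random_graph(n, 300, seed=0)
group = {v: v % 2 for v in G}  # two equal-sized groups

for _ in range(5000):
    u = random.choice(list(G))
    nbrs = list(G[u])
    if nbrs and random.random() < c:
        w = random.choice(list(G[random.choice(nbrs)]))  # friend of a friend
    else:
        w = random.choice(list(G))                       # random stranger
    if w != u and not G.has_edge(u, w):
        if group[u] == group[w] or random.random() > s:  # homophilous acceptance
            G.add_edge(u, w)
            if nbrs:
                G.remove_edge(u, random.choice(nbrs))    # keep density fixed

same = sum(group[a] == group[b] for a, b in G.edges()) / G.number_of_edges()
print("share of same-group ties:", round(same, 2))
```

|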
2:30pm - 3:45pm | Plenary 4: Frans Mäyrä Session Chair: Eetu Mäkelä Game Culture Studies as Multidisciplinary (Digital) Cultural Studies. Watchable also remotely from PII, PIV and P674. |
Think Corner | |
4:00pm - 5:30pm | F-PII-2: Computational Linguistics 2 Session Chair: Risto Vilkko |
PII | |
|
4:00pm - 4:30pm
Long Paper (20+10min) [publication ready] Verifying the Consistency of the Digitized Indo-European Sound Law System Generating the Data of the 120 Most Archaic Languages from Proto-Indo-European 1University of Helsinki; 2University of Colorado Boulder Using state-of-the-art finite-state technology (FST) we automatically generate data for some 120 of the most archaic Indo-European (IE) languages from reconstructed Proto-Indo-European (PIE) by means of digitized sound laws. The accuracy rate of the automatic generation of the data exceeds 99%, which also applies to the generation of new data that were not observed when the rules representing the sound laws were originally compiled. After testing and verifying the consistency of the sound law system with regard to the IE data and the PIE reconstruction, we report the following results: a) The consistency of the digitized sound law system generating the data of the 120 most archaic Indo-European languages from Proto-Indo-European is verifiable. b) The primary objective of Indo-European linguistics, a reconstruction theory of PIE in essence equivalent to the IE data (except for a limited set of open research problems), has been provably achieved. The results are fully explicit, repeatable, and verifiable.
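To give a flavour of what a digitized sound-law cascade does, the sketch below applies ordered rewrite rules to a reconstructed form. The rules are textbook-style fragments of Grimm's law for Proto-Germanic, used here only as an illustration, not the paper's actual FST rule set:

```python
# Apply sound laws in chronological order to derive a daughter form.
import re

SOUND_LAWS = [
    (r"p", "f"),   # PIE *p > Germanic f
    (r"t", "θ"),   # PIE *t > Germanic θ
    (r"k", "h"),   # PIE *k > Germanic h
]

def derive(pie_form):
    form = pie_form
    for pattern, replacement in SOUND_LAWS:
        form = re.sub(pattern, replacement, form)
    return form

print(derive("*pater"))  # -> *faθer, cf. English 'father'
```

In the paper itself the rules are implemented as finite-state transducers, which also handle contexts and rule interactions that this toy cascade ignores.

4:30pm - 4:45pm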
Short Paper (10+5min) [publication ready] Towards Topic Modeling Swedish Housing Policies: Using Linguistically Informed Topic Modeling to Explore Public Discourse 1Gothenburg university; 2Graduate School of Education, Stanford University This study examines how one can apply topic modeling to explore the public discourse on Swedish housing policies, as represented by documents from the Swedish parliament and Swedish news texts. This area is relevant to study because of the current housing crisis in Sweden. Topic modeling is an unsupervised method for finding topics in large collections of data, which makes it suitable for examining public discourse. However, most studies which employ topic modeling make little use of linguistic information when preprocessing the data. Therefore, this work also investigates what effect linguistically informed preprocessing has on topic modeling. Through human evaluation, filtering the data based on part of speech is found to have the largest effect on topic quality. Non-lemmatized topics are found to be rated higher than lemmatized topics. Topics from the filters based on dependency relations are found to have low ratings.
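The linguistically informed preprocessing compared in the paper can be sketched as follows; the pipeline name and the decision to keep nouns, adjectives and verbs are illustrative assumptions:

```python
# Keep only selected parts of speech (optionally lemmatized) before
# topic modelling.
import spacy

nlp = spacy.load("sv_core_news_sm")  # a Swedish pipeline, assumed installed
KEEP = {"NOUN", "ADJ", "VERB"}

def preprocess(text, lemmatize=False):
    doc = nlp(text)
    return [tok.lemma_ if lemmatize else tok.text.lower()
            for tok in doc if tok.pos_ in KEEP and tok.is_alpha]

print(preprocess("Bostadskrisen i Sverige kräver nya politiska beslut."))
```

4:45pm - 5:00pm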
Short Paper (10+5min) [abstract] Embedded words in the historiography of technology and industry, 1931–2016 University of Umeå, Sweden From 1931 to 2016 the Swedish National Museum of Science and Technology published a yearbook, Dædalus. The 86 volumes display a great diversity of industrial heritage and cultures of technology. The first volumes were centered on heavy industry, such as the mining and paper plants located in northern and central Sweden. The last volumes were dedicated to technologies and products in people's everyday lives – lipsticks, microwave ovens, and skateboards. Over the years Dædalus covered topics reaching from individual inventors to world fairs, media technologies from print to computers, and agricultural developments from ancient farming tools to modern DNA analysis. The yearbook presents the history of industry, technology and science, but can also be read as a historiographical source reflecting shifting approaches to history over an 80-year period. Dædalus was recently digitized and can now be analyzed with the help of digital methods. The aim of this paper is twofold: to explore the possibilities of word embedding models within a humanities framework, and to examine the Dædalus yearbook as a historiographical source with such a model. What we will present is work in progress with no definitive findings to show at the time of writing. Yet, we have a general idea of what we would like to accomplish. Analyzing the yearbook as a historiographical source means that we are interested in what kinds of histories it represents, its focus and bias. We follow Ben Schmidt's (admittedly simplified) suggestion that word embedding models for textual analysis can be viewed and used as supervised topic model tools (Schmidt, 2015). If words are defined by the distribution of the vocabulary of their contexts, we can calculate relations between words and explore fields of related words as well as binary relations in order to analyze their meaning. Simple – and yet fundamental – questions can be asked: What is "technology" in the context of the yearbook? What is "industry"? Of special interest in the case of industrial and technological history are binaries such as rural/urban, man/woman, industry/handicraft, production/consumption, and nature/culture. Which words are close to "man", and which are close to "woman"? Which aspects of the history of technology and industry are related to "production" and which are related to "consumption"? Word embedding is a comparatively new set of tools and techniques within data science (NLP) which have in common that the words in the vocabulary of a corpus (or several corpora) are assigned numerical representations through some computation, of which there is a wide variety. In most cases, this comes down to not only mapping the words to numerical vectors, but doing so in such a way that the numerical values in the vectors reflect the contextual similarities between words. The computations are based on the distributional hypothesis stemming from Zellig Harris (1954), implying that "words which are similar in meaning occur in similar contexts" (Rubenstein & Goodenough, 1965). The words are embedded (positioned) in a high-dimensional space, each word represented by a vector in the space, i.e. a simple representational model based on linear algebra.
The dimension of the space is defined by the size of the vectors, and the similarity between words then becomes a matter of computing the difference between vectors in this space, for instance the difference in (Euclidean) distance or the difference in direction between the vectors (cosine similarity). Within vector space models the latter is the most popular, under the assumption that related words tend to have similar directions. The arguably most prominent and popular of these algorithms, and the one that we have used, is the skip-gram model Word2Vec (Mikolov et al, 2013). In short, this model uses a neural network to compute the word vectors as results from training the network to predict the probabilities of all the words in a vocabulary being nearby (as defined by a window size) a certain word in focus. An early evaluation shows that the model works fine. Standard calculations often used to evaluate performance and accuracy indicate that we have implemented the model correctly – we can indeed get the correct answers to equations such as "Paris - France + Italy = Rome" (Mikolov et al, 2013). In our case we were looking for "most_similar(positive=['sverige','oslo'], negative=['stockholm'])". And the "most similar" was "norge". We have also explored simple word similarity in order to evaluate the model and get a better understanding of our corpus. What remains to be done is to identify relevant words (or groups of words) that can be used when we are examining "topics" and binary dimensions in the corpus. We are also experimenting with different ways to cluster and visualize the data. Although some work remains to be done, we will definitely have results to present at the time of the conference. Harris, Zellig (1954). Distributional structure. Word, 10(23):146–162. Mikolov, Tomas, Chen, Kai, Corrado, Greg & Dean, Jeffrey (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781. Rubenstein, Herbert & Goodenough, John (1965). Contextual Correlates of Synonymy. Communications of the ACM, 8(10): 627-633. Schmidt, Ben (2015). Word Embeddings for the digital humanities. Blog post at http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html.
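A compact sketch of this workflow, using gensim's Word2Vec on stand-in sentences, is given below; the toy data and hyperparameters are assumptions made for illustration:

```python
# Train a skip-gram model and run an analogy query of the
# "Paris - France + Italy" type discussed above.
from gensim.models import Word2Vec

sentences = [
    ["gruvan", "producerade", "malm", "till", "industrin"],
    ["fabriken", "tillverkade", "papper", "av", "skogens", "råvara"],
    ["hushållet", "köpte", "en", "mikrovågsugn", "till", "köket"],
] * 50  # repeat the toy data so training has something to work with

model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1, seed=0)
print(model.wv.most_similar(positive=["gruvan", "papper"], negative=["malm"]))
```

5:00pm - 5:15pm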
Short Paper (10+5min) [abstract] Revisiting the authorship of Henry VIII's Assertio septem sacramentorum through computational authorship attribution University of Turku Undoubtedly, one of the great unsolved mysteries of Tudor history through the centuries has been the authorship of Henry VIII's famous treatise Assertio septem sacramentorum adversus Martinum Lutherum (1521). The question of its authorship intrigued contemporaries already in the 1520s. With Assertio, Henry VIII gained from the Pope the title Defender of the Faith, which the British monarchs still use. Because of the exceptional importance of the text, the question of its authorship is not irrelevant to the study of history. For various reasons and motivations of their own, many doubted the king's authorship. The discussion has continued to the present day. A number of possible authors have been named, Thomas More and John Fisher foremost among them. There is no clear consensus about the authorship in general – nor is there a clear agreement upon the extent of the King's role in the writing process in the cases where joint authorship is suggested. The most commonly shared conclusion indeed is that the King was more or less helped in the writing process and that the authorship of the work was thus shared at least to some degree: that is, even if Henry VIII was active in the writing of Assertio, he was not the sole author but was helped by someone or by a group of theological scholars. In the case of Assertio, the Academy of Finland funded consortium Profiling Premodern Authors (PROPREAU) has tackled the difficult Latin source situation and put effort into developing more efficient machine learning methods for authorship attribution in a case where large training corpora are not available. This paper will present the latest discoveries in the development of such tools and will report on the results. These will give historians tools for opening up a myriad of questions we have hitherto been unable to answer. It is of great significance for the whole discipline of history to be able to attach authors to texts that are anonymous or of disputed origin. Select Bibliography: Betteridge, Thomas: Writing Faith and Telling Tales: Literature, Politics, and Religion in the Work of Thomas More. University of Notre Dame Press 2013. Brown, J. Mainwaring: Henry VIII.'s Book, "Assertio Septem Sacramentorum," and the Royal Title of "Defender of the Faith". Transactions of the Royal Historical Society 1880, 243–261. Nitti, Silvana: Auctoritas: l'Assertio di Enrico VIII contro Lutero. Studi e testi del Rinascimento europeo. Edizioni di storia e letteratura 2005.
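One common authorship-attribution setup, character n-gram frequencies fed to a linear classifier, is sketched below as a generic illustration; it is not presented as the PROPREAU consortium's actual method, and the Latin snippets are toy stand-ins for candidate-author corpora:

```python
# Character n-gram profile + linear SVM: a standard stylometric baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = [
    "quid est enim aliud omnis sapientia",      # candidate author A
    "nam et ipsa scientia potestas est",        # candidate author A
    "de sacramentis ecclesiae septem dicimus",  # candidate author B
    "contra haereticos fidem defendimus",       # candidate author B
]
train_authors = ["A", "A", "B", "B"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(train_texts, train_authors)
print(model.predict(["assertionem sacramentorum contra Lutherum scripsimus"]))
```

|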
4:00pm - 5:30pm | F-PIV-2: Digital History Session Chair: Mikko Tolonen |
PIV | |
|
4:00pm - 4:30pm
Long Paper (20+10min) [publication ready] Historical Networks and Identity Formation: Digital Representation of Statistical and Geo-Data to Mobilize Knowledge. Case Study of Norwegian Migration to the USA (1870-1920) National Library of Norway, The article is the result of a collaborative interdisciplinary workshop which involved expertise from the social sciences, history and digital humanities. It showed how computer-mediated ways of researching the historical networks and identity formation of Norwegian-Americans substantially complemented historical and social-science methods. Using the open API of the National Archives of Norway, we combined statistical, geo- and text data to produce an interactive temporal visualization of Norwegian regional origins on a map of the USA. Spatial visualization allowed us to highlight space and time, and changing regional belonging, as fundamental values for understanding the social and cultural dimensions of migrants’ lives. We claim that data visualizations of space and time have performative materiality (Drucker 2013). They open room for researchers to come up with their own narratives about the studied phenomenon (Perez and Granger 2015). Visualizations make us reflect on the relationship between the phenomenon and its representation (Klein 2014). This digital method supplements classical sociological and socio-constructivist methods and therefore has knowledge-mobilizing effects. In the article, we show what potential this visualization has for the particular field of emigration studies when it enters into dialogue with the existing historical research in the field.
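As an illustration of the kind of visualization described – a sketch under stated assumptions, not the project’s actual code – aggregated counts of migrants by Norwegian region of origin could be plotted onto a US map with folium. The data records below are invented, and no API endpoints are shown, since the abstract does not specify them:

```python
# Illustrative sketch only: plotting emigrant counts by Norwegian region of
# origin onto a US map with folium. The records are hypothetical; the actual
# project pulls data from the National Archives of Norway's open API, whose
# endpoints are not specified in the abstract.
import folium

# (US settlement point, Norwegian region of origin, decade, migrant count)
records = [
    ((44.98, -93.27), "Sogn og Fjordane", 1880, 1200),  # Minneapolis
    ((43.07, -89.40), "Telemark",         1870,  800),  # Madison
]

m = folium.Map(location=[43.0, -95.0], zoom_start=4)
for (lat, lon), region, decade, count in records:
    folium.CircleMarker(
        location=[lat, lon],
        radius=max(3, count / 200),  # scale marker size by migrant count
        popup=f"{region}, {decade}s: {count} migrants",
    ).add_to(m)
m.save("norwegian_migration.html")  # interactive temporal layers could follow
```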
4:30pm - 4:45pm
Short Paper (10+5min) [abstract] Spheres of “public” in eighteenth-century Britain 1University of Helsinki; 2University of Turku The eighteenth century saw a transformation in the practices of public discourse. With the emergence of clubs, associations, and, in particular, coffee houses, civic exchange intensified from the late seventeenth century. At the same time print media was transformed: book printing proliferated; new genres emerged (especially novels and small histories); works printed in smaller formats made reading more convenient (including in public); and periodicals - generally printed onto single folio half-sheets - emerged as a separate category of printed work which was written specifically for public consumption, and with the intention of influencing public discourse (such periodicals were intended to be both ephemeral and shared, often read, and then discussed, publicly each day). This paper studies how these changes may be recognized in language by quantitatively studying the word “public” and its semantic context in Eighteenth Century Collections Online (ECCO). While there are many descriptions of the transformation of public discourse (both contemporary and historical), there has been limited research into the language revolving (and evolving) around “public” in the eighteenth century. Jürgen Habermas (2003: 2-3) famously argues that the emergence of words such as “Öffentlichkeit” in German and “publicity” in English are indicative of a change in the public sphere more generally. The conceptual history of “Öffentlichkeit” has been further studied in depth by Lucian Hölscher (1978), but a systematic study of the semantic context of “public” in British eighteenth-century material is missing. Studies that have covered this topic, such as Gunn (1989), base their findings on a very limited set of source material. In contrast, this study, by using a large-scale digitized corpus, aims to supplement earlier studies that focus on individual speech acts or particular collections of sources, and to provide a more comprehensive account of how the language of “public” changed in the eighteenth century. The historical subject matter means that the study is based on the ECCO corpus. While ECCO is in many ways an invaluable resource, a key goal of this study is to be methodologically sound from the perspective of corpus linguistics and intellectual history, while developing insights which are relevant more generally to sociologists and historians. In this regard, ECCO does come with its own particular problems, both in terms of content and size. With regard to content: OCR mistakes remain problematic; its heterogeneity in genres can skew investigations; and the unpredictable nature of duplicate texts introduced by numerous reprints of certain volumes must be taken into account. However, many of these problems can be mitigated in different ways. For example, in specific cases we compare findings with the much smaller ECCO-TCP (an OCR-corrected subset of ECCO). We have further used the English Short Title Catalogue (ESTC) to connect textual findings with relevant metadata contained in the catalogue. By merging ESTC metadata with ECCO, one can more easily use existing historical knowledge (for example, issues around reprints and multiple editions) to engage with the corpus. With regard to size: the corpus itself is too big to run automatic parsers on.
We have therefore extracted a separate, smaller corpus (with the help of ESTC metadata) for more complex and demanding analyses. Results of these analyses were then replicated in a much simpler and cruder form on the whole dataset to gauge whether they corroborate the initial observations. The size constraints provide their own advantages, however. The smaller subsections were chosen to represent pamphlets and other similar short documents by extracting all documents with fewer than 10,406 characters. Compared to other specific genres or text types, this proved to be a successful method for defining a meaningful subcorpus, while at the same time limiting the effects of reprints and including a relatively large number of individual writers in the analysis. The subjects covered by pamphlets also tend to be historically topical, and as shorter texts, inspecting single occurrences in their original context is much more efficient, since things such as main theme, context, and the writer’s intentions reveal themselves more quickly than in larger works. Thus, issues around distant and close reading are more easily overcome. In addition, we are able to compare semantic change between the larger corpus and the more rapidly shifting topical and political debates found in pamphlets, which offers its own historical insights. In terms of specific linguistic approaches, analysis started with examinations of the contextual distributions of “public” by year. Then, by changing the parameters of this analysis (for example, by defining the context as a set of syntactic dependencies engaged by “public”, or as collocation structures of a wider lexical environment), different aspects of the use of “public” can be brought to the foreground. As syntactic constraints govern the possibilities of combining words within shorter ranges of context, the narrower context windows contain a lot of syntactic information in addition to collocational information. Because of this syntactic restrictedness of close-range combinations, the semantic relatedness of words with similar short-range context distributions is one of degree of mutual interchangeability and, as such, of metaphorical relatedness (Heylen, Peirsman, Geeraerts and Speelman 2008). Wider context windows, such as paragraphs, are free from syntactic constraints, and so semantic relatedness between two words with similar wide-range context distributions carries information from frequent contiguity in context; it can be described as more metonymical than metaphorical by nature, as is visible in applications based on term-document matrices, such as topic modelling or Latent Semantic Analysis (cf. Blei, Ng and Jordan (2003) and Dumais (2005)). The syntactic dependencies were counted by analysing the pamphlet subcorpus with the Stanford parser (Chen and Manning 2014). Results show changes in the tendency to use “public” as an adjectival attribute and in compound positions. Since in English the overwhelmingly most frequent position for both adjectival and compounding attributes is preceding the head word, this analysis could be adequately replicated using bigrams in the whole dataset. Lexical environments have been analysed by clustering second-order collocations (cf. Bertels and Speelman (2014)) and replicated by using random sampling from the whole dataset to produce the second-order vectors.
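To make the bigram replication step concrete, a minimal sketch follows; it assumes tokenized, lowercased documents paired with a publication year (the actual ECCO preprocessing is not described in the abstract) and counts the head words immediately following “public”, tracking their relative frequency by year:

```python
# Minimal sketch of the bigram replication step: counting head words that
# immediately follow "public" and their relative frequency per year.
# The input format (year, token list) is an assumption for illustration.
from collections import Counter, defaultdict

def public_bigrams_by_year(docs):
    """docs: iterable of (year, list-of-lowercase-tokens) pairs."""
    counts = defaultdict(Counter)   # year -> Counter of following words
    totals = defaultdict(int)       # year -> total "public" occurrences
    for year, tokens in docs:
        for w1, w2 in zip(tokens, tokens[1:]):
            if w1 == "public":
                counts[year][w2] += 1
                totals[year] += 1
    return counts, totals

# Hypothetical usage with two toy documents:
docs = [(1701, "the public good is served".split()),
        (1790, "public opinion now governs".split())]
counts, totals = public_bigrams_by_year(docs)
for year in sorted(counts):
    for word, n in counts[year].most_common(3):
        print(year, f"public {word}", n / totals[year])  # relative frequency
```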
The study of all bigrams relating to “public” (such as “public opinion”, “public finances”, “public religion”) in ECCO provides for a broader analysis of the use of “public” in eighteenth-century discourse, one that not only focuses on particular compounds but gives a better idea of the domains in which “public” was used. It points towards a declining trend in the relative frequency of religious bigrams during the course of the eighteenth century and a rise in the relative frequency of secular bigrams - both political and economic. This allows us to present three arguments. First, it is argued that this is indicative of an overall shift in the language around “public” as the concept’s focus changed and it began to be used in new domains. This expansion of the discourses or domains in which “public” was used is confirmed in the analyses of the wider lexical environment. Second, we also notice that some collocates of “public”, such as “public opinion” and “public good”, gained a stronger rhetorical appeal. They became tropes in their own right and acquired a future orientation in the political discourse of the latter half of the eighteenth century (Koselleck 1972). Third, by combining the results of the distributional semantics of “public” in ECCO with information extracted from the ESTC, one can recognize how different groups used the language relating to “public” in different ways. For example, authors writing on religious topics tended to use “public” differently from authors associated with the enlightenment in Scotland or France. There are two important upshots to this study: the methodological and the historical. With regard to the former, the paper works as a convincing case study which could serve as an example, or workflow, for studying other words that are pivotal to large structural change. With regard to the latter, the work is of particular historical relevance to recent discussions in eighteenth-century intellectual history. In particular, the study contributes to the critical discussion of Habermas that has been taking place in the English-speaking world since the translation of his Structural Transformation of the Public Sphere in 1989, while also informing more traditional historical analyses which have not been able to draw tools from the digital humanities (Hill 2017). References Bertels, Ann and Dirk Speelman (2014). “Clustering for semantic purposes. Exploration of semantic similarity in a technical corpus.” Terminology 20:2, pp. 279–303. John Benjamins Publishing Company. Blei, David, Andrew Y. Ng and Michael I. Jordan (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (4–5), pp. 993–1022. Chen, Danqi and Christopher D. Manning (2014). “A Fast and Accurate Dependency Parser using Neural Networks.” Proceedings of EMNLP 2014. Dumais, Susan T. (2005). “Latent Semantic Analysis.” Annual Review of Information Science and Technology 38: 188–230. Gunn, J.A.W. (1989). “Public opinion.” Political Innovation and Conceptual Change (edited by Terence Ball, James Farr & Russell L. Hanson). Cambridge: Cambridge University Press. Habermas, Jürgen (2003 [1962]). The Structural Transformation of the Public Sphere: An Inquiry into a Category of Bourgeois Society. Cambridge: Polity. Heylen, Kris, Yves Peirsman, Dirk Geeraerts and Dirk Speelman (2008). “Modelling Word Similarity: An Evaluation of Automatic Synonymy Extraction Algorithms.” Proceedings of LREC 2008. Hill, Mark J.
(2017), “Invisible interpretations: reflections on the digital humanities and intellectual history.” Global Intellectual History 1.2, pp. 130–150. Hölscher, Lucian (1978), “Öffentlichkeit.” Otto Brunner et al. (Hrsg.), Geschichtliche Grundbegriffe. Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. Band 4, Stuttgart: Klett-Cotta, pp. 413–467. Koselleck, Reinhart (1972), “Einleitung.” Otto Brunner, Werner Conze & Reinhart Koselleck (Hrsg.), Geschichtliche Grundbegriffe. Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. Band I, Stuttgart: Klett-Cotta, pp. XIII–XXVII. 4:45pm - 5:00pm
Short Paper (10+5min) [abstract] Charting the ‘Culture’ of Cultural Treaties: Digital Humanities approaches to the history of international ideas Uppsala University Cultural treaties are the bilateral or sometimes multilateral agreements among states that promote and regulate cooperation and exchange in the fields of life we call cultural or intellectual. Pioneered by France just after World War I, this type of treaty represents a distinctive technology of modern international relations, a tool in the toolkit of public diplomacy, a vector of “soft power.” One goal of a comparative examination of these treaties is to locate them in the history of public diplomacy and in the broader history of culture and power in the international arena. But these treaties can also serve as sources for the study of what the historian David Armitage has called “the intellectual history of the international.” In this project, I use digital humanities methods to approach cultural treaties as a historical source with which to explore the emergence of a global concept of culture in the twentieth century. Specifically, the project will investigate the hypothesis that the culture concept, in contrast to earlier ideas of civilization, played a key role in the consolidation of the post-World War II international order. I approach the topic by charting how concepts of culture were given form in the system of international treaties between 1919 (when the first such treaty was signed) and 1972 (when UNESCO’s Convention on cultural heritage marked the “arrival” of a global embrace of the culture concept), studying them with the large-scale, quantitative methods of the digital humanities, as well as with the tools of textual and conceptual analysis associated with the study of intellectual history. In my paper for DH Nordic 2018, I will outline the topic, goals, and methods of the project, focusing on the ways we (that is, my colleagues at Umeå University’s HUMlab and I) seek to apply DH approaches to this study of global intellectual history. The project uses computer-assisted quantitative analysis to analyze and visualize how cultural treaties contributed to the spread of cultural concepts and to the development of transnational cultural networks. We explore the source material offered by these treaties by approaching it as two distinct data sets. First, to chart the emergence of an international system of cultural treaties, we use quantitative analysis of the basic information, or “metadata” (countries, date, topic), from the complete set of treaties on cultural matters between 1919 and 1972, approximately 1,250 documents. Our source for this information is the World Treaty Index (www.worldtreatyindex.com). This data can also help identify historical patterns in the emergence of a global network of bilateral cultural treaties. Once mapped, these networks will allow me to pose interesting questions by comparing them to any number of other transnational systems. How, for example, does the map of cultural agreements compare to that of trade treaties, military alliances, or to the transnational flows of cultural goods, capital, or migrants? Second, to identify the development of concepts, we will observe the changing use of key terms through quantitative analysis of the treaty texts. By treating a large group of cultural treaties as several distinct text corpora and, perhaps, as a single text corpus, we will be able to explore the treaties using textometry and topic modeling.
The treaty texts (digital versions of most of which can be found online) will be limited to four subsets: a) Britain, France, and Italy, 1919-1972; b) India, 1947-1972; c) the German Reich (1919-1945) and the two German successor states (1949-1972); and d) UNESCO’s multilateral conventions (1945-1972). This selection is designed to approach a global perspective while taking into account practical factors, such as language and accessibility. Our use of text analysis seeks (a) to offer insight into the changing usage and meanings of concepts like “culture” and “civilization”; (b) to identify which key areas of cultural activity were regulated by the treaties over time and by world region; and (c) to clarify whether “culture” was used in a broad, anthropological sense, or in a narrower sense to refer to the realm of arts, music, and literature. This aspect of the project raises interesting challenges, for example regarding how best to handle a multilingual text corpus (with texts in English, French, and German, at least). In these ways, the project seeks to contribute to our understanding of how the concept of culture that guides today’s international society developed. It also explores how digital tools can help us ask (and eventually answer) questions in the field of global intellectual history.
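For the topic-modeling step mentioned above, a minimal gensim LDA sketch is given below; the treaty texts, preprocessing and parameters are placeholders, since the project’s actual setup is not specified in the abstract.

```python
# Hedged sketch of topic modeling (gensim LDA) over a hypothetical set of
# tokenized treaty texts; not the project's actual corpora or parameters.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

treaty_texts = [
    "exchange of professors students and works of art".split(),
    "protection of monuments and cultural heritage sites".split(),
]  # placeholder tokenized treaties

dictionary = Dictionary(treaty_texts)
corpus = [dictionary.doc2bow(text) for text in treaty_texts]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    # Inspect whether e.g. "culture" and "civilization" cluster differently.
    print(topic_id, words)
```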
5:00pm - 5:15pm
Short Paper (10+5min) [abstract] Facilitating Digital History in Finland: What can we learn from the past? Aalto University The paper discusses the findings of the project “From Roadmap to Roadshow: A collective demonstration & information project to strengthen Finnish digital history”, which develops the historical disciplines in Finland collaboratively and has received funding from the Kone Foundation. The paper proposed for DHN2018 will discuss what we have learned about the present-day conditions of digital history in Finland, how digital humanities is facilitated today in Finland and abroad, and what suggestions we could give for strengthening the conditions for doing digital history research in Finland. In the first phase of the project we conducted a survey among Finnish historians and identified several critical issues that require further development: creating better, up-to-date information channels for digital history resources and events; providing relevant education, skills, and teaching by historians; and helping historians and information technology specialists to meet and collaborate better and more systematically than before. Many historians also had issues with the concept of digital history and difficulties with such an identity. In order to situate Finnish digital history in its domestic and international contexts, we have studied the roots of computational history research in Finland, which date back to the 1960s, and best practices in how digital history is currently done internationally. We have visited selected digital humanities centers in Europe and the US which we have identified as having “done something right”. Based on these studies, visits and interviews, we will propose steps to be taken to further strengthen the digital history research community in Finland. |
4:00pm - 5:30pm | F-P674-2: Between the Manual and the Automatic Session Chair: Eero Hyvönen |
P674 | |
|
4:00pm - 4:15pm
Short Paper (10+5min) [publication ready] In search of Soviet wartime interpreters: triangulating manual and digital archive work University of Helsinki This paper demonstrates the methodological stages of searching for Soviet wartime interpreters of Finnish in the digital archival resource of the Russian Ministry of Defence called Pamyat Naroda (Memory of the People) 1941–1945. Since wartime interpreters do not have their own search category in the archive, other means are needed to detect them. The main argument of this paper is that conventional manual work must be done and some preliminary information obtained before entering the digital archive, especially when dealing with a marginal subject such as wartime interpreters. 4:15pm - 4:30pm
Distinguished Short Paper (10+5min) [abstract] Digital Humanities Meets Literary Studies: the Challenges for Estonian Scholarship 1Tallinn University; 2Estonian Literary Museum In recent years, the application of DH as a method of computerised analysis and the extensive digitisation of literary texts, making them accessible as open data and organising them into large text corpora, have made the relations between literature and information technology a hot topic. New directions in literary history link together literary analysis, computer technology and computational linguistics, offering new possibilities for studying authors’ style and language, analysing texts and visualising results. Alongside such mainstream uses, DH still contains several other important directions for literary studies. The aim of this paper is to examine the limits and possibilities of DH as a concept and to determine its suitability for literary research in the digital age. Our discussion is based, first, on twenty years of experience in digitally representing Estonian literary and cultural heritage and, second, on the synchronous study of digitally born literary forms; we shall also offer more representative examples. We shall also discuss the concept of DH from the viewpoint of literary studies; for example, we examine the ways of positioning digitally created literature (both “electronic literature” and literature born in social media) under this renewed concept. This problem was topical in the early 2000s, but in the following decade it was replaced by the broader ideas of intermedia and transmedia, which treated literary texts only as one medium among many others. What are the specific features of digital literature, what are its accompanying effects, and how has the role of the reader as recipient changed in the digital environment? These theoretical questions are also indirectly relevant for making the literature created in the era of printed books accessible as e-books or open data. Digitising older literature is the responsibility of memory institutions (libraries, archives, museums). Extensive digitising of texts at memory institutions seems to have been done to make reading more convenient – books can be read even on smartphones. Digitising works of fiction as part of projects for digitising cultural heritage has been carried out for more than twenty years. What is the relation of these virtual bookshelves to the digital humanities? We need to discover whether and how both digitally born literature and digitised literature from the era of printing affect literary theory. Our paper will also focus on mapping the different directions, practices and applications of DH in present-day literary theory. The topical question is how to bridge the gap between the research possibilities offered by present-day DH and the ever-increasing text resources produced by memory institutions. We encounter several problems. Literary scholars are used to working with texts, analysing them as undivided works of poetry, prose or drama. Using DH methods requires treating literary works or texts as data, which can be analysed and processed with computer programs (data mining, visualisation tools, etc.). These activities require the posing of new and quite different research questions in literary studies.
Susan Schreibman, Ray Siemens and John Unsworth, the editors of A New Companion to Digital Humanities (2016), discuss the problems of DH and point out in their foreword that it is still questioned whether DH should be considered a separate discipline or, rather, a set of different interlinked methods. In our paper we emphasise the diversity of DH as an academic field of research and discuss the other possibilities it can offer for literary research in addition to computational analyses of texts. In Estonia, research on electronic new media and the application of digital technology in the field of literary studies can be traced back to the second half of the 1990s. The analysis of social, cultural and creative effects (see Schreibman, Siemens, Unsworth 2016: xvii-xviii), as well as constant cooperation with the social sciences in research on Internet usage, has played an important role in Estonian literary studies. 4:30pm - 4:45pm
Short Paper (10+5min) [abstract] Digital humanities and environmental reporting in television during the Cold War: Methodological issues of exploring materials of the Estonian, Finnish, Swedish, Danish, and British broadcasting companies University of Turku, Degree Programme on Cultural Production and Landscape Studies Environmental history has so far relied on traditional archival and other related source materials. Despite the increasing availability of new digitized materials, studies in this field have not yet responded to these emerging opportunities in any particular way. The aim of the proposed paper is to discuss the possibilities and limitations embodied in the new digitized source materials in different European countries. The proposed paper is an outcome of a research project that explores the early days of television prior to Earth Day in 1970 and frames this exploration from an environmental perspective. The focus of the project is the reporting of environmental pollution and protection during the Cold War. In order to realize this study, the quantity and quality of related digitized and non-digitized source materials provided by the national broadcasting companies of Estonia (ETV), Finland (YLE), Sweden (SVT), Denmark (DR), and the United Kingdom (BBC) were examined. The main outcome of this international comparative study is that the quantity and quality of available materials vary greatly, sometimes surprisingly, between the examined countries, which belonged to different political spheres (Warsaw Pact, neutral, NATO) during the Cold War. 4:45pm - 5:00pm
Short Paper (10+5min) [abstract] Prosodic clashes between music and language – challenges of corpus-use and openness in the study of song texts University of Helsinki, In my talk I will discuss the relationship between linguistic and musical rhythm, and the connections to digital humanities and open science that arise in their study. My ongoing corpus research concerns the relationship between linguistic and musical segment length in songs, focusing on instances where the language has to adapt prosodically to the rhythmic frame provided by pre-existing music. More precisely, the study addresses the question of how syllable length and note length interact in music. To what extent can non-conformity between linguistic and musical segment length – clashes – be acceptable in song lyrics, and what other prosodic features, such as stress, may influence the occurrence of clashes in segment length? Addressing these questions with a corpus-based approach leads to questions of information retrieval from complicated corpora which combine two media (music and language), and of the openness and accessibility of music sources. In this abstract I first describe my research questions and the song corpus used in my study in section 1, and then discuss their relationship with the use, analysis and availability of corpora, and issues of open science, in section 2. 1. Research setting and corpus My study aims to approach the comparison of musical and linguistic rhythm by both qualitative and statistical methods. It is based on a self-collected song corpus in Finnish, a language in which syllable length has a versatile relationship with stress (cf. Hakulinen et al. 2004). Primary stress in Finnish is weight-insensitive and always falls on the first syllable of a word, and syllables of any length, long or short, can be stressed or unstressed. Finnish sound segment length is also phonemic, that is, it creates distinctions of meaning. Syllable length in Finnish is therefore of particular interest in a study of musical segment length, because length deviations play an evident role in language perception. Music and text can be combined into a composition in a number of ways, but my study focuses on the situations in which language is the most dependent on music. Usually there are three alternative orders in which music and language can be combined into songs. First, text and music may be written simultaneously and influence the musical and linguistic choices of the writer at the same time (Language <–> Music). Secondly, text can precede the music, as when composers compose a piece to existing poetry (Language –> Music). And finally, the melody may exist first, as when new versions of songs are created by translating or otherwise rewriting them to familiar tunes (Music –> Language). My research is concerned with this third relationship, because it poses the strongest constraints on the language user. The language (text) must conform to the music’s already existing rhythmic frame, which is in many respects inflexible; in such cases it is difficult to vary the rhythmic elements of the text, because the musical space restricts the rhythmic tools available to the language user. This in turn may lead to non-neutral linguistic output. Thus the crucial question arises: how does language adapt its rhythm to music? My corpus contains songs that clearly and transparently represent the relationship of music being created first and providing the rhythmic frame, and language having to adjust to that frame.
The pilot corpus consists of 15 songs and approximately 1,500 prosodically annotated syllables of song texts in Finnish, translated or otherwise adapted from different languages, or written to instrumental or traditional music. The genres include chansons, drinking songs, Christmas songs and hymns, which originate from different eras and languages, namely English, French, German, Swedish, and Italian. One data point in the table format of the corpus is a Finnish syllable, the prosodic properties of which I compare with the rhythm of the respective notes (musical length and stress). The most basic instance of a clash between segment lengths is one where a short syllable ((C)V in Finnish) falls on a long note (i.e. a note longer than a basic half-beat). Both theoretical and empirical evidence will be used to determine which length values create the clearest cases of prosodic clashes. A crucial presupposition when problematising the relationship between a musical form and the text written to it is the notion that a song is not poetry per se (I will return to this conception in section 2). The conventions of Western art music allow for a far greater range of length distinctions than language: syllable lengths usually fall into binary or ternary categories (e.g. short and long syllables), whereas in music notes can be elongated infinitely. A translated song in which all rhythmic restrictions come from the music may follow the lines of poetic traditions, but must deviate from them if the limits of space within the music do not allow for full flexibility. It is therefore an intermediate form of verbal art. 2. Challenges for digital humanities and open science The corpus-based approach to language and music poses problematic questions regarding digital humanities. The first of these is, of course, whether useful music-linguistic corpora can be found at all at present. Existing written and spoken corpora of the major European languages contain millions of words, often annotated in great linguistic detail (cf. Korp of Kielipankki for Finnish (korp.csc.fi), which offers detailed contextual, morphological and syntactic analysis). For music as well, digital music scores can be found “in a huge number” (Ponce de León et al. 2008: 560). Corpora of song texts with both linguistic and musical information seem to be more difficult to find. One problem of music-linguistic studies is the more restricted openness and shareability of sources compared with those of written or spoken language. The copyright questions of art are in general a more sensitive issue than, for instance, those of newspaper articles or internet conversations, and the reluctance of the owners of song texts and melodies may have made it difficult to create open corpora of contemporary music. But even with ownership problems aside (as with older or traditional music), building a music-linguistic corpus remains a difficult task to accomplish. A truly useful corpus of music for linguistic purposes would include metadata on both media, language and music alike. Thus even an automatically analysed metric corpus of poetry, like Anatoli Starostin’s Treeton for the metrical analysis of Russian poems (Pilshchikov & Starostin 2011) or the rhythmic Metricalizer for determining meter by stress patterns in German poems (Bobenhausen 2011), does not answer questions about the rhythm of a song text, which exists partly in an extra-linguistic medium altogether: music.
Vocal music is metrical, but it is not metrical in the strict sense of poetic conventions, with which it shares the isochronic base. Automated analysis of a song text without its music notation tells us nothing about its real metrical structure. On a technical level, one set of tools necessary for researchers of music comprises tools for the quick visualization of music passages (notation tools, sound recognition). Such software can be found and used freely on the internet and is useful for depiction purposes. Mining information from music requires more effort, but has been done in various projects, for instance for melody information retrieval (Ponce de León et al. 2008) or the metrical detection of notes (Temperley 2001). But again, these tools rarely seem to combine linguistic and musical meter simultaneously. By raising these questions I hope to bring attention to the challenges of studying texts in the musical domain, that is, not simply music or poetry separately. The crux of the issue is that for the linguistic analysis of song texts we need actual textual data where the musical domain appears as annotated metadata. Means exist to analyse text automatically, and to analyse musical patterns with sound recognition or otherwise, but combining the two raises the analysis to a more complicated level. Literature Blumenfeld, Lev. 2016. End-weight effects in verse and language. In: Studia Metrica et Poetica 3.1, pp. 7–32. Bobenhausen, Klemens. 2011. The Metricalizer – Automated Metrical Markup of German Poetry. In: Küper, C. (ed.), Current Trends in Metrical Analysis, pp. 119–131. Frankfurt am Main; New York: Peter Lang. Hayes, Bruce. 1995. Metrical Stress Theory: Principles and Case Studies. Chicago: The University of Chicago Press. Hakulinen et al. (eds.). 2004. Iso suomen kielioppi, pp. 44–48. Helsinki: Suomalaisen Kirjallisuuden Seura. Jeannin, M. 2008. Organizational Structures in Language and Music. In: The World of Music 50(1), pp. 5–16. Kiparsky, Paul. 2006. A modular metrics for folk verse. In: B. Elan Dresher & Nila Friedberg (eds.), Formal Approaches to Poetry: Recent Developments in Metrics, pp. 7–52. Berlin: Mouton de Gruyter. Lerdahl, Fred & Jackendoff, Ray. 1983. A Generative Theory of Tonal Music. Cambridge (MA): MIT Press. Lotz, John. 1960. Metric typology. In: Thomas Sebeok (ed.), Style in Language. Massachusetts: The M.I.T. Press. Palmer, Caroline & Kelly, Michael H. 1992. Linguistic Prosody and Musical Meter in Song. Journal of Memory and Language 31, pp. 525–542. Pilshchikov, Igor & Starostin, Anatoli. 2011. Automated Analysis of Poetic Texts and the Problem of Verse Meter. In: Küper, C. (ed.), Current Trends in Metrical Analysis, pp. 133–140. Frankfurt am Main; New York: Peter Lang. Ponce de León, Pedro J., Iñesta, José M. & Rizo, David. 2008. Mining Digital Music Score Collections: Melody Extraction and Genre Recognition. In: Peng-Yeng Yin (ed.), Pattern Recognition Techniques, Technology and Applications, pp. 626–. Vienna: I-Tech. Temperley, D. 2001. The Cognition of Basic Musical Structures. Cambridge, Mass.: MIT Press.
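The abstract’s basic clash definition – a short (C)V syllable aligned with a note longer than a basic half-beat – lends itself to a simple tabular check. The following is an illustrative sketch only; the column names, annotation values and beat units are assumptions about the corpus format, not the author’s actual schema.

```python
# Illustrative sketch of flagging the basic length clash defined above:
# a short (C)V syllable aligned with a note longer than a basic half-beat.
# Column names and beat values are assumed, not taken from the real corpus.
import pandas as pd

corpus = pd.DataFrame({
    "syllable":    ["jou", "lu", "puu"],
    "syll_length": ["long", "short", "long"],  # phonological length category
    "note_beats":  [1.0, 1.0, 0.5],            # note duration in beats
    "stressed":    [True, False, True],
})

HALF_BEAT = 0.5
# Clash: a short syllable carried by a note longer than the basic half-beat.
corpus["clash"] = (corpus["syll_length"] == "short") & \
                  (corpus["note_beats"] > HALF_BEAT)
print(corpus[corpus["clash"]])
print("clash rate:", corpus["clash"].mean())
```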
5:00pm - 5:15pm
Distinguished Short Paper (10+5min) [abstract] Finnish aesthetics in scientific databases Aalto University School of Arts, Design and Architecture The major academic databases, such as Web of Science and Scopus, are dominated by publications written in English, often by scholars affiliated with American and British universities. As such databases are repeatedly used as a basis for assessing and analyzing the activities and impact of universities and even individual scholars, there is a risk that everything published in other, especially minor, languages will be sidelined. Standard data-mining procedures do not notice them. Yet, especially in the humanities, other languages and cultures have an important role and scholars publish in various languages. The aim of this research project is to look critically into how Finnish aesthetics is represented in scientific databases. What kind of picture of Finnish aesthetics can we draw if we rely on the metadata from commonly used databases? We will address this general issue through one example: we will compare metadata from two different databases, in two different languages, English and Finnish, and form a picture of two different interpretations of an academic field, aesthetics – or estetiikka in Finnish. To achieve this we will employ citation analysis, as well as text summarization techniques, in order to understand the differences between the largest world scientific database, Scopus, and the largest Finnish one, Elektra. Moreover, we will identify the most influential Finnish aestheticians and analyze their publication records in order to understand to what extent the scientific databases can represent Finnish aesthetics. Through this, we will present 1) two different maps containing actors and works recognized in the field, and 2) an overview of the main topics from the two databases. For these goals, we will collect metadata from both the Scopus and Elektra databases and references from each relevant article. Relevant articles will be located by using the keyword “aesthetics” or its Finnish equivalent “estetiikka”, as well as by identifying scientific journals focusing on aesthetics. We will perform citation analysis to explore in which countries which publications are cited, based on Scopus data. This comparison will allow us to understand which works are the most prominent for different countries, as well as to find the countries in which those works were produced, e.g., works that are acknowledged by Finnish aestheticians according to the international database. In addition, the comparison will allow us to understand how Finnish aesthetics differs from that of other countries. Later, we will perform citation analysis with the data gathered from the Finnish scientific database Elektra. The results will indicate the distribution between cited Anglo-American texts and those written in Finland or in the Finnish language, so that we can understand which language-family sources Finnish aestheticians rely on in their works. Further, we will apply text summarization techniques to see the differences in the topics the two databases are discussing. Furthermore, we will collect a list of names of the most influential Finnish aestheticians and their works (as provided by the databases), and perform searches within the two databases to understand how much of their work is covered. As an additional contribution, we will develop an interactive web-based tool to represent the results of this research.
Such a tool will give aesthetics researchers an opportunity to explore the field of Finnish aesthetics through our established lenses and also to comment on possible gaps in the pictures offered by the databases. It is possible that the databases give only a very partial picture of the field, in which case new tools should be developed in cooperation with researchers. A similar situation may also hold in other subfields of the humanities where non-English activities are usual.
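For the citation-analysis step described above, a minimal sketch of building a citation graph from a Scopus metadata export with networkx follows; the CSV filename and column names are hypothetical, since the actual Scopus export fields (and the Elektra format) are not specified in the abstract.

```python
# Hedged sketch of the citation-analysis step: building a citation graph
# from a Scopus CSV export with networkx. The file name, column names and
# ';'-separated reference format are assumptions; a real export may differ,
# and Elektra data would need its own loader.
import pandas as pd
import networkx as nx

df = pd.read_csv("scopus_aesthetics.csv")  # hypothetical keyword export

G = nx.DiGraph()
for _, row in df.iterrows():
    citing = row["title"]
    for cited in str(row["references"]).split(";"):
        G.add_edge(citing, cited.strip())  # edge: citing work -> cited work

# The most-cited works are the nodes with the highest in-degree.
top = sorted(G.in_degree(), key=lambda pair: pair[1], reverse=True)[:10]
for work, n_citations in top:
    print(n_citations, work)
```
|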
4:00pm - 5:30pm | F-TC-2: Games as Culture Session Chair: Frans Mäyrä |
Think Corner | |
|
4:00pm - 4:15pm
Short Paper (10+5min) [abstract] The Science of Sub-creation: Transmedial World Building in Fantasy-Based MMORPGs University of Waterloo, The Games Institute, First Person Scholar My paper examines how virtual communities are created by fandoms in massively multiplayer online role-playing games; it explores what kinds of self-construction emerge in these digital locales and how such self-construction reciprocally affects the living culture of the game. I assert that the universe of a fantasy-based MMORPG necessitates participatory culture: experiencing the story means participating in the culture of the story’s world, and these experiences reciprocally affect the living culture of the game’s universe. The participation and investment of readers, viewers, and players in this world constitute what Carolyn Marvin calls a textual community, or a group that “organize[s] around a presumptively shared, but distinctly practiced, epistemology of texts and interpretive procedures” (12). In other words, the textual community produces a shared discourse, one that informs and interrogates what it means to be a fan in both analogue and digital environments. My paper uses J.R.R. Tolkien’s Middle-earth as a case study to explore the creation and continuation of a fantastic universe across mediums: a transmedial creation informed by its textual community. Building on the work of Mark J.P. Wolf, Colin B. Harvey, Celia Pearce, Matthew P. Miller, and Edward Castronova, my work reveals that the “worldness” of a transmedia universe, or the degree to which it exists as a complete and consistent cosmos, plays a core role in the production, acceptance, and continuation of its ontology among and across the fan communities respective to the mediums in which it operates. My paper argues that Tolkien’s literary texts and their associated adaptations are multi-participant sites in which participants negotiate their sense of self within a larger textual community. These multi-participant sites form the basis from which to investigate the larger social implications of selfhood and fan participation. My theoretical framework provides the means by which to situate the critical aesthetics relative to how this fictional universe draws participants in. Engaging with Gordon Calleja’s discussions of immersion and Luis O. Arata’s thoughts on interactivity, I demonstrate how the transmedial storyworld of Middle-earth not only constructs a sense of space but that it is precisely this sense of space that engages the reader, viewer or gamer. To situate the sense of self incurred between and because of narrative and storyworld environment, I draw from Andreas Gregersen’s work on embodiment and interface, as well as from Shawn P. Wilbur’s work on identity in virtual communities. Anne Balsamo and Rebecca Borgstrom each offer a theorization of role-playing specific to the multiplayer environments of game-based adaptations, while William H. Huber’s work contextualizes the production of space in epic fantasy narratives. Together, my theoretical framework highlights how the spread of a transmedial fantastic narrative impacts the connection patterns across the textual community of a particular storyworld, as well as foregrounding how the narrative environment shapes the degree of participant engagement in and with the space of that storyworld. This proposal is for a long paper presentation; however, I am able to condense it if necessary to fit a short paper presentation. 4:15pm - 4:30pm
Distinguished Short Paper (10+5min) [abstract] Layers of History in Digital Games University of Helsinki, The past five years have seen a huge increase in historical game studies. Quite a few texts have tried to approach how history is presented and used in games, considering everything from philosophical points to more practical views related to historical culture and the many manifestations of heritage politics. The popularity of recent games like Assassin’s Creed, The Witcher and The Elder Scrolls also manifests the current importance of deconstructing the messages and choices these games present. Their impact on the modern understanding of history, and on the general idea of time and change, is yet to be seen in its full effect. The paper at hand is an attempt to structure the many layers or horizons of historicity in digital games into a single taxonomic system for researchers. The suggestion considers the various consciousnesses of time and the narrative models modern games work with. Several distinct horizons of time, both of design and of the related real life, are interwoven to form the end product. The field of historical game studies could find this tool quite useful in its urgent need to systematize how digital culture is reshaping our minds and pasts. The model considers aspects like memory culture, uses of period art and apocalyptic events, narrative structures, in-game events and real-world discourses as parts of how a perception of time and history is created or adapted. The suggested “layering of time” is applicable to a wide range of digital games. 4:30pm - 4:45pm
Short Paper (10+5min) [abstract] Critical Play, Hybrid Design and the Performance of Cultural Heritage Game/Stories University of Skövde In my talk, I propose to discuss the critical relationship between games designed and developed for cultural heritage and emergent Digital Humanities (DH) initiatives that focus on (re-)inscribing and reflecting on the shifting boundaries of human agency and its attendant relations. In particular, I will highlight theoretical and practical humanistic models (for development and as objects of scholarly research) that are conceived in tension with more computational emphases and influences. I examine how digital heritage games move us from an understanding of digital humanities as a “tool” or “text” oriented discipline to one where we identify critical practices that actively engage and promote convergent, hybrid and ontologically complex techno-human subjects to enrich our field of inquiry as DH scholars. Drawing on principles such as embodiment, affect, and performativity, and analyzing transmedial storytelling and mixed reality games designed for heritage settings (and developed in my university research group), I argue for these games as an exemplary medium for enriching interdisciplinary digital humanities practices using methods currently called upon by recent DH scholarship. In these fully hybrid contexts where human/technology boundaries are richly intermingled, we recognize the importance of theoretical approaches to interpretation that are performative, not mechanistic (Drucker, in Gold, 2012): that is, we look at emergent experiences driven by human intervention, not affirmed by technological development and technical interface affordances. Such hybridity, driven by human/humanities approaches, is explored more fully, for example, in Digital_Humanities by Burdick et al. (2012) and by N. Katherine Hayles in How We Think: Digital Media and Contemporary Technogenesis (2012). Collectively these scholars reveal how transformative and emerging disciplines can work together to re-think the role of the organic-technical beings at the center (and found at the margins and in-between subjectivities) within new forward-thinking DH studies. Currently, Hayles and others, like Matthew Gold (2012), offer frameworks for more interdisciplinary Digital Humanities methods (including Comparative Media and Culture Studies approaches) that are richly informed by investigations into the changing role and function of the user of technologies and media and the human/social contexts for use. Hayles, for example, explicitly claims that in Digital Humanities humans “think through, with, and alongside media” (1). In essence, our thinking and being, our digitization and our human-ness, are mutually productive and intertwined. Furthermore, we are multisensory in our access to knowing, and we develop an understanding of the physical world in new ways that reorient our agencies and affects, redistributing them for other encounters with cultural and digital/material objects that are now ubiquitous and normalized. Ross Parry, a museum studies scholar, supports a similar model for inquiry and future advancement, based on the premise that digital tool use is now fully implemented and accepted in museum contexts, and so we must now deepen and develop our inquiries and practice (Parry, 2013). He claims that digital technologies have become normative in museums and that we currently find ourselves, then, in the age of the postdigital.
Here critical scrutiny is key and necessary to mark this advanced state of change. For Parry this is an opportune yet delicate juncture that requires a radical deepening of our understanding of the museum’s relationship to digital tools: Postdigitality in the museum necessitates a rethinking of upon what museological and digital heritage research is predicated and on how its inquiry progresses. Plainly put, we have a space now (a duty even) to reframe our intellectual inquiry of digital in the museum to accommodate the postdigital condition. [Parry, 36] For Parry, as with current DH calls for development, we must now focus on the contextualized practices in which these technologies will inevitably engage designers and users, and promote robust theoretical and practical applications. I argue that games, and in particular digital games designed for heritage experiences, are unique training grounds for such postdigital future development. They provide rich contexts for DH scholars working to deepen their understanding of performative and active interventions and intra-actions beyond texts and tools. As digital games have been adopted and ubiquitously assimilated in museums and heritage sites, we have opportunities to study the experiences of users as they performatively engage postdigital museum sites through rich forms of hybrid play. In such games, nuanced forms of interdisciplinary communication and storytelling happen in deeply integrated and embedded user/technology relationships. In heritage settings, interpretation is key to understanding histories from multiple user-driven perspectives, and it happens in acts of dynamic emergence, not as the result of mechanistic affordance. As such, DH designers and developers have much to learn from a rich body of games and heritage research, particularly that focused on critical and rhetorical design for play, Mixed Reality (MR) approaches, and users’ bodies as integral to narrative design (Anderson et al., 2010; Bogost, 2010; Flanagan, 2013; Mortara et al., 2014; Rouse et al., 2015; Sicart, 2011). MR provides a uniquely layered approach working across physical and digital artifacts and spaces, encouraging polysemic experiences that can support curators’ and historians’ desires to tell ever more complex and connected stories for museum and heritage site visitors, even involving visitors’ own voices in new ways. In combination, critical game design approaches and MR technologies, within the museum context, help re-center historical experience on the visitor’s body, voice, and agency, shifting emphasis away from material objects, also seen as static texts or sites for one-way, broadcast information. Re-centering the design on users’ embodied experience, with critical play in mind and in MR settings, offers rich scholarship for DH studies and provides a variety of heritage, museum, entertainment, and participatory design examples to enrich the field of study for open, future and forward thinking. Drawing on examples from heritage games developed within my university research group and in the heritage design network I co-founded, and implemented in museum and heritage sites, I will work to expose these connections.
From transmedial children’s books focused on Nordic folktales, to playful AR experiences that expose the history of architectural achievements, as well as meta-reflections on the telling of those achievements in archival documentation (such as the development of the Brooklyn Bridge in the 19th century), I will provide an overview of how digital heritage games, in combination with new hybrid DH initiatives, can be used for future development and research. This includes research around new digital literacies, collaborative and co-design approaches (with users), and experimental storytelling and narrative approaches for locative engagement in open-world settings, dependent on input from users/visitors. References Anderson, E. F., McLoughlin, L., Liarokapis, F., Peters, C., Petridis, P., de Freitas, S. Developing Serious Games for Cultural Heritage: A State-of-the-Art Review. In: Virtual Reality 14 (4). (2010) Burdick, A., Drucker, J., Lunenfeld, P., Presner, T., Schnapp, J. Digital_Humanities. MIT Press, Cambridge, MA (2012) Bogost, I. Persuasive Games: The Expressive Power of Videogames. MIT Press, Cambridge, MA (2010) Flanagan, M. Critical Play: Radical Game Design. MIT Press, Cambridge, MA (2013) Gold, M. K. Debates in the Digital Humanities. University of Minnesota Press, Minneapolis, MN (2012) Hayles, K. N. How We Think: Digital Media and Contemporary Technogenesis. University of Chicago Press, Chicago, IL (2012) Parry, R. The End of the Beginning: Normativity in the Postdigital Museum. In: Museum Worlds: Advances in Research, vol. 1, pp. 24–39. Berghahn Books (2013) Mortara, M., Catalano, C.E., Bellotti, F., Fiucci, G., Houry-Panchetti, M., Panagiotis, P. Learning Cultural Heritage by Serious Games. In: Journal of Cultural Heritage, vol. 15, no. 3, pp. 318–325. (2014) Rouse, R., Engberg, M., JafariNaimi, N., Bolter, J. D. (Guest Eds.) Special Section: Understanding Mixed Reality. In: Digital Creativity, vol. 26, issue 3-4, pp. 175–227. (2015) Sicart, M. The Ethics of Computer Games. MIT Press, Cambridge, MA (2011) 4:45pm - 5:00pm
Short Paper (10+5min) [publication ready] Researching Let’s Play gaming videos as gamevironments University of Helsinki Let’s Plays, as a specific form of gaming videos, are a rather new phenomenon, and it is not surprising that they are still relatively under-researched. So far, only a few publications focus on the theme. The specifics of Let’s Play gaming videos make them an unparalleled object of research in the vicinity of games – in the so-called gamevironments. The theoretical and methodological approach of the same name, literally merging the terms “games/gaming” and “environments”, was first discussed by Radde-Antweiler, Waltemathe and Zeiler (2014), who argue for broadening the study of video games, gaming and culture beyond media-centred approaches in order to better highlight recipient perspectives and actor-centred research. Gamevironments thus puts the spotlight on actors in their mediatized – and specifically gametized – lives. 5:00pm - 5:15pm
Short Paper (10+5min) [abstract] The plague transformed: City of Hunger as mutation of narrative and form Ocean County College, United States of America, This short paper proposes and argues the hypothesis that Minna Sundberg’s interactive game in development, City of Hunger, an offshoot or spin-off of her well-respected digital comic, Stand Still Stay Silent, can be understood in terms of the ecology of the comic as a mutation of it; as such, her appropriation of a classic game genre and her storyline’s emphasis on the mechanical over the natural suggest promising avenues for understanding the uses of interactivity in the interpretation of narrative. In the game, the plague-illness of the comic’s ecology may or may not be gone, but conflict (vs. cooperation) becomes the primary mode of interaction for characters and reader-players alike. In order to produce the narrative, the reader-player will have to do battle as the characters do. Sundberg herself signals that her new genre is indivisible from the different ecology of the game world’s narrative: “City of Hunger will be a 2d narrative rpg with a turn-based battle system, mechanically inspired by your older final fantasy games, the Tales of-series and similar classical rpg's.” There will be a world of “rogue humans, mechanoids and mysterious alien beings to fight” (2017). While it remains to be seen how the game develops, its emphasis on machine-beings and aliens in a classic game environment (a “shadow of the past”) suggests strongly that the use of interactivity within each narrative has an interpretive and not merely performative dimension. 5:15pm - 5:30pm
Short Paper (10+5min) [abstract] Names as a Part of Game Design University of Helsinki, Video games often consist of several separate spaces of play. They are called, depending on the speaker and the type of game, for example levels, maps, tracks or worlds. In this paper, the term level is used. As there are usually many levels in a game, they need some kind of identifying element. In some games, levels only have ordinal numbers (Level 1, Level 2, etc.), but in others, they (also) have names. Names are an important part of game design for at least three reasons. Firstly, giving names to places makes the imaginary world feel richer and deeper (Schell 2014: 351), improving the gameplay experience. Secondly, a name gives the player a first impression of the level (Rogers 2014: 220), helping him or her perceive the level’s structure. And thirdly, level names are needed for discussing the levels. Members of a gaming community often want to share their experiences of and emotions about gameplay. When doing so, it is important to contextualize the events: in which level did X happen? Even though some game design scholars seem to recognize the importance of names, there are very few studies of them. This presentation aims to fill this gap. I have analyzed level names in Playforia Minigolf, an online minigolf game designed in Finland in 2002. The data include the names of all 2,072 levels in the game. The analysis focuses especially on the principles of naming, or in other words, on what kind of connection there is between a name and the level’s characteristics. The presentation also examines the change in naming practices during the game’s 15-year history. The oldest names mostly describe the levels in a simple, neutral manner, while the newest names are far more ambiguous and rarely have anything to do with the level’s characteristics. This change is probably caused by the change of level designers. The first levels of the game were designed by its developers, game design professionals, but over time, the responsibility for designing levels has passed to the most passionate hobbyists of the game. This result might be interesting for game studies and especially for research on modding and modifications (see e.g. Unger 2012). REFERENCES Playforia (2002). Minigolf. Finland: Apaja Creative Solutions Oy. Rogers, Scott (2014). Level Up! The Guide to Great Video Game Design. Chichester: Wiley. Schell, Jesse (2014). The Art of Game Design: A Book of Lenses. CRC Press. Unger, Alexander (2012). Modding as a Part of Gaming Culture. – Fromme, Johannes & Alexander Unger (eds.): Computer Games and New Media Cultures. A Handbook of Digital Games Studies, 509–523. |
5:30pm - 8:00pm | DHN2018 closing party Think Corner |