Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only the sessions on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).

Session Overview
Session
T-PIII-2: Language Resources
Time:
Thursday, 08/Mar/2018:
2:00pm - 3:30pm

Session Chair: Kaius Sinnemäki
Location: PIII

Presentations
2:00pm - 2:30pm
Long Paper (20+10min) [publication ready]

Sentimentator: Gamifying Fine-grained Sentiment Annotation

Emily Sofi Öhman, Kaisla Kajava

University of Helsinki

We introduce Sentimentator, a publicly available gamified web-based annotation platform for fine-grained sentiment annotation at the sentence level. Sentimentator is unique in that it moves beyond binary classification: we use a ten-dimensional model which allows for the annotation of over 50 unique sentiments and emotions. The platform is heavily gamified, with a scoring system designed to reward players for high-quality annotations. Sentimentator introduces several features that have previously been unavailable, or at best very limited, for sentiment annotation. In particular, it provides streamlined multi-dimensional annotation optimized for sentence-level annotation of movie subtitles. The resulting dataset will open new avenues to explore, particularly in the field of digital humanities, but also in knowledge-based sentiment analysis in general. Because both the dataset and the platform will be made publicly available, they will benefit anyone interested in fine-grained sentiment analysis and emotion detection, as well as in the annotation of other datasets.

Öhman-Sentimentator-226_a.pdf
Öhman-Sentimentator-226_c.pdf

2:30pm - 2:45pm
Distinguished Short Paper (10+5min) [publication ready]

Defining a Gold Standard for a Swedish Sentiment Lexicon: Towards Higher-Yield Text Mining in the Digital Humanities

Jacobo Rouces, Lars Borin, Nina Tahmasebi, Stian Rødven Eide

University of Gothenburg

There is an increasing demand for multilingual sentiment analysis, and most work on sentiment lexicons is still carried out based on English lexicons like WordNet. In addition, many of the non-English sentiment lexicons that do exist have been compiled by (machine) translation from English resources, thereby arguably obscuring possible language-specific characteristics of sentiment-loaded vocabulary.

In this paper we describe the creation of a gold standard for the sentiment annotation of Swedish terms as a first step towards the creation of a full-fledged sentiment lexicon for Swedish -- i.e., a lexicon containing information about the prior sentiment (also called polarity) values of lexical items (words or disambiguated word senses) along a scale from negative to positive. We create a gold standard for sentiment annotation of Swedish terms, using the freely available SALDO lexicon and the Gigaword corpus. For this purpose, we employ a multi-stage approach combining corpus-based frequency sampling and two stages of human annotation: direct score annotation followed by Best-Worst Scaling. In addition to obtaining a gold standard, we analyze the data from our process and draw conclusions about the optimal sentiment model.
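As a rough illustration of how Best-Worst Scaling judgments can be aggregated into prior sentiment scores, here is a minimal sketch using the common counting-based estimator; the data format and the example Swedish words are assumptions for illustration, not the authors' implementation:

```python
from collections import defaultdict

def bws_scores(judgments):
    """Aggregate Best-Worst Scaling judgments into scores in [-1, 1].

    judgments: iterable of (best, worst, items) triples, where `items`
    is the tuple of terms shown to the annotator, and `best`/`worst`
    are the terms the annotator picked as most/least positive.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for b, w, items in judgments:
        best[b] += 1
        worst[w] += 1
        for item in items:
            shown[item] += 1
    # Counting-based score: (times chosen best - times chosen worst)
    # divided by times shown, yielding a value in [-1, 1].
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

scores = bws_scores([
    ("glad", "hemsk", ("glad", "hemsk", "bord", "trevlig")),
    ("trevlig", "hemsk", ("glad", "hemsk", "bord", "trevlig")),
])
# "glad" and "trevlig" come out positive, "hemsk" negative, "bord" neutral
```

Count-based scores like these are a standard way to turn tuple-wise best/worst choices into a single real-valued scale per item.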

Rouces-Defining a Gold Standard for a Swedish Sentiment Lexicon-209_a.pdf
Rouces-Defining a Gold Standard for a Swedish Sentiment Lexicon-209_c.pdf

2:45pm - 3:00pm
Short Paper (10+5min) [publication ready]

The Nordic Tweet Stream: A dynamic real-time monitor corpus of big and rich language data

Mikko Laitinen1, Jonas Lundberg2, Magnus Levin2, Rafael Martins2

1University of Eastern Finland; 2Linnaeus University

This article presents the Nordic Tweet Stream (NTS), a cross-disciplinary corpus project of computer scientists and a group of sociolinguists interested in language variability and in the global spread of English. Our research integrates two types of empirical data: we not only rely on traditional structured corpus data but also use unstructured data sources that are often big and rich in metadata, such as Twitter streams. The NTS downloads tweets and associated metadata from Denmark, Finland, Iceland, Norway and Sweden. We first introduce some technical aspects of creating a dynamic real-time monitor corpus, and the following case study illustrates how the corpus can be used as empirical evidence in sociolinguistic studies focusing on the global spread of English to multilingual settings. The results show that English is the most frequently used language, accounting for almost a third of the tweets. These results can be used to assess how widespread English use is in the Nordic region and offer a big-data perspective that complements previous small-scale studies. Future objectives include annotating the material, making it available to the scholarly community, and expanding the geographic scope of the data stream beyond the Nordic region.
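The language-share figure above can be computed directly from the language tag that accompanies each downloaded tweet's metadata. A minimal sketch, assuming tweets are represented as dictionaries with a `lang` field (the field name and the toy sample are assumptions, not the NTS pipeline itself):

```python
from collections import Counter

def language_shares(tweets):
    """Compute each language's share of the tweets, using the
    per-tweet language tag that arrives with the metadata."""
    counts = Counter(t["lang"] for t in tweets)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}

# Toy sample standing in for a stream of Nordic tweets
sample = [{"lang": "en"}, {"lang": "sv"}, {"lang": "en"}, {"lang": "fi"},
          {"lang": "en"}, {"lang": "da"}, {"lang": "no"}, {"lang": "sv"},
          {"lang": "en"}]
shares = language_shares(sample)
# shares["en"] is 4/9 here, i.e. English is the most frequent language
```

Because the shares are computed incrementally countable per tweet, the same logic scales to a continuously growing monitor corpus.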

Laitinen-The Nordic Tweet Stream-201_a.pdf
Laitinen-The Nordic Tweet Stream-201_c.pdf

3:00pm - 3:15pm
Short Paper (10+5min) [publication ready]

Best practice for digitising small-scale Digital Humanities projects

Peggy Bockwinkel, Dîlan Cakir

University of Stuttgart, Germany

Digital Humanities (DH) are growing rapidly; the necessary infrastructure is being built up gradually and slowly. For smaller DH projects, e.g. for testing methods, as preliminary work for funding applications or for use in teaching, a corpus often has to be digitised. These small-scale projects make an important contribution to safeguarding and making available cultural heritage, as they make it possible to machine-read those resources that are of little or no interest to large projects because they are too specialised or too limited in scope. They close the gap between large scanning projects of archives, libraries or research projects and projects that move beyond the canonised paths.

Yet these small projects can fail at this first step of digitisation, because it is often a hurdle for (Digital) Humanists at universities to get the desired texts digitised: either because the digitisation infrastructure in libraries or archives is not available (yet) or is a paid service, or because researchers are not digitisation experts and a suitable infrastructure at the university is missing.

In order to promote small DH projects for teaching purposes, a digitisation infrastructure was set up at the University of Stuttgart as part of a teaching project. It should enable teachers to digitise smaller corpora autonomously. This article presents a study that was carried out as part of this teaching project. It suggests how to implement best practices and which aspects of the digitisation workflow need special attention.

The target group of this article are (Digital) Humanists who want to digitise a smaller corpus. Even with no expertise in scanning and OCR and no possibility of outsourcing the digitisation, they would still like to obtain the best possible machine-readable files.

Bockwinkel-Best practice for digitising small-scale Digital Humanities projects-254_a.pdf
Bockwinkel-Best practice for digitising small-scale Digital Humanities projects-254_c.pdf

3:15pm - 3:30pm
Distinguished Short Paper (10+5min) [publication ready]

Creating and using ground truth OCR sample data for Finnish historical newspapers and journals

Kimmo Kettunen, Jukka Kervinen, Mika Koistinen

University of Helsinki, Finland

Since the late 1990s, the National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland. The present collection consists of about 12 million pages, mainly in Finnish and Swedish. Of these, about 5.1 million pages are freely available on the website digi.kansalliskirjasto.fi. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The open collection covers the period from 1771 to 1920; the last ten years, 1911–1920, were opened in February 2017.

This paper presents the ground truth Optical Character Recognition (OCR) data of about 500,000 Finnish words that has been compiled at the NLF for the development of a new OCR process for the collection. We discuss the compilation of the data and show basic results of the new OCR process in comparison to the current OCR, using the ground truth data.
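Comparing an OCR process against ground truth data typically reduces to measuring how many recognized words match the manually verified transcription. A minimal sketch of a word-level accuracy check, assuming the two token sequences are already aligned one-to-one (real evaluations usually align with edit distance first; the example strings are invented):

```python
def word_accuracy(ground_truth, ocr_output):
    """Word-level accuracy of OCR output against a ground truth
    transcription, using a simple position-by-position comparison
    (assumes the token sequences are aligned one-to-one)."""
    gt_tokens = ground_truth.split()
    ocr_tokens = ocr_output.split()
    correct = sum(1 for g, o in zip(gt_tokens, ocr_tokens) if g == o)
    return correct / len(gt_tokens)

acc = word_accuracy("suomen kansalliskirjasto on digitoinut lehtiä",
                    "suomen kansalliskirjasto on digitoimut lehtia")
# 3 of the 5 words match exactly, so acc is 0.6
```

Running the same comparison over both the current and the new OCR output against a shared ground truth set gives directly comparable accuracy figures.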

Kettunen-Creating and using ground truth OCR sample data for Finnish historical newspapers and journals-115_a.pdf
Kettunen-Creating and using ground truth OCR sample data for Finnish historical newspapers and journals-115_c.pdf


Conference: DHN 2018