Workshops

Workshop 1. Language processing and data science with Python

Uldis Bojārs
Valdis Saulespurēns

July 23, 11.40 - 16.50

SUMMARY: The course assumes no previous knowledge of the Python programming language.
Python has grown to be the third most popular programming language in the world according to the renowned TIOBE language index. Much of this popularity and growth is due to Python’s ease of use and its accessibility to professionals across fields unrelated to computer science.
Python has been called a "glue language" that holds together various components of IT systems. It also comes with a variety of built-in libraries ("batteries included") and has many external libraries for tasks such as Natural Language Processing (NLP), machine learning and data visualization.
As an introduction to Python, the following topics will be discussed (a short sketch follows the list):

  • setting up a working Python environment with Jupyter notebooks
  • Python data structures
  • program flow
  • basic functions.
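
The sketch below is a toy illustration of these basics in a few lines of Python: a data structure, program flow, and a function. It is not actual course material:

    # A toy example combining the basics listed above: data structures
    # (list and dict), program flow (for/if) and a function.

    def word_lengths(words):
        """Return a dict mapping each non-empty word to its length."""
        lengths = {}
        for word in words:
            if word:  # skip empty strings
                lengths[word] = len(word)
        return lengths

    print(word_lengths(["digital", "humanities", ""]))
    # {'digital': 7, 'humanities': 10}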

As we progress in the course, we will review popular text processing and machine learning algorithms related to NLP. We will discuss (a small sketch of such a pipeline follows the list):

  • important Python libraries for analyzing textual content
  • importing and wrangling data
  • analyzing data
  • data export and visualization.
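
To give a flavor of what such a pipeline looks like, here is a minimal sketch of the wrangle, analyze and export steps using only Python’s standard library; the course’s actual libraries and datasets may differ, and the inline text stands in for imported data:

    # Wrangle -> analyze -> export with the standard library only.
    import csv
    from collections import Counter

    text = "The cat sat. The cat slept. A dog barked."  # stands in for imported data

    tokens = [t.strip(".,!?;:").lower() for t in text.split()]  # wrangle: crude tokenization
    freq = Counter(tokens)                                      # analyze: word frequencies

    with open("frequencies.csv", "w", newline="", encoding="utf-8") as f:  # export
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(freq.most_common())

    print(freq.most_common(3))  # e.g. [('the', 2), ('cat', 2), ('sat', 1)]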

The participants will be able to follow the course interactively on their own computers.

Valdis Saulespurēns is a programming instructor at Riga Coding School. He teaches Python and JavaScript, among other skills, to adult professionals new to programming. Valdis specializes in data analysis and web scraping, and enjoys wrangling unruly data into structured knowledge. He has over 20 years of programming experience; he wrote his first professional programs for quantum scientists at the University of California, Santa Barbara. Prior to teaching he wrote software for a radio broadcast equipment manufacturer. He holds a Master's degree in Computer Science from the University of Latvia. When not spending time with his family, Valdis likes to bike and play chess, sometimes at the same time.

Workshop 2. Text encoding for digital scholarly editions

Wout Dillen

July 24, 09.15 - 14.45 

SUMMARY: This workshop aims to introduce the theories and practices of digital scholarly editing to newcomers to the field of Digital Humanities. After setting up a theoretical framework that places scholarly editing in the larger field of textual scholarship, participants will learn how to use the Text Encoding Initiative’s guidelines to transcribe and encode cultural heritage materials such as literary and historical documents in TEI-XML. To do so, participants will be taught the basics of XML markup, and briefly introduced to a series of related technologies that will help prepare these encoded transcriptions for the web – such as XSLT (passive reading skills), HTML, and more generic XML publication solutions such as the TEI Publisher.
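
To give newcomers a first impression, here is a minimal, hypothetical TEI-XML fragment (invented for illustration, not taken from the workshop materials), wrapped in a short Python script to show that encoded transcriptions are machine-readable:

    # A tiny, invented TEI-XML fragment encoding a deletion and an
    # addition in a manuscript, parsed with Python's standard library.
    import xml.etree.ElementTree as ET

    tei_fragment = """
    <text xmlns="http://www.tei-c.org/ns/1.0">
      <body>
        <p>The author first wrote <del>one word</del>
           and then <add place="above">another</add>.</p>
      </body>
    </text>
    """

    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    root = ET.fromstring(tei_fragment)
    for addition in root.findall(".//tei:add", ns):
        print(addition.get("place"), addition.text)  # -> above another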

Wout Dillen is the coordinator of the University of Antwerp (Belgium) division of CLARIAH-VL – the Flemish contribution to DARIAH-EU (the European Digital Research Infrastructure for the Arts and Humanities). He completed his PhD on Digital Scholarly Editing for the Genetic Orientation at the University of Antwerp in 2015, and was an Experienced Researcher in the Marie Curie ITN DiXiT (Digital Scholarly Editing Initial Training Network) at the University of Borås, Sweden (2016-2017). He is the secretary of the European Society for Textual Scholarship (ESTS) and the general editor of its journal Variants from issue 15 onwards, as well as a board member of DH Benelux and co-editor of its upcoming Journal of DH Benelux. He is also on the editorial board of RIDE – a review journal for digital scholarly editions and resources.

Workshop 3. Distributional semantics and topic modeling: theory and application

Christof Schöch

July 25, 09.15 - 14.45

SUMMARY: In recent years, methods of text analysis based on the paradigm of Distributional Semantics have become hugely popular, in the Digital Humanities and beyond. This workshop will first introduce participants to the fundamentals of Distributional Semantics as well as to several methods based on this paradigm, particularly Topic Modeling and Word Embeddings. The workshop will then focus on how to practically implement the workflow required to perform Topic Modeling, including data preprocessing, the actual modeling of the data, and postprocessing and visualization of results. We will work with several sample datasets, mostly using libraries from the Python programming language. Ultimately, the workshop aims to enable participants to use Topic Modeling to pursue their own research interests and analyze their own collections of textual data.
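
The summary does not prescribe a specific toolchain, but gensim is one common Python library for Topic Modeling; the following is a minimal sketch, with toy pre-tokenized documents standing in for a real, preprocessed corpus:

    # LDA topic modeling with gensim on toy, already-preprocessed data.
    from gensim import corpora, models

    docs = [
        ["novel", "plot", "character", "narrator"],
        ["verse", "rhyme", "meter", "stanza"],
        ["character", "dialogue", "plot", "scene"],
    ]

    dictionary = corpora.Dictionary(docs)            # map tokens to integer ids
    corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words vectors

    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)                       # top words per topic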

Christof Schöch is Full Professor of Digital Humanities at the University of Trier, Germany, and Co-Director of the Trier Center for Digital Humanities (TCDH). He is also a mentor of the early-career research group Computational Literary Genre Stylistics (CLiGS) at the University of Würzburg, chair of the COST Action Distant Reading for European Literary History, and president of the Association for German-speaking Digital Humanities (DHd). Christof’s interests in research and teaching are located at the confluence of French literary studies and the Digital Humanities. His methodological focus is on quantitative methods of text analysis and on building digital textual resources. He is also interested in new forms of scholarly writing and collaboration, and advocates for Open Access to publications and research data. Find out more at https://christof-schoech.de/en

Workshop 4. Metadata analysis and visualisation with Palladio

Meliha Handzic

July 26, 09.30 - 14.15

SUMMARY: The purpose of this practical exercise is to learn how to use the Palladio tool - a web-based platform for the visualization of complex, multi-dimensional data. Palladio will be used to perform metadata analyses and visualisations of interest and to interpret their results. First, the participants will be given a short introduction to the tool, its features and capabilities. This will be followed by a brief description of the set of metadata selected for the analysis. Then, the participants will perform a series of analytical and visualisation tasks with Palladio, involving spatial, temporal and network analyses and visualisations of the given metadata. Finally, the participants will interpret the resulting maps and graphs to gain insights and improve their understanding of the data investigated.
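
Palladio ingests plain tabular data pasted or uploaded in the browser, so no programming is required; for orientation only, here is a hypothetical sketch of preparing such a table with pandas, with all column names and values invented for illustration:

    # Build a small metadata table with spatial, temporal and network
    # dimensions and save it as CSV for upload to Palladio.
    import pandas as pd

    records = pd.DataFrame({
        "title":       ["Letter A", "Letter B"],
        "date":        ["1872-03-01", "1874-11-15"],   # temporal dimension
        "sender":      ["Smith", "Jones"],             # network: source nodes
        "recipient":   ["Jones", "Brown"],             # network: target nodes
        "coordinates": ["56.95,24.11", "54.69,25.28"], # spatial: "lat,lon" pairs
    })

    records.to_csv("palladio_metadata.csv", index=False)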

Meliha Handzic is Professor of Management and Information Systems at International Burch University, Sarajevo. Her PhD in Information Systems is from the University of New South Wales, Sydney. Handzic’s main teaching and research interests lie in the area of knowledge management. She has published widely on this topic in books, journals and conference proceedings. Presently, Handzic is an active member of several professional societies and serves on editorial boards, executive and program committees for numerous international and national journals and conferences. Prior to joining academia, Handzic was International Expert in Information Systems for the United Nations Development Programme in Asia and Africa. She also has wide-ranging industrial experience in Europe.

Lectures

Understanding neural machine translation

Mārcis Pinnis

July 25, 16.30 - 18.00

SUMMARY: The lecture will be structured in three parts – introduction, understanding, and development. The first part will explain the different purposes of machine translation systems and their application areas. It will also discuss the main generations of machine translation systems and detail their differences. The second part will explain how neural machine translation systems work – that is, how a mathematical model that operates with numbers translates words and whole sentences. The final part of the lecture will dig deeper into the neural machine translation system development process. It will show which processes are involved in the development workflow and provide examples of why the different processes matter for achieving high-quality machine translation results.
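
As a toy illustration of the lecture’s point that a neural model operates with numbers (a generic sketch, not Tilde’s system): before any translation can happen, words are mapped to integer ids, and ids to learned embedding vectors.

    # Words -> integer ids -> embedding vectors: the numeric input an
    # encoder actually sees. Vocabulary and vectors are invented here.
    import numpy as np

    vocab = {"<unk>": 0, "the": 1, "cat": 2, "sleeps": 3}
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(len(vocab), 4))  # one 4-dim vector per word

    sentence = "the cat sleeps"
    ids = [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]
    vectors = embeddings[ids]

    print(ids)            # [1, 2, 3]
    print(vectors.shape)  # (3, 4)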

Mārcis Pinnis is a researcher working on natural language processing at Tilde. Mārcis received his master’s degree from the University of Cambridge (St. Edmund’s College) in 2009 and his PhD degree from the University of Latvia in 2015. The topic of his thesis was “Terminology Integration in Statistical Machine Translation”. His current research interests are neural machine translation, terminology integration, domain adaptation, and hybrid methods. Mārcis has been the lead researcher behind the work on Tilde’s neural machine translation systems for English-Latvian-English and English-Estonian-English that achieved the best results at the international shared task on news translation at the Conference on Machine Translation in 2017 and 2018. The work has been named by the Latvian Academy of Sciences as one of the main achievements of Latvian science in 2018.


Using eye-tracking to research reading behavior on paper and screen

Arūnas Gudinavičius

July 24, 15.00 - 16.30 

SUMMARY: Paper and screens each afford their own types of processing. In today's reading environment of paper and screens, we need to find the best ways to utilize the advantages of both paper and digital technologies across age groups and purposes. Current research shows that paper remains the preferred reading medium for longer single texts, especially when reading for deeper comprehension and retention, and that paper best supports long-form reading of informational texts. Eye tracking is one of many powerful tools for researching reading behavior, and using it can add new ways of gathering data for further research in digital humanities.

Arūnas Gudinavičius is currently an associate professor and researcher at the Digital Media Lab in the Faculty of Communication, Vilnius University. Since 1999 he has worked in one of Lithuania's leading digital publishing houses, and he is now managing director of Vilnius University Press. In 2012 he completed his PhD in Information and Communication Sciences. He also works as a consultant to e-book publishers in Lithuania. His research interests are digital books, digital publishing, e-reading, and human-computer interaction. In recent years he has participated in initiating and implementing several national scientific projects based on eye-tracking research methodology. He is a member of the E-READ and Distant Reading COST Actions. He can be contacted at [email protected].
 

The many uses of digitized newspapers

Clemens Neudecker

July 23, 09.50 - 11.20

SUMMARY: The talk will provide an overview of the landscape of digitized newspapers throughout Europe, showcasing various large collections with their main features, availability and collection characteristics. Following up from there, it will trace some exemplary digital humanities research efforts that work with digitized newspaper collections, introducing the context and setup of each investigation, the chosen methodologies, and the challenges and successes encountered.

Clemens Neudecker works as Research Advisor in the Directorate General of the Berlin State Library (Staatsbibliothek zu Berlin - Preussischer Kulturbesitz). He formerly worked at the Bavarian State Library and the National Library of the Netherlands as coordinator of the Europeana Newspapers project (2012-2015), Technical Manager of the IMPACT project (2008-2012), and Work Package Leader of the SUCCEED and SCAPE projects and the Early Modern OCR Project. He is a founder of the KB Research Lab. His main interests are OCR, Machine Learning and Digital Humanities.

A digital humanities approach to historical sociolinguistics of Estonian

Peeter Tinits

July 25, 15.00 - 16.30

SUMMARY: A sociolinguistic approach to language history suggests that in order to understand why languages are now the way they are, we need to look at the details of who used the language, how, and why in past communities. These practices can be quite diverse, and a good overview is difficult to establish. I will talk about how well-formed digital collections can greatly help with this endeavor. Specifically, I will describe how I studied the mechanisms of spelling standardization around 1870-1930 by combining a number of different data sources with varying degrees of structure at the start. I combined and assembled a large collection of digitized texts, bibliographic metadata, biographic metadata, and demographic data, as well as data on language prescription and education, to better understand the general trends in the community. I will talk through the process of assembling these databases and show the results that can be achieved when they are combined. I will underline the benefits that making such datasets public can bring to the scholarly community, as well as to our understanding of the past, whether of language use or some other topic.
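
As a hypothetical sketch of the kind of data linking described above (all names and values invented for illustration), joining bibliographic and biographic metadata in pandas makes it possible to group spelling trends by author background:

    # Link texts to their authors' biographic data and aggregate.
    import pandas as pd

    # Bibliographic metadata: one row per text; "new_spelling" marks
    # whether a text uses the reformed spelling variant (0/1).
    books = pd.DataFrame({
        "author":       ["Tamm", "Kask", "Tamm"],
        "year":         [1885, 1902, 1911],
        "new_spelling": [0, 1, 1],
    })

    # Biographic metadata: one row per author.
    authors = pd.DataFrame({
        "author":    ["Tamm", "Kask"],
        "schooling": ["parish school", "gymnasium"],
    })

    linked = books.merge(authors, on="author", how="left")

    # Share of texts using the new spelling, by author schooling.
    print(linked.groupby("schooling")["new_spelling"].mean())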

Peeter Tinits currently works on text mining techniques to study the history of technology based on newspaper texts at the University of Tartu, Estonia. His PhD work at Tallinn University focused on the spelling standardization of Estonian around 1870-1930, looking at the sociolinguistic factors involved and the influence of language prescription. He has worked on various digital humanities research projects, ranging from the history of Wikipedia to the cultural evolution of films, using both computational and experimental tools. He has a special interest in the use of open research practices in the humanities.