Workshop 1 (Python).

Language Processing and Visualization with Python I

Uldis Bojārs
Valdis Saulespurēns

July 26

Python has grown to become the 3rd most popular programming language in the world according to the renowned TIOBE language index. Most of this popularity and growth is due to Python's ease of use and its accessibility to professionals in fields unrelated to computer science.
Python has been called the "glue language" that holds together various components of IT systems. It also comes with a variety of built-in libraries ("batteries included") and has many external libraries for tasks such as Natural Language Processing (NLP), machine learning, and data visualization.
Introduction to Python:

  • Using Jupyter Notebooks for interactive coding
  • Basics of Python programming language:
    Python data structures
    program flow
    basic functions
  • Importing data in Python
  • Text pre-processing, tokenization, lemmatization

The following Python libraries will be used: nltk, pandas, spacy, plotly, and scikit-learn.
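As a taste of the pre-processing steps listed above, here is a minimal sketch of tokenization and lemmatization using only the Python standard library. The workshop itself will use nltk and spacy; the lemma table below is a toy illustration, not a real lemmatizer.

```python
import re

# Toy lemma table for illustration only; real lemmatization
# (e.g. with nltk or spacy) relies on dictionaries and morphology.
LEMMAS = {"cats": "cat", "ran": "run", "running": "run"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def lemmatize(tokens):
    """Map each token to its base form where known."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = tokenize("The cats ran home.")
print(tokens)             # ['the', 'cats', 'ran', 'home']
print(lemmatize(tokens))  # ['the', 'cat', 'run', 'home']
```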

Valdis Saulespurens is a programming instructor at Riga Coding School, where he teaches Python and JavaScript, among other skills, to adult professionals new to programming. Valdis specializes in data analysis and web scraping and enjoys wrangling unruly data into structured knowledge. He has over 20 years of programming experience; he wrote his first professional programs for quantum scientists at the University of California, Santa Barbara. Prior to teaching, he wrote software for a radio broadcast equipment manufacturer. He holds a Master's degree in Computer Science from the University of Latvia. When not spending time with his family, Valdis likes to bike and play chess, sometimes at the same time.

Workshop 2 (Python).

Language Processing and Visualization with Python II

Uldis Bojārs
Valdis Saulespurēns

July 27

Prior basic knowledge about Python (or participation in the workshop Language Processing and Visualization with Python I) is recommended.
As we progress in the course, we will survey popular text processing and machine learning algorithms related to NLP. We will discuss:

  • Analyzing data, text analysis
  • Elements of machine learning
  • Data export and visualization

The following Python libraries will be used: nltk, pandas, spacy, plotly, and scikit-learn.
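As a small preview of the machine-learning side of the workshop, here is a sketch of turning texts into numeric features with scikit-learn. The example documents are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs differ",
]

# TF-IDF turns each document into a weighted word-count vector,
# down-weighting words that appear in many documents.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

print(matrix.shape)  # (number of documents, size of vocabulary)
print(sorted(vectorizer.vocabulary_))
```

Feature matrices like this are the usual input to scikit-learn classifiers and clustering algorithms.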

Workshop 1 (R).

Basics of R for Data Analysis and Visualization
Importing, exploring and visualizing multilingual data in R

Andres Karjus

July 26

The first workshop is an introduction – or a refresher, if you've dabbled before – to R, an easy-to-learn, versatile programming language widely used across data science and academia. There will be minimal lecturing involved, as the mode here is learning by doing. Students will be guided through various exercises, while being encouraged to collaborate and discuss possible solutions. The examples focus mostly on exploring and visualizing data, using tidyverse packages such as dplyr and ggplot2. After first getting comfortable with neat, clean, English-language datasets, we will look into working with messier data, as well as data in languages other than English, using the example of a historical Latvian publications database.

Andres Karjus is a computational linguist and computational humanist (PhD, University of Edinburgh 2020). He studies language and culture using a combination of text corpora, computational simulations and human experiments. All these approaches produce lots of information, usually too much to analyze qualitatively; this is where careful application of machine learning and rigorous statistical modelling can help make sense of the data (and make predictions based on it). He is currently engaged in a number of projects, on changes in language in social and traditional media, on quantifying visual art complexity, and on the dynamics of television and film festival programming.

Workshop 2 (R).

Basics of R for Data Analysis and Visualization
Working with (non-English) text in R

Andres Karjus

July 27

This workshop is all about text, and not just text in English and the Latin alphabet. We will go over various corpus formats and how to deal with them, including Unicode issues and working with data that is larger than would fit into working memory. We will also look into some simple natural language processing tasks, such as searching with regular expressions, stemming, lemmatization, topic modelling and using pretrained word embeddings. This workshop assumes either participation in the first workshop the day before, or some prior knowledge of R.

Workshop 3. 

Web Data Harvesting 

Marija Isupova

July 28

In order to analyse large amounts of data, we first need to collect and store it. This workshop will focus on collecting data from the Web using several methods: web scraping with Python, APIs for accessing data directly, and various tools that help automate this process. During the workshop, we will extract data from social media and messaging platforms (e.g. Twitter and Telegram) and from news articles. We will then look into analysing the harvested data, using the knowledge acquired in the prior R or Python workshops.
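To illustrate the core idea behind scraping, here is a minimal sketch that pulls headlines out of an HTML snippet using only the Python standard library. The snippet is invented; the workshop will use fuller-featured scraping tools.

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the text content of <h2> elements."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headlines.append(data.strip())

# An invented page fragment, standing in for a downloaded news page.
html = "<h1>News</h1><h2>Budget approved</h2><p>...</p><h2>Storm warning</h2>"
parser = HeadlineParser()
parser.feed(html)
print(parser.headlines)  # ['Budget approved', 'Storm warning']
```

In practice the HTML would come from an HTTP request rather than a string, and dedicated libraries make the extraction step far more robust.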

Marija Isupova is a software engineer at NATO StratCom COE, where she is researching methods of understanding the information environment landscape in online and social media. She has a statistics and data analytics background using Python and R. She holds a Master's degree in Computer Science from the University of Latvia.

Workshop 4.

Network Visualisation – Sense-Making Through Design and Aesthetics

Noemi Chow, Fidel Thomet

July 29

This workshop kicks off with an introduction to data visualisation basics, followed by a deep dive into network visualisation. We will look into approaches to visualising entities and relations (or nodes and edges) and discuss the reasons for visualising complex networks and their aesthetics.
In the hands-on part of the workshop, we will first focus on sense-making. You will create your own information network using analogue materials. Based on that, you will investigate the network's narrative qualities by using it as a storytelling device.
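Before reaching for visual tools, it can help to see how compact a network's underlying data really is. Here is a sketch, with an invented example network, representing nodes and edges in plain Python:

```python
# An invented example network: each edge connects a pair of entities.
edges = [("Anna", "Ben"), ("Ben", "Carla"), ("Anna", "Carla"), ("Carla", "Dan")]

# The node set is everything mentioned in any edge.
nodes = {n for edge in edges for n in edge}

# Degree = number of edges touching a node; high-degree hubs
# tend to dominate a network visualisation.
degree = {n: sum(n in e for e in edges) for n in nodes}

print(sorted(nodes))     # ['Anna', 'Ben', 'Carla', 'Dan']
print(degree["Carla"])   # 3 – Carla appears in three edges
```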

Noemi Chow is a scientific illustrator. In her research work in the field of Knowledge Visualisation at the Zurich University of the Arts, she deals with knowledge mediation through visualisation and immersive, three-dimensional media.

Fidel Thomet is an interaction designer working on data visualisation, investigative interfaces, and speculative futures. He is a research associate at uclab, a visualisation research group situated between design, computing, and the humanities at the University of Applied Sciences Potsdam. Before that, he worked for the City of Zurich's statistical office and the Aargauer Kunsthaus, and as a Google News Lab Fellow for the Frankfurter Allgemeine Zeitung.


Machine Learning to Read Yesterday’s News. How Semantic Enrichments Enhance the Study of Digitised Historical Newspapers

Marten Düring

July 26

Newspapers count among the most attractive sources for historical research. Following mass digitisation efforts over the past decades, researchers now face an overabundance of material that can no longer be managed with keyword search and basic content filtering alone, even though only a fraction of the overall archival record has actually been made available. This poses challenges for the contextualisation and critical assessment of these sources, which can be effectively addressed using semantic enrichments based on natural language processing techniques. In this lecture, we will discuss epistemological challenges in data exploration and interface design, as well as opportunities in terms of source criticism and content exploration, based on the impresso interface.

Marten Düring is an Assistant Professor in Digital History at the Luxembourg Centre for Contemporary and Digital History (C2DH) and holds a PhD in contemporary history. His research is positioned at the intersection of historical thinking, novel computational methods, and software design. In his ongoing work, Prof. Düring coordinates the C²DH-based team of the impresso project for the exploration of semantically enriched historical newspapers, works as a founding editor of the Journal of Historical Network Research, coordinates the Historical Network Research Community, and contributes to the DHARPA project.

Fine-Tuning the Historian's Macroscope: Data Reuse and Medieval Korean Biographical Records in Neo4j

Javier Cha

July 27

This lecture proposes that historians create personal libraries tailored to their projects rather than engage in macro-level “distant reading” of a centralized repository. This methodological intervention prioritizes contextualization and authentication in data-driven historical research, which technologically translates into robust management, connecting, and querying of records culled from a variety of pertinent databases. After experimenting with the standard set of general-purpose “macroscopes” consisting of SQL, Gephi, and Cytoscape, I present a new workflow that utilizes iterative Python code to extract data subsets from universal databases and import them into the graph database management software Neo4j. I discuss why I make certain choices and explain how digital historians can use a macroscope powered by Neo4j to zero in on potentially insightful fields of view.
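As an illustration of the final step of such a workflow, here is a hedged sketch of turning extracted records into Cypher statements for loading into Neo4j. The record fields, labels, and relationship name are invented for illustration, not taken from the lecture's actual data model.

```python
# Invented records, standing in for subsets extracted from a
# larger database with Python.
records = [
    {"person": "Person A", "office": "Office X"},
    {"person": "Person B", "office": "Office Y"},
]

def to_cypher(record):
    """Build a Cypher MERGE statement linking a person to an office.

    MERGE creates the node or relationship only if it does not
    already exist, so repeated imports stay idempotent.
    """
    p, o = record["person"], record["office"]
    return (f'MERGE (p:Person {{name: "{p}"}}) '
            f'MERGE (o:Office {{name: "{o}"}}) '
            f'MERGE (p)-[:HELD]->(o)')

statements = [to_cypher(r) for r in records]
print(statements[0])
```

Statements like these can then be executed against a Neo4j instance, e.g. via the official Python driver or the Neo4j browser.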

Javier Cha is Associate Professor at Seoul National University and the principal investigator of the Big Data Studies Lab, which approaches data centers and the global telecommunications infrastructure similarly to how a medieval book historian would explore the material bibliography of manuscripts and libraries. As an intellectual historian of medieval Korea and a technologist, Cha has been active in the digital humanities community for fourteen years. Cha is the recipient of the prestigious Innovative and Pioneering Research Scheme, which provides financial support for his digital humanities research lab. He serves on the editorial boards of the International Journal of Humanities and Arts Computing and Cursor Mundi as well as the international nominations committee for Digital Humanities Awards.

Digital History Between Measuring and Interpreting

Jani Pekka Marjanen

July 27

Working with digitized data is liberating. We can suddenly do things that would have been either impossible or too time-consuming before. Still, in our liberated state, we need to be extra careful in thinking about what our new methods actually measure and how we can interpret the results. In this talk, I will present three case studies that use historical digitized newspapers to make historical arguments. The first of them uses topic models, the second is based on word embeddings, and the third on simple bigram counts. Through these cases, I will discuss the transparency of different methods, and how they make it more or less difficult to communicate to a reader what is being measured and where humanistic interpretation starts. I will argue that machine-learning methods are sometimes better for exploration and for identifying themes for qualitative analysis, whereas count-based methods can be more useful for analyzing quantitative trends in data.
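As a concrete example of the most transparent of these methods, counting bigrams takes only a few lines of standard-library Python (the sentence below is invented):

```python
from collections import Counter

# An invented snippet standing in for newspaper text.
text = "the nation and the people and the press"
tokens = text.split()

# A bigram is simply a pair of adjacent tokens.
bigrams = list(zip(tokens, tokens[1:]))
counts = Counter(bigrams)

print(counts[("and", "the")])  # 2
```

The transparency argument is easy to see here: every number in the output can be traced back to specific positions in the text, which is much harder for topic models or word embeddings.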

Dr. Jani Marjanen is a senior researcher at the University of Helsinki. He is a historian specializing in the history of patriotism and nationalism, the history of ideology, the history of newspapers and book printing, and the theory and methodology of conceptual history. In his work, he combines traditional historical inquiry with new methods from the digital humanities.

There is No Journalism Without Data Journalism

Raivis Vilūns

July 28

While data journalism is sometimes described as just one type of journalistic content, the truth is that any good news story can, and quite possibly should, be either born out of a phenomenon hidden in data or strengthened by data. In the era of fake news, data can be a powerful tool for any journalist to contextualize and explain the truth, from hard-hitting investigations into corruption and abuses of power to sports stories and soft news. The past, present, and future of journalism are tied to data.

Raivis Vilūns has been working in various Latvian media outlets for more than ten years. He has covered a wide range of topics, from feature stories about science and history to investigative and analytical reporting on municipal governance and political events. He now works for the most-read news website in Latvia, where he writes about business, economics, and the financial world. Vilūns is the author of a weekly satirical webcomic published in "Delfi Business". When not writing or drawing, he is working on his doctoral thesis about online media and journalism in Latvia.