A unique corpus of Latvian language learners' texts is being created at the Artificial Intelligence Laboratory

December 7, 2020 at 8:21 am

Since September 2018, the Artificial Intelligence Laboratory of the Institute of Mathematics and Informatics of the University of Latvia (LU MII AiLab) has been developing the Latvian Language Learners' Corpus (LaVA; http://lava.korpuss.lv/). The corpus will provide a new basis for studying how Latvian is acquired and for quantitative and qualitative analysis of the errors that learners make. Methodological materials for language teaching will also be developed, taking into account learners' errors and the influence of their mother tongue.

LaVA includes texts written by international students who are studying at Latvian higher education institutions and learning Latvian as a foreign language in their first or second semester. The texts were produced during coursework and were obtained from Rīga Stradiņš University, the University of Latvia, Liepāja University, the Rēzekne Academy of Technologies and the Latvian Academy of Culture. The corpus is expected to comprise approximately 1,000 student papers and about 100,000 words.

Project manager and leading researcher Ilze Auziņa: “In the last 15–20 years, language learner corpora have become very popular. Researchers use them to study the impact of the mother tongue on foreign language learning and the language learning process in general, and they also help in planning the teaching process. At present, the field is dominated by corpora of learners of English, but learner corpora are also being built for other languages, such as German, Portuguese and Russian. LaVA is now being built, and its data will be used to develop online assignments and self-tests.”

A language corpus is a structured collection of texts or speech transcripts intended for linguistic analysis and the development of language technologies. It contains authentic language material that reflects how the language is actually used. A language learner corpus contains systematic data on language learners: texts and/or transcribed audio recordings, usually with the learners' errors annotated.

The Latvian language learners' corpus is being built as part of the Fundamental and Applied Research project “Development of the Latvian language learners' corpus: methods, tools and use” (No. lzp-2018/1-0527).

LU MII AiLab is one of the most important research organisations in Latvia in computational linguistics and language technologies, fields in which it has worked for 28 years. The laboratory conducts research in various areas of automated natural language processing and machine learning, and develops machine-readable dictionaries (the most popular of which is Tēzaurs.lv) as well as machine-readable speech and text corpora (Korpuss.lv).

Information prepared by Kristīne Pokratniece, AiLab