A new range of ideas to enhance language tools

Published on June 15, 2023

Using AI to automatically subtitle a video in Luxembourgish, assess your pronunciation of Luxembourgish terms or extend the functions of lod.lu: here are some ideas that were prototyped during the 2023 edition of the Open Data Hackathon. Watch our video summary.


The Open Data Hackathon x '{"lang": "lb"}' took place on Thursday 8 and Friday 9 June. Twenty-five developers, UX designers, project managers and data scientists came together at the GovTech Lab to develop innovative tools.

Co-creation and multidisciplinarity were at the heart of two intense days of fruitful exchanges, far from any competitive spirit. Six projects were presented and discussed. What they have in common is that they all draw on the linguistic database of the Zenter fir d'Lëtzebuerger Sprooch (ZLS), including the Luxembourgish dictionary LOD and the data collected by the schreifmaschinn.lu project. Here is a summary of each team's work, presented in a short video. Has a project won you over? Would you like to take it further? You can find a selection of re-uses at the bottom of this article.

AI to subtitle a video in Luxembourgish

The idea is to offer automatic subtitling, as YouTube does today, but for content in Luxembourgish. To achieve this, the team developed a Python program that accepts an audio file, or extracts the audio track from a video file, and submits it to the schreifmaschinn API, which delivers a first transcription.

This transcription is then sent to ChatGPT, which translates it into other languages. The texts are divided into three-second blocks, a sequencing adapted to subtitling needs. ChatGPT delivers subtitles in a ready-to-use srt file. The video, enhanced with subtitles in four languages, is then automatically published on YouTube.
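The three-second sequencing and the srt output can be illustrated with a minimal sketch. It is not the team's program: in the prototype, ChatGPT delivers the srt file, whereas here the file is built locally. The input shape, a list of (start, end, text) tuples, and all function names are assumptions made for the illustration; the actual response format of the schreifmaschinn API is not described in this article.

```python
from typing import List, Tuple

# Assumed input shape: (start_seconds, end_seconds, text) for each transcribed word.
Word = Tuple[float, float, str]


def format_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def group_into_blocks(words: List[Word], block_seconds: float = 3.0) -> List[Word]:
    """Group timed words into consecutive blocks spanning at most block_seconds."""
    blocks: List[Word] = []
    current: List[Word] = []
    for word in words:
        # Close the current block if adding this word would make it too long.
        if current and word[1] - current[0][0] > block_seconds:
            blocks.append((current[0][0], current[-1][1], " ".join(w[2] for w in current)))
            current = []
        current.append(word)
    if current:
        blocks.append((current[0][0], current[-1][1], " ".join(w[2] for w in current)))
    return blocks


def to_srt(blocks: List[Word]) -> str:
    """Render blocks as SRT entries: index, time range, text, blank line."""
    entries = []
    for index, (start, end, text) in enumerate(blocks, start=1):
        entries.append(f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n")
    return "\n".join(entries)
```

Calling to_srt(group_into_blocks(words)) on a timed transcription yields text that can be saved with the .srt extension. In a pipeline like the one described, the text of each block would be translated before rendering, while the timings stay unchanged.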

ChatGPT offers an acceptable translation without any knowledge of Luxembourgish. It is based on similarities, depending on the pivot language used. This approach requires manual verification. There is room for improvement, but the idea is very promising.

AI to assess your pronunciation of Luxembourgish

The principle here is to display a word chosen at random from those available in the LOD dictionary. Using its API, the page presents the word, its phonetics and its French and English translations. A button invites you to record your voice. The audio is converted into phonetics, which are then converted into syllables. The phonetics of the recorded word are displayed to show whether the comparison succeeded or, if not, which phonemes differed.
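The comparison step can be sketched as follows. This is one possible approach, not the team's code: it assumes that both the expected and the recorded pronunciation are available as lists of phoneme symbols, and it uses Python's standard difflib module to locate the differences. The function name and the phoneme sequences used to try it out are purely illustrative.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def compare_phonemes(
    expected: List[str], recorded: List[str]
) -> List[Tuple[str, List[str], List[str]]]:
    """Return the spans where the recorded phonemes differ from the expected ones.

    Each item is (operation, expected_span, recorded_span), where operation is
    "replace", "delete" or "insert". An empty list means the sequences match.
    """
    matcher = SequenceMatcher(a=expected, b=recorded, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, expected[i1:i2], recorded[j1:j2]))
    return differences
```

An empty result would correspond to a successful attempt; otherwise each item points to the phonemes that could be highlighted for the learner.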

This application is designed to be both playful and educational. It still needs to be perfected, but it has great potential, which could eventually appeal to teaching teams.

The "Categories" word game, Luxembourg-style

Still in the edutainment field, a third team was inspired by the "Categories" or "Petit bac" game, which consists of finding, within a limited time, a series of words belonging to a predefined category (an animal, a profession, the name of a town, etc.), all of them beginning with the same letter. The lod.lu API provides a category for each word. Several clues are offered to the user who 'dries up': a translation of a possible word in English, a synonym (whose first letter is not the one sought), a sentence in which the hidden word appears in context, an image retrieved from an online service to which the system submits the English translation of the word, and finally the audio of the word to be found. The system weights the score according to the clues used.
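How the score might be weighted can be shown with a short sketch. The penalty attached to each clue below is a hypothetical value chosen for the illustration; the article does not give the weights used by the team.

```python
from typing import Iterable

# Hypothetical penalties, in percent of the base score, for each clue type.
CLUE_PENALTIES = {
    "translation": 30,
    "synonym": 20,
    "context": 20,
    "image": 15,
    "audio": 40,
}


def weighted_score(base_points: int, clues_used: Iterable[str]) -> int:
    """Reduce the base score by the penalty of every distinct clue used."""
    penalty = sum(CLUE_PENALTIES[clue] for clue in set(clues_used))
    # The score never drops below zero, even if all clues were used.
    return base_points * max(0, 100 - penalty) // 100
```

With these assumed values, a word found without help keeps its full score, while a word found after listening to the audio is worth 60% of it.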

The development of this encouraging prototype would make sense alongside existing apps aimed at learning Luxembourgish.

Pages peppered with Luxembourgish words

In the same educational vein, but this time in a more invasive way, a team looked at improving the LëtzRead browser extension. It automatically replaces certain words on an English-language web page with their Luxembourgish equivalents.

The words to be replaced are selected according to their complexity. The underlying idea is to translate words that are considered difficult. Several parameters are used to determine this complexity, taking into account the length of the word, the number of times the translation of a term has been requested, the multiple meanings of the same term, its inclusion in the basic vocabulary or the similarity of a Luxembourgish word with its equivalent in the reader's language.
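A complexity score combining these parameters could look like the sketch below. Everything in it is an assumption: the equal weights, the normalisation constants, the direction in which each parameter counts and the use of difflib to measure similarity are illustrative choices, not the extension's actual formula.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class WordStats:
    word: str                  # Luxembourgish word
    equivalent: str            # equivalent in the reader's language
    lookup_count: int          # times a translation of the term was requested
    sense_count: int           # number of meanings of the term
    in_basic_vocabulary: bool  # whether the term belongs to the basic vocabulary


def complexity(stats: WordStats, max_lookups: int = 1000) -> float:
    """Return a score between 0 (easy) and 1 (difficult); all weights are assumed."""
    length_score = min(len(stats.word) / 15, 1.0)
    lookup_score = min(stats.lookup_count / max_lookups, 1.0)
    sense_score = min(max(stats.sense_count - 1, 0) / 4, 1.0)
    basic_score = 0.0 if stats.in_basic_vocabulary else 1.0
    # Words that look like their equivalent are assumed to be easier to guess.
    similarity = SequenceMatcher(None, stats.word.lower(), stats.equivalent.lower()).ratio()
    dissimilarity_score = 1.0 - similarity
    scores = (length_score, lookup_score, sense_score, basic_score, dissimilarity_score)
    return sum(scores) / len(scores)
```

Under these assumptions, a short basic word that resembles its English equivalent, such as "Haus" and "house", scores lower than a longer, frequently looked-up word such as "Päiperlek" (butterfly). A difficulty threshold set by the user would then decide which words get replaced.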

During the hackathon, developments were made to enhance the information provided in a tooltip, with the pronunciation (phonetic and audio) and the category of the word (verb, noun, etc.). We also tested the reverse functionality, replacing certain words on a page in Luxembourgish with their English equivalents. A customisation window lets you adjust the expected level of difficulty. We therefore look forward to the release of this update.

A new feature for lod.lu: searching for antonyms

While the online dictionary can look up synonyms, nothing currently exists to find the antonyms of a term. That is the challenge taken up by a duo of developers - and it is not easy at first sight: you have to go and find the data elsewhere, in this case on WordNet, a lexical database for English. The English translation, extracted from lod.lu, is submitted to WordNet, which provides the antonym. The antonym is then sent to lod.lu for translation into Luxembourgish. Unfortunately, this does not work in all cases, as the LOD database is currently limited to 30,000 words. A total of 3,110 antonyms were found. As with other projects involving AI, human checks need to be carried out to ensure the relevance of the proposed antonym. Inappropriate homonyms, for example, can appear in the proposals.
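The round trip described above can be sketched with two small lookup tables standing in for the real services. The dictionaries are toy data assumed for the illustration, and the function name is hypothetical; an actual implementation would query lod.lu and WordNet instead.

```python
from typing import Dict, List

# Toy stand-in for lod.lu: Luxembourgish word -> English translation.
LB_TO_EN: Dict[str, str] = {
    "grouss": "big",
    "kleng": "small",
    "waarm": "warm",
    "kal": "cold",
    "séier": "fast",
}

# Toy stand-in for WordNet: English word -> English antonyms.
EN_ANTONYMS: Dict[str, List[str]] = {
    "big": ["small"],
    "small": ["big"],
    "warm": ["cold"],
    "cold": ["warm"],
    "fast": ["slow"],
}


def find_antonyms(word_lb: str) -> List[str]:
    """Luxembourgish -> English -> English antonym -> Luxembourgish."""
    english = LB_TO_EN.get(word_lb)
    if english is None:
        return []
    en_to_lb = {en: lb for lb, en in LB_TO_EN.items()}
    antonyms = []
    for antonym_en in EN_ANTONYMS.get(english, []):
        antonym_lb = en_to_lb.get(antonym_en)
        # The antonym is lost when the dictionary has no entry for it.
        if antonym_lb is not None:
            antonyms.append(antonym_lb)
    return antonyms
```

Here find_antonyms("grouss") returns ["kleng"], whereas find_antonyms("séier") returns an empty list because "slow" has no entry in the toy dictionary, which mirrors the coverage limit mentioned above.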

This innovative method, based on the "data augmentation" process, is a pertinent proposal that has the merit of raising the question of the paths to be taken to offer lod.lu users this functionality in the future.

In search of the lemma to facilitate concordances

Essential in the context of natural language processing applications, stemming makes it possible to find the root form common to all the grammatical forms of a word. This root does not always correspond to a real word. For example, stemming "historical" gives "histori", and "studies" gives "studi". Depending on the case, we may prefer to recover the lemma, which is the uninflected base form of a word. For instance, "walked", "walks" and "walking" are inflections of a common lemma, "walk". Along the same lines, the word "better" has "good" as its lemma.

The advantage of this approach is that it limits the number of different words with relatively similar meanings, for instance in the context of semantic search. One of the practical examples presented at the hackathon consists of evaluating the similarity between job offers or classifying the skills required in a job offer, using natural language processing.
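The effect of lemmatisation on similarity can be shown with a minimal sketch. The team worked in R with LOD data; the Python version below only illustrates the principle, using a hand-written lemma table based on the English examples above and a simple Jaccard similarity. Both choices are assumptions made for the illustration.

```python
from typing import Dict, List

# Hand-written lemma table; a real one would be derived from dictionary data.
LEMMAS: Dict[str, str] = {
    "walked": "walk",
    "walks": "walk",
    "walking": "walk",
    "better": "good",
}


def lemmatise(tokens: List[str]) -> List[str]:
    """Replace each token by its lemma when known, otherwise keep it lowercased."""
    return [LEMMAS.get(token.lower(), token.lower()) for token in tokens]


def jaccard(a: List[str], b: List[str]) -> float:
    """Share of distinct words that two token lists have in common."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Comparing "she walks dogs" with "he walked dogs" gives a similarity of 0.2 on the raw words and 0.5 once both are lemmatised, since "walks" and "walked" are then counted as the same word.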

The R language and LOD data were used to extract lemmas from Luxembourgish content. The methodology could help the development of search engines suited to the Luxembourgish language.

A double goal for the Zenter fir d'Lëtzebuerger Sprooch

The ZLS, represented by Alexandre Ecker and Sven Collette, accompanied the various teams during these two active days. The intention was twofold. The first aim was to introduce the possibilities and latest developments of the lod.lu and schreifmaschinn.lu websites. The second was to understand what specific expectations there might be in terms of development, what new sources the lod.lu API could offer and, last but not least, which of the reuses proposed during these two days could one day be transformed into real products or additions to the platforms already online.

This is one of the goals of Open Data Hackathons: to ensure that the innovation born out of these development sessions in the GovTech Lab can be pursued and developed into a mainstream product.

With this session just over, the question of the 2024 hackathon arises: 2022 focused on the theme of housing, 2023 on language. What theme would you like to see us working on in 2024? Feel free to tell us what you would be interested in.
