The talk will be hosted by Mr Kurt Micallef (RSO, University of Malta).
Large pre-trained language models have become a core component in many Natural Language Processing (NLP) tasks. BERT is one such model that has gained popularity, due to its state-of-the-art performance on a variety of downstream tasks and the relatively simple architecture needed to fine-tune it for a particular task. Although this model is specific to English, other variants of BERT have been released for other languages (e.g. CamemBERT for French, AraBERT for Arabic) as well as for multiple languages at once (e.g. multilingual BERT).
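As a rough illustration of this fine-tuning setup (my own sketch, not material from the talk), adding a task-specific classification head to a BERT-style checkpoint is typically only a few lines with the Hugging Face transformers library; the checkpoint name and example sentences below are placeholders:

```python
# Minimal sketch: fine-tuning a BERT-style model for sentence classification.
# The checkpoint, sentences, and labels are illustrative assumptions only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"  # multilingual BERT; any BERT-style checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenise a small batch and run a forward pass with labels to obtain a training loss.
batch = tokenizer(["An example positive sentence.", "An example negative sentence."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # in practice an optimiser (e.g. AdamW) would then update the weights
```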
In this talk, I will introduce newly developed language models for Maltese — BERTu and mBERTu. BERTu was trained on a new version of the Korpus Malti, containing approximately 466 million tokens (2.52GB), which is also part of this work.
We will go over the conceptual ideas behind how these models make use of such large corpora to learn language representations. I will present the state-of-the-art results that the new models obtain on various syntactic tagging benchmarks, as well as an evaluation on a sentiment analysis dataset. Furthermore, I will provide insights into how different pre-training data sizes and domains affect downstream performance.
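To give a flavour of how such representations are learned from raw text (again, my own sketch rather than the talk's material), BERT-style models are pre-trained with a masked-language-modelling objective: tokens are hidden at random and the model learns to predict them from the surrounding context. The checkpoint and sentence below are illustrative assumptions:

```python
# Minimal sketch of the masked-language-modelling objective used in pre-training:
# a token is hidden and the model predicts it from context.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-multilingual-cased"  # illustrative; any BERT-style checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = f"Valletta is the capital of {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the model's top prediction for the masked position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```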
This event will take place in a hybrid manner: if you wish to attend on campus, the venue is the ICT Boardroom; if you are unable to attend physically, the Zoom link below can be used to join the event.
Meeting ID: 983 0665 5581
Passcode: 914155