Khalid Alnajjar ; Mika Hämäläinen - Normalization of Arabic Dialects into Modern Standard Arabic using BERT and GPT-2

jdmdh:13146 - Journal of Data Mining & Digital Humanities, April 29, 2024, NLP4DH - https://doi.org/10.46298/jdmdh.13146
Normalization of Arabic Dialects into Modern Standard Arabic using BERT and GPT-2

Authors: Alnajjar, Khalid 1; Hämäläinen, Mika 2

  • 1 Rootroo Ltd
  • 2 Helsinki Metropolia University of Applied Sciences

We present an encoder-decoder model for normalizing Arabic dialects into Modern Standard Arabic (MSA), using both BERT- and GPT-2-based models. Arabic has many dialects that differ from MSA not only in pronunciation but also in morphology, grammar, and lexical choice. This diversity can be troublesome even for a native Arabic speaker, let alone a computer. Several NLP tools work well for MSA and for some of the main dialects but fail to cover the Arabic language as a whole. In our manual evaluation, the model normalizes sentences entirely correctly 46% of the time and almost correctly 26% of the time.


Volume: NLP4DH
Published on: April 29, 2024
Accepted on: April 9, 2024
Submitted on: February 28, 2024
Keywords: Arabic, Dialect, Normalization

Files

Name: Arabic_normalization_1_.pdf
md5: 63424c53345dd95915e76f2c861b063c
Size: 208.35 KB
