Normalization of Arabic Dialects into Modern Standard Arabic using BERT and GPT-2

Alnajjar, Khalid; Hämäläinen, Mika

doi:10.46298/jdmdh.13146

jdmdh:13146 - Journal of Data Mining & Digital Humanities, April 29, 2024, NLP4DH - https://doi.org/10.46298/jdmdh.13146

Authors: Alnajjar, Khalid ¹; Hämäläinen, Mika ²

1 Rootroo Ltd
2 Helsinki Metropolia University of Applied Sciences

We present an encoder-decoder model for normalizing Arabic dialects that uses both BERT- and GPT-2-based models. Arabic is a language of many dialects that differ from Modern Standard Arabic (MSA) not only in pronunciation but also in morphology, grammar and lexical choice. This diversity can be troublesome even for a native Arabic speaker, let alone a computer. Several NLP tools work well for MSA and for some of the main dialects but fail to cover the Arabic language as a whole. Based on our manual evaluation, our model normalizes sentences entirely correctly 46% of the time and almost correctly 26% of the time.

Source: zenodo.org:10998947

Volume: NLP4DH

Published on: April 29, 2024

Accepted on: April 9, 2024

Submitted on: February 28, 2024

Keywords: Arabic, Dialect, Normalization

License: Attribution 4.0 International (CC BY 4.0)

Files

Name	Size
Arabic_normalization_1_.pdf (md5: 63424c53345dd95915e76f2c861b063c)	208.35 KB
