Khalid Alnajjar ; Mika Hämäläinen - Normalization of Arabic Dialects into Modern Standard Arabic using BERT and GPT-2

jdmdh:13146 - Journal of Data Mining & Digital Humanities, April 29, 2024, NLP4DH - https://doi.org/10.46298/jdmdh.13146
Normalization of Arabic Dialects into Modern Standard Arabic using BERT and GPT-2

Authors: Alnajjar, Khalid (1); Hämäläinen, Mika (2)

  • 1 Rootroo Ltd
  • 2 Helsinki Metropolia University of Applied Sciences

We present an encoder-decoder model for the normalization of Arabic dialects that uses both BERT- and GPT-2-based models. Arabic is a language of many dialects that differ from Modern Standard Arabic (MSA) not only in pronunciation but also in morphology, grammar and lexical choice. This diversity can be troublesome even to a native Arabic speaker, let alone a computer. Several NLP tools work well for MSA and for some of the main dialects but fail to cover the Arabic language as a whole. Based on our manual evaluation, our model normalizes sentences entirely correctly 46% of the time and almost correctly 26% of the time.
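The manual-evaluation figures above are shares of human judgments, and the two quoted categories can be added to get the share of outputs that are at least almost correct. The following is a minimal sketch of such a tally, not the authors' evaluation code. The function name `label_shares`, the label strings, the sample size of 100 and the catch-all "other" category are all assumptions made for illustration.

```python
from collections import Counter


def label_shares(labels):
    """Return the percentage share of each manual-evaluation label."""
    total = len(labels)
    if total == 0:
        raise ValueError("no judgments to summarize")
    counts = Counter(labels)
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical judgments: 100 sentences, chosen only so that the shares
# match the percentages quoted in the abstract. This is not the authors' data,
# and "other" is an assumed catch-all for everything not quoted there.
judgments = ["correct"] * 46 + ["almost"] * 26 + ["other"] * 28

shares = label_shares(judgments)

# Share of outputs judged entirely or almost correct.
at_least_almost = shares["correct"] + shares["almost"]

print(shares)
print(at_least_almost)
```

Under these assumptions, the two quoted categories together account for 72% of the judged sentences.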


Volume: NLP4DH
Published on: April 29, 2024
Accepted on: April 9, 2024
Submitted on: February 28, 2024
Keywords: Arabic, Dialect, Normalization

Files

Name: Arabic_normalization_1_.pdf
Size: 208.35 KB
md5: 63424c53345dd95915e76f2c861b063c
