Do computer scientists deeply understand the traditional Arabic morphology? هل يفهم المهندسون الحاسوبيّون علم الصرف فهماً عميقاً؟

Eric Laporte; Alexis Neme

Since 1990, several teams of computer scientists have implemented the traditional model of Arabic morphology in systems of Natural Language Processing (NLP) without questioning its aims, assumptions, definitions, and purposes.


(What to keep and what to drop from this tradition?)

Alexis Amid Neme and Eric Laporte

Since 1990, several teams of computer scientists have implemented the traditional model of Arabic morphology in systems of Natural Language Processing (NLP) without questioning its aims, assumptions, definitions, and purposes. Early grammarians and lexicographers designed Arabic morphology and lexicography for human minds equipped with paper, whereas Arabic computational morphology should be designed for humans equipped with processors and memory devices. These two statements do not seem obvious to computer scientists.

The aim of the forerunners of grammar in the eighth century was to discover the features of the Arabic language, and they had political and religious incentives. These pioneers accumulated knowledge in semantics, syntax, morphology, phonology and lexicography, and produced inventories in order to standardise the language, which generated the massive grammatical output of that time. Teaching native and non-native speakers probably soon became a pressing goal due to geographical expansion. Language teaching has always focused on vocabulary, word meaning and text understanding.

As in other Semitic languages, Arabic morphology was built around the abstract notion of the root: three consonants representing a meaning, whether precise or vague. The traditional derivational morphology based on the root-and-pattern model has developed around this abstract consonantal root. In this model, each word is represented by the intersection of a root and a pattern, such as kitaAb = [ktb & 1i2aA3] (kitaAb, book, كتاب) [i]. A pattern is a discontinuous affix (or transfix), made of vowels and non-radical consonants inserted around slots for the root consonants.
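
The intersection of a root and a pattern can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `apply_pattern` is hypothetical, and the sketch assumes that each digit in a pattern such as 1i2aA3 marks the slot for the root consonant with that number, as in the kitaAb example above. It is not taken from any existing analyser.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill the numbered slots of a pattern with the consonants of a root.

    Assumption: a digit n in the pattern stands for the n-th root consonant;
    every other character is copied unchanged (vowels, non-radical consonants).
    """
    out = []
    for ch in pattern:
        if ch.isdigit():
            slot = int(ch)
            if not 1 <= slot <= len(root):
                raise ValueError(f"pattern slot {slot} has no consonant in root {root!r}")
            out.append(root[slot - 1])
        else:
            out.append(ch)
    return "".join(out)


# The example from the text: [ktb & 1i2aA3]
print(apply_pattern("ktb", "1i2aA3"))  # kitaAb
```

The sketch manipulates only sequences of consonants and vowels; it attaches no meaning to either the root or the pattern.
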
To each pattern, traditional grammar associates a morphological category, inflectional features, and/or semantic features such as agent (kaAtib, writer, كاتب), patient (makotuwb, letter, مكتوب), instrument (makotab, office table, مكتب), place (makotabap, library, مكتبة), colour (Oaswad, black, أسود), etc., but none of these is safely predictable.

Each subfield of traditional grammar has a different status:

- The most prestigious is syntax, with its focus on syntactic functions such as subject, object or indirect object, in order to determine the correct case suffixes;
- Second comes derivational morphology, with the root-and-pattern model, where the root provides a general meaning and the pattern gives the word a part of speech together with functional and semantic features;
- Inflectional morphology is the least prestigious subfield, since it deals with variations in form, not in meaning.

But what should computational morphology keep or drop from the traditional model? Prestige is not the point. What matters is the goal of computational morphology, which is to formalise and manage forms, not meaning. Word derivation may perfectly well remain out of the scope of computational morphology, at least in its first phase of development. When systems include a partial implementation of word derivation, this adds an unnecessary level of complexity. The first goal of Arabic computational morphology should be inflectional morphology and the production of accurate inflectional resources, as it is for French or English.

Only reliable information can be used in computational morphology. The pattern and root concepts in the model should be reduced to reliable phonological and orthographical representations: sequences of consonants and vowels. The semantic and syntactic information traditionally attached to roots and patterns is not predictable. A pattern should be a sequence of consonants and vowels occurring together around root consonant slots.
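
A pattern reduced to a sequence of consonants and vowels around slots can be treated as a plain template and matched against a word form. The following Python sketch is illustrative only. The helper names `pattern_to_regex` and `extract_root` are hypothetical, and the pattern strings in the usage lines (1aA2i3, ma1o2uw3, ma1o2a3, ma1o2a3ap) are assumptions written in the notation of the kitaAb example, not patterns quoted from a published resource. The sketch also assumes that the symbols a, u, i, o, F, K and G of the transliteration note are diacritics and therefore cannot fill a root slot.

```python
import re
from typing import Optional

# Assumption: these transliteration symbols are diacritics, not consonants.
DIACRITICS = "auioFKG"


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a pattern such as 'ma1o2uw3' into a regex with one group per slot."""
    parts = []
    seen = set()
    for ch in pattern:
        if ch.isdigit():
            name = f"c{ch}"
            if name in seen:
                # A repeated slot must be filled by the same consonant.
                parts.append(f"(?P={name})")
            else:
                seen.add(name)
                parts.append(f"(?P<{name}>[^{DIACRITICS}])")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def extract_root(word: str, pattern: str) -> Optional[str]:
    """Return the slot consonants if the word fits the pattern, else None."""
    match = pattern_to_regex(pattern).fullmatch(word)
    if match is None:
        return None
    groups = match.groupdict()
    return "".join(groups[name] for name in sorted(groups))


# Word forms from the text, matched against assumed pattern strings.
for word, pattern in [
    ("kaAtib", "1aA2i3"),
    ("makotuwb", "ma1o2uw3"),
    ("makotab", "ma1o2a3"),
    ("makotabap", "ma1o2a3ap"),
]:
    print(word, pattern, extract_root(word, pattern))
```

Under these assumptions each of the four forms yields the consonant sequence ktb, and the result is obtained from the form alone, without any appeal to the agent, patient, instrument or place labels.
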
A root should be a sequence of consonants. At that level, word derivation and semantics do not fit into a reliable formal account of forms. Thus, computational morphology should formalise only inflection: yes, the least prestigious part of morphology.

Arabic morphological analysers designed by computer scientists often include in their formalisation a partial description of word formation and semantics, taken directly from grammatical tradition. By doing so, computer scientists probably hope that the output of their systems will be more comprehensive and more useful for the next steps in an NLP pipeline. But such additional information is too incomplete and disorderly for use in information technology. And these scholars miss the first goal of computational morphology, i.e. a formal, clean, updatable, and accurate account of form variations.

The notion of combining roots with patterns, which has been tested for over twelve centuries, is the backbone of Semitic morphology; it is directly applicable to information technology and should be kept in computational morphology. Moreover, it works equally well with derived words and when 'roots' have four consonants or even more: as far as inflection is concerned, the broken plural of minodiyl-manaAdiyl (napkin-napkins, منديل-مناديل) is well described by [mndl & 1a2aa3ii4], with the same plural pattern as Eunoquwd-EanaAqiyd (cluster-clusters, عنقود-عناقيد). As all regularities lie in patterns, not in rules, the key to Arabic computational morphology is to assign patterns to words first and to determine their roots as a consequence, thus reversing the traditional root-and-pattern precedence in favour of the pattern-and-root model.

With rare exceptions, computer scientists parrot the concepts of traditional morphology in papers and books, unable to question this tradition since they do not master it.

Back to our opening question: do computer scientists deeply understand Arabic morphology?
- No, their interest in morphology is superficial, as they are too busy with algorithms.

Previous posting (Part 1): Why do computer scientists fail to produce an accurate Arabic lexical resource?
Next posting (Part 3, last part): Why will computer scientists continue failing to produce an accurate Arabic lexical resource? And what to do?

References

Neme, Alexis & Laporte, Éric (2013). Pattern-and-root inflectional morphology: the Arabic broken plural. Language Sciences.
Neme, Alexis (2011). A lexicon of Arabic verbs constructed on the basis of Semitic taxonomy and using finite-state transducers. In Proceedings of the International Workshop on Lexical Resources (WoLeR) at ESSLLI.
Neme, Alexis (2014). Why Microsoft Arabic Spell checker is ineffective.

See also Arabic Verb Conjugation (Tasrif), a prototype website.

[i] In this transliteration, upper-case and lower-case letters, e.g. E and e, denote distinct, independent consonants, and "o" denotes the zero vowel or sukuun: ء, c; آ, C; أ, O; ؤ, W; إ, I; ئ, e; ا, A; ب, b; ة, p; ت, t; ث, v; ج, j; ح, H; خ, x; د, d; ذ, J; ر, r; ز, z; س, s; ش, M; ص, S; ض, D; ط, T; ظ, Z; ع, E; غ, g; ف, f; ق, q; ك, k; ل, l; م, m; ن, n; ه, h; و, w; ى, Y; ي, y; ـً, F; ـٍ, K; ـَ, a; ـُ, u; ـِ, i; ـّ, G; ـْ, o.
