Do computer scientists deeply understand the traditional Arabic morphology? (What to keep and what to drop from this tradition?)
Alexis Amid Neme and Eric Laporte
هل يفهم المهندسون الحاسوبيّون علم الصرف فهمًا عميقًا؟

Since 1990, several teams of computer scientists have implemented the traditional model of Arabic morphology in Natural Language Processing (NLP) systems without questioning its aims, assumptions, definitions, and purposes. Early grammarians and lexicographers designed Arabic morphology and lexicography for human minds equipped with paper, whereas we should design Arabic computational morphology for humans equipped with processors and memory devices. These two statements do not seem obvious to computer scientists.

The aim of the forerunners of grammar in the eighth century was to discover the features of the Arabic language, and they had political and religious incentives. These pioneers accumulated knowledge in semantics, syntax, morphology, phonology and lexicography, and produced inventories in order to standardise the language, which generated the massive grammatical production of that time. Teaching native and non-native speakers probably soon became a pressing goal due to geographical expansion. Language teaching has always focused on vocabulary, word meaning and text understanding.

As in other Semitic languages, Arabic morphology was established around the abstract notion of root: three consonants representing a meaning, whether precise or vague. Traditional derivational morphology, based on the root-and-pattern model, developed around this abstract consonantal root. In this model, each word is represented by the intersection of a root and a pattern, such as kitaAb = [ktb & 1i2aA3] (kitaAb, book, كتاب) [i]. A pattern is a discontinuous affix (or transfix), made of vowels and non-radical consonants inserted around slots for the root consonants.
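The intersection of a root and a pattern can be sketched in a few lines of Python. This is only an illustration, assuming the notation used in this post, where each digit in a pattern marks the slot of the corresponding root consonant; the function name apply_pattern is hypothetical and does not come from any existing system.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill the numbered slots of a pattern with the consonants of a root.

    Assumption: slots are single digits 1-9, as in [ktb & 1i2aA3];
    every other character of the pattern is copied unchanged.
    """
    output = []
    for ch in pattern:
        if ch in "123456789":
            index = int(ch) - 1
            if index >= len(root):
                raise ValueError(
                    f"pattern {pattern!r} needs consonant {ch}, "
                    f"but root {root!r} has only {len(root)}"
                )
            output.append(root[index])
        else:
            output.append(ch)
    return "".join(output)


print(apply_pattern("ktb", "1i2aA3"))  # kitaAb
```

Nothing in this operation refers to meaning: it manipulates sequences of consonants and vowels only.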
To each pattern, traditional grammar associates a morphological category and/or inflectional features, and/or semantic features such as agent (kaAtib, writer, كاتب), patient (makotuwb, letter, مكتوب), instrument (makotab, office table, مكتب), place (makotabap, library, مكتبة), colour (Oaswad, black, أسود), etc., but none is safely predictable.

Each subfield of traditional grammar has a different status:
- The most prestigious is syntax, with a focus on syntactic functions such as subject, object or indirect object, in order to determine the correct case suffixes;
- Second comes derivational morphology, with the root-and-pattern model, where the root provides a general meaning and the pattern gives the word a part of speech and, simultaneously, functional and semantic features;
- Inflectional morphology is the least prestigious subfield, since it deals with form variations, not meaning variations.

But what should computational morphology keep or drop from the traditional model? Prestige is not the point. What matters is the goal of computational morphology, which is to formalise and manage forms, not meaning. Word derivation may well remain outside the scope of computational morphology, at least in its first phase of development. When systems include a partial implementation of word derivation, this adds an unnecessary level of complexity. The first goal of Arabic computational morphology should be inflectional morphology and the production of accurate inflectional resources, as it is for French or English.

Only reliable information can be used in computational morphology. The pattern and root concepts in the model should be reduced to reliable phonological and orthographical representations: sequences of consonants and vowels. The semantic and syntactic information traditionally attached to roots and patterns is not predictable. A pattern should be a sequence of consonants and vowels occurring together around root consonant slots.
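A pattern reduced to a sequence of consonants and vowels around root consonant slots can be checked against a word mechanically. The Python sketch below assumes the digit-slot notation of this post and treats the short-vowel symbols a, i, u and the sukuun symbol o as the only characters that cannot fill a slot. The function names and the patterns 1aA2i3 and ma1o2uw3 are illustrative assumptions inferred from the transliterations of kaAtib and makotuwb, not entries from an actual lexicon.

```python
import re
from typing import Optional

# Assumption: these transliteration symbols are vowel marks, never root consonants.
VOWEL_MARKS = "aiuo"


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Compile a digit-slot pattern such as '1aA2i3' into a regular expression."""
    parts = []
    seen = set()
    for ch in pattern:
        if ch in "123456789":
            if ch in seen:
                # A repeated slot must match the same consonant again.
                parts.append(f"(?P=c{ch})")
            else:
                seen.add(ch)
                parts.append(f"(?P<c{ch}>[^{VOWEL_MARKS}])")
        else:
            # Every non-slot character belongs to the pattern and must match literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def extract_root(word: str, pattern: str) -> Optional[str]:
    """Return the root consonants if the whole word fits the pattern, else None."""
    match = pattern_to_regex(pattern).fullmatch(word)
    if match is None:
        return None
    slots = sorted(match.groupdict().items(), key=lambda item: int(item[0][1:]))
    return "".join(consonant for _, consonant in slots)


print(extract_root("kaAtib", "1aA2i3"))      # ktb
print(extract_root("makotuwb", "ma1o2uw3"))  # ktb
print(extract_root("kitaAb", "1aA2i3"))      # None
```

The check is purely formal: a word either fits the sequence of pattern characters or it does not, and no category or semantic feature is consulted.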
A root should be a sequence of consonants. At that level, word derivation and semantics do not fit in a reliable formal account of forms. Thus, computational morphology should formalise only inflection: yes, the least prestigious part of morphology.

Arabic morphological analysers designed by computer scientists often include in their formalisation a partial description of word formation and semantics, taken directly from grammatical tradition. By doing so, computer scientists probably hope that the output of their systems will be more comprehensive and more useful for the next steps in an NLP pipeline. But such additional information is too incomplete and disorderly for use in information technology. And these scholars miss the first goal of computational morphology, i.e. a formal, clean, updatable, and accurate account of form variations.

The notion of combining roots with patterns, which has been tested for over twelve centuries, is the backbone of Semitic morphology; it is directly applicable to information technology and should be kept in computational morphology. Moreover, it works equally well with derived words and when 'roots' have four consonants or more: as far as inflection is concerned, the broken plural of minodiyl-manaAdiyl (napkin-napkins, منديل مناديل) is well described by [mndl & 1a2aA3iy4], with the same plural pattern as Eunoquwd-EanaAqiyd (cluster-clusters, عنقود عناقيد). As all regularities lie in patterns, not in rules, the key to Arabic computational morphology is to assign patterns to words first and determine their roots as a consequence, thus reversing the traditional root-and-pattern precedence in favour of the pattern-and-root model.

With rare exceptions, computer scientists parrot the concepts of traditional morphology in papers and books, unable to question this tradition since they have not mastered it. Back to our opening question: do computer scientists deeply understand Arabic morphology?
- No, their interest in morphology is superficial, as they are too busy with algorithms.

Previous posting (Part 1): Why do computer scientists fail to produce an accurate Arabic lexical resource?
Next posting (Part 3, last part): Why will computer scientists continue failing to produce an accurate Arabic lexical resource? And what to do?

References
Neme, Alexis and Laporte, Éric (2013). Pattern-and-root inflectional morphology: the Arabic broken plural. Language Sciences.
Neme, Alexis (2011). A lexicon of Arabic verbs constructed on the basis of Semitic taxonomy and using finite-state transducers. In Proceedings of the International Workshop on Lexical Resources (WoLeR) at ESSLLI.
Neme, Alexis (2014). Why Microsoft Arabic Spell checker is ineffective.
See also Arabic Verb Conjugation (Tasrif), a prototype website.

[i] In this transliteration, upper-case and lower-case letters, e.g. E and e, denote distinct, independent consonants, and "o" denotes the zero vowel (sukuun): ء, c; آ, C; أ, O; ؤ, W; إ, I; ئ, e; ا, A; ب, b; ة, p; ت, t; ث, v; ج, j; ح, H; خ, x; د, d; ذ, J; ر, r; ز, z; س, s; ش, M; ص, S; ض, D; ط, T; ظ, Z; ع, E; غ, g; ف, f; ق, q; ك, k; ل, l; م, m; ن, n; ه, h; و, w; ى, Y; ي, y; ـً, F; ـٍ, K; ـَ, a; ـُ, u; ـِ, i; ـّ, G; ـْ, o.