Since 1990, several teams of computer scientists have implemented the traditional model of Arabic morphology in systems of Natural Language Processing (NLP) without questioning its aims, assumptions, definitions, and purposes.
Language Resources and Evaluation
Restoring Arabic vowels through omission-tolerant dictionary lookup 2019 •
Vowels in Arabic are optional orthographic symbols written as diacritics above or below letters. In Arabic texts, typically more than 97 percent of written words do not explicitly show any of the vowels they contain; that is to say, depending on the author, genre and field, fewer than 3 percent of words include any explicit vowel. Although numerous studies have been published on the issue of restoring the omitted vowels in speech technologies, little attention has been given to this problem in papers dedicated to written Arabic technologies. In this research, we present Arabic-Unitex, an Arabic Language Resource, with emphasis on vowel representation and encoding. Specifically, we present two dozen rules formalizing a detailed description of vowel omission in written text. They are typographical rules integrated into large-coverage resources for morphological annotation. For restoring vowels, our resources are capable of identifying words in which the vowels are not shown, as well as words in which the vowels are partially or fully included. By taking these rules into account, our resources are able to compute and restore, for each word form, a list of compatible fully vowelized candidates through omission-tolerant dictionary lookup. In our previous studies, we proposed a straightforward encoding of the taxonomy for verbs (Neme, 2011) and broken plurals (Neme & Laporte, 2013). While traditional morphology is based on derivational rules, our description is based on inflectional ones. The breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into pattern-and-root, giving precedence to patterns over roots. The lexicon is built and updated manually and contains 76,000 fully vowelized lemmas. It is then inflected by means of finite-state transducers (FSTs), generating 6 million forms.
The coverage of these inflected forms is extended by formalized grammars, which accurately describe agglutinations around a core verb, noun, adjective or preposition. A laptop needs one minute to generate the 6 million inflected forms in a 340-Megabyte flat file, which is compressed in two minutes into 11 Megabytes for fast retrieval. Our program performs the analysis of 5,000 words/second for running text (20 pages/second). Based on these comprehensive linguistic resources, we created a spell checker that detects any invalid/misplaced vowel in a fully or partially vowelized form. Finally, our resources provide a lexical coverage of more than 99 percent of the words used in popular newspapers, and restore vowels in words (out of context) simply and efficiently.
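The idea of omission-tolerant lookup described in this abstract can be illustrated with a minimal Python sketch. It is not the Arabic-Unitex implementation: it assumes a single simplified rule (any diacritic may be omitted, but a diacritic that is written must agree with the lexicon entry), whereas the abstract mentions two dozen typographical rules. The function names and the three-entry lexicon are hypothetical.

```python
from collections import defaultdict

# Arabic diacritics U+064B..U+0652 (tanwin, short vowels, shadda, sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def segment(word):
    """Split a word into (base letter, set of diacritics on that letter) pairs."""
    units = []
    for ch in word:
        if ch in DIACRITICS and units:
            units[-1][1].add(ch)
        else:
            units.append([ch, set()])
    return [(base, frozenset(marks)) for base, marks in units]

def skeleton(word):
    """Return the word with all diacritics removed."""
    return "".join(base for base, _ in segment(word))

def build_index(lexicon):
    """Index fully vowelized forms by their unvowelized skeleton."""
    index = defaultdict(list)
    for form in lexicon:
        index[skeleton(form)].append(form)
    return index

def candidates(query, index):
    """Return the fully vowelized forms compatible with a bare or partially vowelized query.

    Simplifying assumption: a form is compatible when every diacritic written
    in the query also appears on the same letter of the form.
    """
    q = segment(query)
    result = []
    for form in index.get(skeleton(query), []):
        f = segment(form)
        if all(q_marks <= f_marks for (_, q_marks), (_, f_marks) in zip(q, f)):
            result.append(form)
    return result

# Hypothetical toy lexicon, written with escapes to keep the code points explicit.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kataba, "he wrote"
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"   # kutiba, "it was written"
KUTUBUN = "\u0643\u064F\u062A\u064F\u0628\u064C"  # kutubun, "books"
INDEX = build_index([KATABA, KUTIBA, KUTUBUN])
```

With this toy lexicon, the bare query `"\u0643\u062A\u0628"` returns all three forms, while adding a damma on the first letter narrows the list to `KUTIBA` and `KUTUBUN`. A query carrying a diacritic that no entry has in that position returns an empty list, which is the kind of signal a spell checker for misplaced vowels could build on.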
Language Sciences
Pattern-and-root inflectional morphology: the Arabic broken plural 2013 •
We present a substantially implemented model of description of the inflectional morphology of Arabic nouns, with special attention to the management of dictionaries and other language resources by Arabic-speaking linguists. Our model includes broken plurals (BPs), i.e. plurals formed by modifying the stem. It is based on the traditional notions of root and pattern of Semitic morphology. However, as compared to traditional Arabic morphology, it keeps the formal description of inflection separate from that of derivation and semantics. As in traditional Arabic dictionaries, the updatable dictionary is structured in lexical entries for lemmas, and the reference spelling is fully diacritized. In our model, morphological analysis of Arabic text is performed directly with a dictionary of words and without morphophonological rules. Our taxonomy for noun inflection is simple, orderly and detailed. We simplify the taxonomy of singular patterns by specifying vowel quantity as v or vv, and ignoring vowel quality. Root alternations and orthographical variations are encoded independently from patterns and in a factual way, without deep roots or morphophonological or orthographical rules. Nouns with a triliteral BP are classified according to 22 patterns subdivided into 90 classes, and nouns with a quadriliteral BP according to 3 patterns subdivided into 70 classes. These 160 classes become 300 inflectional classes when we take into account inflectional variations that affect only the singular. We provide a straightforward encoding scheme that we applied to 3 200 entries of BP nouns.
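The simplification of singular patterns to vowel quantity (v or vv) can be sketched as follows. This is only an illustration under stated assumptions: the `C` symbol for consonants, the function name `cv_pattern`, and the handling of shadda, sukun and tanwin (all skipped) are hypothetical choices, not the encoding scheme of the paper.

```python
# Short vowel -> the letter of prolongation that makes it long.
LONG_PARTNER = {
    "\u064E": "\u0627",  # fatha + alif
    "\u064F": "\u0648",  # damma + waw
    "\u0650": "\u064A",  # kasra + ya
}
MARKS = {chr(c) for c in range(0x064B, 0x0653)}  # all Arabic diacritics

def cv_pattern(word):
    """Map a fully vowelized word to a skeleton of C, v and vv.

    Vowel quality is ignored; only quantity is kept. A short vowel counts as
    long (vv) when it is followed by its letter of prolongation and that
    letter carries no diacritic of its own.
    """
    out = []
    i, n = 0, len(word)
    while i < n:
        ch = word[i]
        if ch in LONG_PARTNER:
            nxt = word[i + 1] if i + 1 < n else ""
            after = word[i + 2] if i + 2 < n else ""
            if nxt == LONG_PARTNER[ch] and after not in MARKS:
                out.append("vv")
                i += 2  # consume the vowel and its letter of prolongation
            else:
                out.append("v")
                i += 1
        elif ch in MARKS:
            i += 1  # assumption: shadda, sukun and tanwin are not represented
        else:
            out.append("C")
            i += 1
    return "".join(out)

KITAAB = "\u0643\u0650\u062A\u064E\u0627\u0628"  # kitaab, "book"
KABIIR = "\u0643\u064E\u0628\u0650\u064A\u0631"  # kabiir, "big"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kutub, "books"
```

Here `cv_pattern(KITAAB)` and `cv_pattern(KABIIR)` both give `"CvCvvC"` although their vowels differ in quality, while `cv_pattern(KUTUB)` gives `"CvCvC"`. Collapsing quality in this way is what lets words with different vowels share one singular pattern.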
Although the significance of morphological structure is established in visual word processing, its role in auditory processing remains unclear. Using magnetoencephalography, we probe the significance of the root morpheme for spoken Arabic words with two experimental manipulations. First, we compare a model of auditory processing that calculates probable lexical outcomes based on whole-word competitors with a model that considers only the root as relevant to lexical identification. Second, we assess violations of the root-specific Obligatory Contour Principle (OCP), which disallows root-initial consonant gemination. Our results show that root prediction correlates significantly with neural activity in superior temporal regions, independent of predictions based on whole-word competitors. Furthermore, words that violated the OCP constraint were significantly easier to dismiss as valid words than probability-matched counterparts. The findings suggest that lexical auditory processing is dependent upon morphological structure, and that the root forms a principal unit through which spoken words are recognised.
2013 •
(An older .doc version written in Arabic is also available.) Since 1997, the MS Arabic spell checker has been integrated by Coltec-Egypt into the MS-Office suite, and to this day many Arabic users find it worthless. In this study, we show why the MS spell checker fails to attract Arabic users. After spell-checking a document (10 pages, 3300 words in Arabic), the assessment procedure spots 78 false positive errors. They reveal the flaws of the lexical resource: an unsystematic lexical coverage of the feminine and the broken plural of nouns and adjectives, and an arbitrary coverage of verbs and nouns with prefixed or suffixed particles. This unsystematic and arbitrary lexical coverage points to the absence of a clear definition of a lexical entry and to an inadequate design of the related agglutination rules. Finally, this assessment reveals more generally the failure of scientific and technological policies regarding Arabic in big companies and in research institutions.
WoLeR 2011 at ESSLLI International Workshop on Lexical Resources – Ljubljana, Slovenia
A lexicon of Arabic verbs constructed on the basis of Semitic taxonomy and using finite-state transducers 2011 •
We describe a lexicon of Arabic verbs constructed on the basis of Semitic patterns and used in a resource-based method of morphological annotation of written Arabic text. The annotated output is a graph of morphemes with accurate linguistic information. An enhanced FST implementation for Semitic languages was created. This system is adapted also for generating inflected forms. The language resources can be easily updated. The lexicon is constituted of 15 400 verbal entries. We propose an inflectional taxonomy that increases the lexicon readability and maintainability for Arabic speakers and linguists. Traditional grammar defines inflectional verbal classes by using verbal pattern-classes and root-classes, related to the nature of each of the triliteral root-consonants. Verbal pattern-classes are clearly defined but root-classes are complex. In our taxonomy, traditional pattern-classes are reused and root-classes are simply redefined. Our taxonomy provides a straightforward encoding scheme for inflectional variations and orthographic adjustments due to assimilation and agglutination. We have tested and evaluated our resource against 10 000 diacriticized verb occurrences in the Nemlar corpus and compared it to Buckwalter resources. The lexical coverage is 99.9 % and a laptop needs two minutes in order to generate and compress the inflected lexicon of 2.5 million forms into 4 Megabytes.
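To make the idea of generating inflected forms from a pattern and a root concrete, here is a minimal Python sketch. It is not the FST implementation described in the abstract: it uses plain string substitution, Latin transliteration, and a hypothetical four-cell fragment of the perfective paradigm, and it only handles sound (regular) triliteral roots.

```python
def interdigitate(root, pattern):
    """Fill the digit slots 1..n of `pattern` with the consonants of `root`."""
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in pattern)

# Hypothetical fragment of the perfective active paradigm, in transliteration.
PERFECT_SUFFIXES = {"3ms": "a", "3fs": "at", "2ms": "ta", "1s": "tu"}

def inflect_perfect(root, stem_pattern="1a2a3"):
    """Generate a few perfective forms for a sound triliteral root.

    Assumption: no orthographic or assimilation adjustments are applied, so
    weak or geminate roots would need class-specific handling.
    """
    stem = interdigitate(root, stem_pattern)
    return {tag: stem + suffix for tag, suffix in PERFECT_SUFFIXES.items()}
```

For the root `"ktb"`, `inflect_perfect("ktb")` yields `kataba`, `katabat`, `katabta` and `katabtu`. Roots whose consonants trigger adjustments are exactly where an inflectional taxonomy of root-classes, such as the one the abstract proposes, would have to supply additional information.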
Communications in Computer and Information Science
Jabalín: A Comprehensive Computational Model of Modern Standard Arabic Verbal Morphology Based on Traditional Arabic Prosody 2013 •
Natural Language Processing (NLP) has gained significance in machine translation and in various other applications, such as speech synthesis and recognition, localization and multilingual information systems. Arabic Named Entity Recognition, Information Retrieval, Machine Translation and Sentiment Analysis are some of the Arabic tools that have proved valuable to intelligence and security organizations. NLP plays a key role in the processing stage of Sentiment Analysis, Information Extraction and Retrieval, Automatic Summarization and Question Answering, to name a few. Arabic is a Semitic language that differs from Indo-European languages phonetically, morphologically, syntactically and semantically. This paper discusses the challenges of NLP in Arabic. In addition, it encourages researchers in this field and others to take measures to handle the challenges of the Arabic language.
We describe a fully inflected lexicon of 2.5 million verbal forms generated by using finite-state transducers. The lexicon is constituted of 15 400 verbal entries or lemmas. The lexicon of Arabic verbs is constructed on the basis of Semitic patterns and used in a resource-based method of morphological annotation of written Arabic text. An enhanced FST implementation for Semitic languages was created. This system is adapted also for generating inflected forms. The language resources can be easily updated. We propose an inflectional taxonomy that increases the lexicon readability and maintainability for Arabic speakers and linguists. Traditional grammar defines inflectional verbal classes by using verbal pattern-classes and root-classes, related to the nature of each of the triliteral root-consonants. Verbal pattern-classes are clearly defined but root-classes are complex. In our taxonomy, traditional pattern-classes are reused and root-classes are simply redefined. Our taxonomy provides a straightforward encoding scheme for inflectional variations and orthographic adjustments due to assimilation and agglutination. We have tested and evaluated our resource against 10 000 diacriticized verb occurrences in the Nemlar corpus and compared it to Buckwalter resources. The lexical coverage is 99.9 %. A laptop needs two minutes in order to generate and compress the lexicon of 2.5 million forms into 4 Megabytes for fast retrieval. The analysis of a verb takes 0.5 milliseconds. A French abstract is included in the document.
2011 •
We provide lexical profiling for Arabic by covering two important linguistic aspects of Arabic lexical information, namely morphological inflectional paradigms and syntactic subcategorization frames, making our database a rich repository of Arabic lexicographic details. First, we provide a complete description of the inflectional behaviour of Arabic lemmas based on statistical distribution.
2013 •
Journal of Cognitive Neuroscience
Arabic Morphology in the Neural Language System 2010 •
LAP Lambert Academic Publishing
The Interaction Between Inflection and Derivation in English and MSA
Routledge Handbook on Arabic Linguistics
Arabic Political Discourse Analysis 2010 •
the proceedings of the International Conference on Information Technology and Natural Sciences, Amman/Jordan
Full automatic Arabic text tagging system 2003 •
Journal of Arts, King Saud University
TOWARDS A MORPHOLOGICAL THEORY: THE CASE OF ARABIC BROKEN AND SOUND PLURALS 2014 •
Artificial Intelligence Review
Hebrew computational linguistics: Past and future 2004 •
International Journal on Studies in English Language and Literature (IJSELL) Volume 6, Issue 10, October 2018, PP 41-52 ISSN 2347-3126 (Print) & ISSN 2347-3134 (Online) http://dx.doi.org/10.20431/2347-3134.0610005 www.arcjournals.org
(2018d) The Arabic Origins of "Sex Derivatives and Formally Similar Terms Six, Sack, Sake, Suck, Seek, Soak, Kiss, Case, Cozy" in English and European Languages: A Consonantal Radical Theory Approach 2018 •
2012 •
The international Arab journal of information technology
A rule-based extensible stemmer for information retrieval with application to Arabic 2006 •
Language Technology for Normalisation of Less- …
Compiling Apertium morphological dictionaries with HFST and using them in HFST applications 2012 •
Computer Speech and Language
SAMAR: Subjectivity and sentiment analysis for Arabic social media 2014 •