Designing And Developing Stemmer For Ge'ez Language Text: A Hybrid Approach

Solomon Nigatu; Solomon Teferra; Teklu Urgessa

dc.contributor.author	Solomon Nigatu
dc.contributor.author	Solomon Teferra
dc.contributor.author	Teklu Urgessa
dc.date.accessioned	2023-06-07T08:39:17Z
dc.date.available	2023-06-07T08:39:17Z
dc.date.issued	2021-11
dc.identifier.uri	https://repository.ju.edu.et//handle/123456789/8164
dc.description.abstract	Stemming is widely used in information retrieval, and many researchers have demonstrated that it improves the performance of information retrieval systems. Stemming reduces the inflectional and derivational variants of a word to a common stem or root form. It is useful for improving retrieval efficiency, particularly in text search, and for resolving mismatch issues. The aim of this study was to design and develop a hybrid stemmer for Ge'ez language text. Two approaches were used: affix removal and a character n-gram technique. The proposed method can remove prefixes, infixes, suffixes, and their combinations. Rules were compiled individually for each affix type, and exception and recoding rules were integrated based on the nature of Ge'ez morphology. The corpus was prepared manually from readily available sources such as textbooks, magazines, and the Bible, and contains 13,221 word tokens, of which 20% was used for testing. The stemmer was evaluated by manual error counting in two stages: first the affix removal version was evaluated on a test set of 2,644 words, and then the hybrid version was evaluated on the same test set. The affix removal version achieved an accuracy of 92.32% (7.68% error rate), and the hybrid version achieved an accuracy of 94.5% (5.5% error rate), an improvement of 2.18 percentage points. Over-stemming and under-stemming errors were observed in both versions: 4.5% over-stemming and 3.18% under-stemming for the affix removal version, and 2.2% over-stemming and 3.3% under-stemming for the hybrid version. Overall, the proposed hybrid stemmer outperformed the previous rule-based and longest-match stemmers by 12.26% and 8.28% in accuracy, with error rates reduced by 12.08% and 7.28%, respectively. This is attributed to the exception and recoding rules incorporated from a detailed study of the language. We conclude that the proposed hybrid stemmer gives encouraging results and may be useful as a pre-processing module in further research.	en_US
dc.language.iso	en_US	en_US
dc.subject	Geez Stemmer, Information Retrieval, N-Gram, Hybrid Stemmer, Natural Language Processing, Conflation.	en_US
dc.title	Designing And Developing Stemmer For Ge'ez Language Text: A Hybrid Approach	en_US
dc.type	Thesis	en_US
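
Note on the character n-gram technique named in the abstract: the record does not state the n-gram size or the similarity measure used in the thesis. The following is a minimal sketch of one common formulation, assuming character bigrams and the Dice coefficient; the function names `char_ngrams` and `dice_similarity` are illustrative and not taken from the thesis. The example words are English so that no Ge'ez data is invented here.

```python
def char_ngrams(word: str, n: int = 2) -> set[str]:
    """Return the set of distinct character n-grams in `word`."""
    # A word shorter than n yields an empty set.
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over distinct character n-grams of two words.

    Assumption: bigrams (n=2) and Dice are illustrative defaults,
    not settings confirmed by the abstract.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        # Too short to form n-grams: fall back to exact comparison.
        return 1.0 if a == b else 0.0
    shared = len(grams_a & grams_b)
    return 2 * shared / (len(grams_a) + len(grams_b))


# 6 shared bigrams, 7 and 8 distinct bigrams: 2*6 / (7+8)
print(dice_similarity("statistics", "statistical"))  # 0.8
```

In a hybrid setup of this kind, words whose similarity exceeds a chosen threshold could be conflated to the same stem after affix removal. The threshold is another parameter the abstract does not specify. Because Python strings are sequences of Unicode code points, the same functions work unchanged on Ethiopic script input.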