Jimma University Open access Institutional Repository

Designing And Developing Stemmer For Ge'ez Language Text: A Hybrid Approach

dc.contributor.author Solomon Nigatu
dc.contributor.author Solomon Teferra
dc.contributor.author Teklu Urgessa
dc.date.accessioned 2023-06-07T08:39:17Z
dc.date.available 2023-06-07T08:39:17Z
dc.date.issued 2021-11
dc.identifier.uri https://repository.ju.edu.et//handle/123456789/8164
dc.description.abstract Stemming is widely used in information retrieval tasks, and many researchers have demonstrated that it improves the performance of information retrieval systems. Stemming is a technique for reducing inflectional and derivational variants of words to their stem or root form. It is useful for improving retrieval effectiveness, particularly for text searches, and for resolving word mismatch issues. The aim of this study was to design and develop a hybrid stemmer for Ge'ez language text. We used two approaches, namely affix removal and a character n-gram technique. The proposed methods can remove prefixes, infixes, suffixes and their combinations. To remove all affixes, rules were compiled individually for each affix type, and exception and recoding rules were integrated based on the nature of Ge'ez morphology. The corpus was prepared manually from readily available sources such as textbooks, magazines and the Bible. The prepared corpus contains 13,221 word tokens, of which 20% was used for testing the proposed stemmer. The stemmer was evaluated by manual error counting in two stages: first the affix removal version was evaluated on a test set of 2,644 words, and then the hybrid version was evaluated on the same test set. The affix removal version achieved an accuracy of 92.32% with a 7.68% error rate, and the hybrid version achieved an accuracy of 94.5% with a 5.5% error rate, an improvement of 2.18 percentage points. Over-stemming and under-stemming errors were observed in both versions: the affix removal and hybrid versions showed over-stemming error rates of 4.5% and 2.2% and under-stemming error rates of 3.18% and 3.3%, respectively.
Overall, the proposed hybrid stemmer outperformed the previous rule-based and longest-match stemmers by 12.26% and 8.28% in accuracy, reducing error rates by 12.08% and 7.28%, respectively. This is due to the incorporation of exception and recoding rules based on a detailed study of the language. We conclude that the proposed hybrid stemmer gives encouraging results and that it may be useful as a pre-processing module in further research. en_US
dc.language.iso en_US en_US
dc.subject Geez Stemmer, Information Retrieval, N-Gram, Hybrid Stemmer, Natural Language Processing, Conflation. en_US
dc.title Designing And Developing Stemmer For Ge'ez Language Text: A Hybrid Approach en_US
dc.type Thesis en_US
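
The abstract combines affix removal with a character n-gram technique. The record does not give the actual rules, so the following is only a minimal sketch of how such a hybrid could be wired together, not the thesis implementation. Everything in it is an assumption made for illustration: the affix lists are toy Latin-script placeholders (not real Ge'ez morphemes), the n-gram measure is the Dice coefficient over character bigrams, and the names `strip_affixes`, `dice`, `hybrid_stem` and the 0.6 threshold are hypothetical. Infix handling and the exception rules mentioned in the abstract are omitted.

```python
from typing import List, Set

# Toy affix lists for illustration only. These are placeholders,
# NOT Ge'ez morphemes and NOT taken from the thesis.
PREFIXES: List[str] = ["un", "re"]
SUFFIXES: List[str] = ["ing", "ed", "s"]
MIN_STEM_LEN = 3  # assumed guard against over-stemming


def strip_affixes(word: str) -> str:
    """Remove at most one prefix and one suffix, trying longer affixes first."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams; short words yield themselves."""
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient between the n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def hybrid_stem(word: str, known_stems: List[str], threshold: float = 0.6) -> str:
    """Strip affixes, then snap to the most similar known stem if close enough."""
    candidate = strip_affixes(word)
    best = max(known_stems, key=lambda s: dice(candidate, s), default=None)
    if best is not None and dice(candidate, best) >= threshold:
        return best
    return candidate


if __name__ == "__main__":
    stems = ["run", "play", "walk"]
    for w in ["running", "replaying", "walked"]:
        print(w, "->", hybrid_stem(w, stems))
```

In this sketch the n-gram stage only corrects what affix stripping leaves behind: "running" is stripped to "runn", which shares two of its three bigrams with "run" and is therefore conflated with it. A real stemmer for Ge'ez would operate on Ethiopic script and would need language-specific rules in place of the toy lists.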

