Abstract:
Stemming is widely used in information retrieval tasks, and many researchers have demonstrated
that it improves the performance of information retrieval systems. Stemming is a technique
for reducing inflectional and derivational variants of words to their stem or root form.
It is useful for improving retrieval efficiency, particularly for text searches, and for
resolving mismatch issues.
The aim of this study was to design and develop a hybrid stemmer for Ge'ez language text.
We used two approaches, namely affix removal and a character n-gram technique. The
proposed methods can remove prefixes, infixes, suffixes, and their combinations. To remove all
affixes, rules were compiled individually for each affix, and exception and recoding rules were
also integrated based on the nature of Ge'ez morphology. The corpus was prepared manually
from readily available sources such as textbooks, magazines, and the Bible. The prepared
corpus contains 13,221 word tokens, of which 20% was used for testing the proposed
stemmer.
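
The abstract does not state how the two approaches are combined, so the following is only a
minimal sketch of one common arrangement: strip affixes by rule, then map the result to a
known stem when their character bigram overlap (Dice coefficient) is high enough. The affix
lists, the threshold, and all function names are hypothetical placeholders in Latin script;
they are not the Ge'ez rules compiled in this study.

```python
# Hypothetical placeholder affixes; NOT the Ge'ez rules from the study.
PREFIXES = ("un", "re")
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LEN = 3  # assumed guard against over-stemming very short words


def strip_affixes(word):
    """Remove at most one prefix and one suffix, trying longer affixes first."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient over the character n-gram sets of two words."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def hybrid_stem(word, known_stems, threshold=0.6):
    """Strip affixes, then snap to the most similar known stem if close enough."""
    candidate = strip_affixes(word)
    best = max(known_stems, key=lambda s: dice(candidate, s), default=None)
    if best is not None and dice(candidate, best) >= threshold:
        return best
    return candidate
```

In this sketch the n-gram step acts as a fallback that conflates variants the rules leave
behind; a real implementation for Ge'ez would also need infix handling and the exception and
recoding rules described above.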
To evaluate the proposed stemmer, a manual error-counting mechanism was used. The stemmers
were evaluated in two stages: first, the affix removal version was evaluated on a testing
dataset of 2,644 words; second, the hybrid version was evaluated on the same testing
dataset. According to the evaluation results, the affix removal version achieved an accuracy of
92.32% with a 7.68% error rate, and the hybrid version achieved an accuracy of 94.5% with a
5.5% error rate, an improvement of 2.18 percentage points in accuracy. Both over-stemming and
under-stemming errors were observed in the affix removal and hybrid versions: 4.5% and 2.2%
over-stemming errors and 3.18% and 3.3% under-stemming errors, respectively. Overall, the
proposed hybrid stemmer outperformed the previous rule-based and longest-match stemmers by
12.26% and 8.28% in accuracy, and reduced error rates by 12.08% and 7.28%, respectively. This
is due to incorporating exception and recoding rules based on a detailed study of the language.
Finally, we found the proposed hybrid stemmer to be encouraging, and using it as a
pre-processing module in further research may be helpful.
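
The reported figures can be cross-checked with a little arithmetic. The sketch below assumes
that the total error rate is the sum of the over-stemming and under-stemming rates and that
accuracy is 100% minus that error, which is consistent with the numbers above. The function
and variable names are illustrative only.

```python
def rates(over_pct, under_pct):
    """Return (accuracy, error) in percent, assuming error = over + under."""
    error = over_pct + under_pct
    return round(100 - error, 2), round(error, 2)


# Over-stemming and under-stemming rates as reported in the abstract.
affix_acc, affix_err = rates(4.5, 3.18)    # affix removal version
hybrid_acc, hybrid_err = rates(2.2, 3.3)   # hybrid version

# Accuracy gain of the hybrid version, in percentage points.
gain = round(hybrid_acc - affix_acc, 2)

# 20% of the 13,221-token corpus, rounded down to whole words.
test_size = 13221 * 20 // 100

print(affix_acc, affix_err)
print(hybrid_acc, hybrid_err)
print(gain, test_size)
```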