Abstract:
A stemmer is typically used as a standalone module in the architecture of NLP systems. It is
particularly important for the development of search engines, machine translation (MT), speech
recognition, text categorization, information extraction (IE), and text summarization. In the
study of language morphology, stemming is the
reduction of inflected (or occasionally derived) words to their stem, base, or root form. Despite
this importance, no stemming algorithm has yet been developed for Dawurootsuwa, which limits the
use of NLP applications for the language. Hence, the language needs an automatic word conflation
system.
A rule-based stemming system for the Dawuro language (Dawurootsuwa) is described in this
thesis. It is based on the Porter stemmer, the most popular stemmer for English. The system
takes a word as input and executes an algorithm organized as a sequence of steps, each made up
of several rules. Every stemming rule in Dawurootsuwa applies in several contexts, and these
contexts were taken into account when designing the stemmer. Work of this kind requires a
thorough understanding of the language's morphology, so Dawurootsuwa morphology was studied and
described in detail in order to model the language and develop an automatic conflation
procedure. The stemmer was designed by categorizing words based on their affixes.
The outcome of this study is a rule-based, context-sensitive, iterative stemmer for Dawurootsuwa.
The stemmer's performance was assessed mainly using the error counting technique. A test set of
3000 words, distinct from the training set, was collected mainly from published papers on
Dawurootsuwa morphology, religious books (such as the Bible), and a Dawurootsuwa-Amharic-English
dictionary, so that it would cover a range of topics. The evaluation shows that the algorithm
produces correct output for 93.96 percent of the test words. A 6.02 percent error rate is
considered typical, and other evaluation techniques are briefly explored at the end.
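
As a rough illustration of the design summarized above, the following Python sketch shows how a rule-based, context-sensitive, iterative stemmer and an error-counting evaluation could be organized. It is not the thesis implementation: the suffix rules are English-like placeholders rather than Dawurootsuwa affixes, the only context condition used is a minimum remaining stem length, and the names `STEPS`, `min_len`, `apply_step`, `stem`, and `error_rate` are hypothetical.

```python
from typing import Callable, List, Tuple

# A rule is (suffix, replacement, context check on the remaining stem).
Rule = Tuple[str, str, Callable[[str], bool]]


def min_len(n: int) -> Callable[[str], bool]:
    """Context check: the remaining stem must keep at least n characters."""
    return lambda remaining: len(remaining) >= n


# Placeholder rules for illustration only; these are NOT Dawurootsuwa affixes.
STEPS: List[List[Rule]] = [
    [("es", "", min_len(3)), ("s", "", min_len(3))],     # step 1
    [("ing", "", min_len(3)), ("ed", "", min_len(3))],   # step 2
]


def apply_step(word: str, rules: List[Rule]) -> str:
    """Apply the first rule whose suffix matches and whose context holds."""
    for suffix, replacement, context_ok in rules:
        if word.endswith(suffix):
            remaining = word[: -len(suffix)]
            if context_ok(remaining):
                return remaining + replacement
    return word


def stem(word: str) -> str:
    """Run all steps repeatedly until the word stops changing."""
    while True:
        result = word
        for rules in STEPS:
            result = apply_step(result, rules)
        if result == word:
            return word
        word = result


def error_rate(pairs: List[Tuple[str, str]]) -> float:
    """Error counting: fraction of words whose stem differs from the expected one."""
    errors = sum(1 for word, expected in pairs if stem(word) != expected)
    return errors / len(pairs)


if __name__ == "__main__":
    sample = [("walking", "walk"), ("walked", "walk"),
              ("bus", "bus"), ("class", "class")]
    print(error_rate(sample))
```

In this sketch the minimum-length context keeps a short word such as "bus" intact, while "class" is over-stemmed by the repeated passes and is counted as an error. This is the kind of case that a language-specific context rule would need to handle.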