Developing a Stemmer for Dawurootsuwa Language Using a Rule-Based Approach

Habtamu Dubale; Getachew Mamo; Zerihun Olana

URI: https://repository.ju.edu.et//handle/123456789/8633

Date: 2023-06

Abstract:

A stemmer is typically used as a standalone module in the architecture of NLP systems. It is particularly important for the development of search engines, machine translation (MT), speech recognition, text categorization, information extraction (IE), and text summarization. In the study of language morphology, stemming is the reduction of inflected (or occasionally derived) words to their stem, base, or root form. Despite this importance, no stemming algorithm has yet been developed for Dawurootsuwa, which limits the application of NLP to the language. Hence, the language needs an automatic word conflation system. This thesis describes a rule-based stemming system for the Dawuro language (Dawurootsuwa). It is based on the Porter stemmer, the most popular stemmer for English. The system takes a word as input and executes an algorithm organized as a sequence of steps, each made up of several rules. Every stemming rule in Dawurootsuwa applies in several contexts, and these contexts were taken into account when designing the stemmer. Work of this kind requires a thorough understanding of the language's morphology, so Dawurootsuwa morphology was studied and described in detail to model the language and develop an automatic conflation procedure. The stemmer was designed by categorizing words based on their affixes. The outcome of this study is a rule-based, context-sensitive, iterative stemmer for Dawurootsuwa. The stemmer's performance was assessed mainly using the error counting technique. A test set of 3000 words, distinct from the training set, was collected mainly from several published papers on Dawurootsuwa morphology, religious books (such as the Bible), and a Dawurootsuwa-Amharic-English dictionary, so that it covers a variety of topics. The evaluation shows that the stemmer produces correct output for 93.96 percent of the test words. A 6.02 percent error rate is considered to be typical, and other evaluation techniques are briefly explored at the end.
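
The abstract describes a stemmer that runs a word through ordered steps of suffix rules, checks a context condition before each removal, and repeats until nothing changes. The following Python sketch illustrates that general shape only. It is not the thesis's rule set: the suffixes `-xa`, `-zo`, and `-zoxa`, the toy words, the minimum-stem-length context, and all function names are invented placeholders, and the over/under-stemming split by output length is a simplification of error counting.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    suffix: str        # suffix to strip
    replacement: str   # text appended after stripping (often empty)
    min_stem_len: int  # context condition: shortest stem allowed to remain


# HYPOTHETICAL rules: placeholder suffixes, not real Dawurootsuwa affixes.
STEPS: List[List[Rule]] = [
    [Rule("xa", "", 3), Rule("zoxa", "", 3)],  # step 1
    [Rule("zo", "", 3)],                       # step 2
]


def apply_step(word: str, rules: List[Rule]) -> str:
    """Apply at most one rule from a step, trying longer suffixes first."""
    for rule in sorted(rules, key=lambda r: len(r.suffix), reverse=True):
        if word.endswith(rule.suffix):
            stem_part = word[: len(word) - len(rule.suffix)]
            # Strip only when the context condition holds; otherwise
            # fall through and try the next (shorter) suffix.
            if len(stem_part) >= rule.min_stem_len:
                return stem_part + rule.replacement
    return word


def stem(word: str) -> str:
    """Run all steps repeatedly until a full pass changes nothing."""
    word = word.lower()
    while True:
        before = word
        for rules in STEPS:
            word = apply_step(word, rules)
        if word == before:
            return word


def evaluate(gold: List[Tuple[str, str]]) -> Dict[str, float]:
    """Count errors against (word, expected_stem) pairs.

    Simplification: an output shorter than the expected stem is counted
    as over-stemmed, any other mismatch as under-stemmed.
    """
    correct = over = under = 0
    for word, expected in gold:
        result = stem(word)
        if result == expected:
            correct += 1
        elif len(result) < len(expected):
            over += 1
        else:
            under += 1
    return {
        "correct": correct,
        "over_stemmed": over,
        "under_stemmed": under,
        "accuracy": correct / len(gold),
    }


if __name__ == "__main__":
    # Toy gold standard built from invented words.
    gold = [
        ("malazoxa", "mala"),
        ("malaxazo", "mala"),
        ("doxa", "doxa"),
        ("sibozo", "sibozo"),
    ]
    print(evaluate(gold))
```

In this toy run, `malaxazo` needs two passes (first `-zo`, then `-xa`), which is what makes the procedure iterative, while `doxa` is left intact because stripping `-xa` would leave a stem shorter than the assumed minimum.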
