Abstract:
Today, the development of the internet has made Afaan Oromo texts easily available and
widespread online. Along with the ever-increasing volume of information resources, there is
growing interest in better solutions for finding, filtering, and organizing these resources,
and automatic text classification is an inevitable solution. Text classification (TC), also
called text categorization, means assigning a text or document to predefined labels. This
study proposes to apply deep learning algorithms with word embedding to Afaan
Oromo news text classification. Because feature extraction from news text is difficult, this
work uses deep learning algorithms for news text classification. To classify the text data,
earlier approaches used a bag-of-words representation of the text, and the
information carried by word order, which is an important factor in classifying news
text, was not considered. Although the earlier models have low time complexity, the context
and potential semantic relationships among words are not fully considered, and as the number
of features and classes increases, the accuracy of the models decreases. The objective of this
thesis is to apply deep learning approaches to Afaan Oromo news text categorization using
CNN, LSTM, and BiLSTM (the latter two being variants of RNN) with word embedding, and to
recommend the best model for the problem at hand. To develop these models, six
thousand one hundred ten (6,110) newly collected and annotated news texts were used
to build the classifiers for the Afaan Oromo language, and around 1,731,856 unannotated words
were scraped from the Afaan Oromo news domain to develop a pre-trained word embedding
model. In this work, various natural language processing tasks are performed, including text
preprocessing (normalization, tokenization, text cleaning, and stop-word removal).
For word representation, the word2vec embedding, which is based on probabilistic word
prediction, is selected because it shows greater accuracy than the fastText and embedded
alternatives. Lastly, the
results of our models are compared: CNN achieves the highest accuracy, with 98.4% accuracy and
98.4% precision, while LSTM achieves 95% accuracy and 94% precision, and BiLSTM achieves
97.28% accuracy and 97.36% precision.
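
The preprocessing steps named above (normalization, tokenization, text cleaning, and stop-word removal) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the normalization rules (lowercasing and unifying apostrophe variants), and the short stop-word list are all assumptions made for illustration only.

```python
import re

# Hypothetical placeholder list; the stop-word list used in the thesis is not given here.
STOP_WORDS = {"fi", "kan", "akka", "irra", "keessa"}


def normalize(text):
    """Lowercase the text and map curly apostrophes to a straight one (assumed rules)."""
    text = text.lower()
    return text.replace("\u2019", "'").replace("\u2018", "'")


def clean(text):
    """Replace digits and punctuation with spaces, keeping letters and apostrophes."""
    text = re.sub(r"[^a-z'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split the cleaned text on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the (placeholder) stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text):
    """Run the four steps in order and return the remaining tokens."""
    return remove_stop_words(tokenize(clean(normalize(text))))


if __name__ == "__main__":
    print(preprocess("Oduu Haaraa: Barattoonni 25 fi barsiisonni!"))
```

The token lists produced this way would then be the input to a word embedding model and, from there, to the classifiers; the apostrophe is kept inside words because Afaan Oromo orthography uses it as a letter.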