Afaan Oromo News Text Classification Using A Deep Learning Approach

Lalisa Tadesa; Getachew Mamo; Amanuel Asseffa

URI: https://repository.ju.edu.et//handle/123456789/7434

Date: 2022-06

Abstract:

The growth of the internet has made Afaan Oromo texts widely and easily available online. As the volume of information resources keeps increasing, so does interest in better ways of finding, filtering, and organizing them, and automatic text classification is an essential solution to this problem. Text classification (TC), also called text categorization, assigns a text or document to one of a set of predefined labels. This study applies deep learning algorithms with word embedding to Afaan Oromo news text classification, motivated by the difficulty of extracting feature values from news text. Earlier approaches represented text as a bag of words and so ignored the information carried by word order, which is an important factor in classifying news text. Although these earlier models have low time complexity, they do not fully consider the context and latent semantic relationships of the words in a text, and their accuracy decreases as the number of features and classes increases. The objective of this thesis is to apply deep learning approaches to Afaan Oromo news text categorization using CNN, LSTM, and BiLSTM (the latter two being variants of RNN) with word embedding, and to recommend the best of them for the problem at hand. To build the classification models for the Afaan Oromo language, a newly collected and annotated dataset of six thousand one hundred ten (6110) news texts was used, and around 1,731,856 unannotated words were scraped from the Afaan Oromo news domain to develop a pre-trained word embedding model. The work also performs several text preprocessing tasks: normalization, tokenization, text cleaning, and stop-word removal.
For word representation, the prediction-based word2vec word embedding was selected because it gave higher accuracy than the fastText and embedded alternatives. Finally, the results of the models were compared: CNN performed best, with 98.4% accuracy and 98.4% precision; LSTM achieved 95% accuracy and 94% precision; and BiLSTM achieved 97.28% accuracy and 97.36% precision.
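
The preprocessing steps named in the abstract (normalization, tokenization, text cleaning, stop-word removal) and the bag-of-words limitation it describes can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the cleaning rule, the sample strings, and the three-word stop list are all assumptions chosen for illustration only.

```python
import re
from collections import Counter

# Hypothetical placeholder stop list; NOT the list used in the thesis.
STOP_WORDS = {"fi", "kan", "akka"}


def preprocess(text, stop_words=STOP_WORDS):
    """Normalize, clean, tokenize, and remove stop words (illustrative rules)."""
    text = text.lower()                     # normalization: case folding
    text = text.replace("\u2019", "'")      # normalization: unify apostrophes
    text = re.sub(r"[^a-z'\s]", " ", text)  # cleaning: drop digits and punctuation
    tokens = text.split()                   # tokenization on whitespace
    return [t for t in tokens if t not in stop_words]


def bag_of_words(tokens):
    """Count tokens, discarding their order."""
    return Counter(tokens)


# Two made-up inputs that differ only in word order.
first = preprocess("Oduu fi Ispoortii 2022!")
second = preprocess("Ispoortii fi Oduu")

print(first)                                        # ordered token sequence
print(bag_of_words(first) == bag_of_words(second))  # same counts
print(first == second)                              # different sequences
```

The two inputs produce identical bag-of-words counts but different token sequences, which is the word-order information that sequence models such as CNN, LSTM, and BiLSTM over word embeddings can use and a bag-of-words representation cannot.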
