Abstract:
Users of social media can share and consume information freely. This freedom also allows them to
disseminate toxic content, which we refer to as offensive language. In a country like Ethiopia,
where many nations and nationalities live together, sharing offensive language on social
media can negatively affect the welfare of ethnic groups and political parties, as well as the
religious views of society. Therefore, we aimed to develop an offensive language detection and categorization
model for Afaan Oromo text on social media platforms such as Facebook and Twitter using
supervised machine learning techniques. In order to evaluate the performance of our models, we
manually collected 1051 posts, comments, and tweets from the Facebook and Twitter pages of
different users. Legal and linguistic experts were involved in data annotation. In order to have
an appropriate version of the dataset, preprocessing tasks such as tokenization, normalization, stop
word removal, and special character removal were applied to the data collected from the different
sources. For classification, five machine learning techniques were used: Support Vector
Machine (SVM), Multinomial Naïve Bayes (MNB), Decision Tree (DT), K-Nearest Neighbors
(KNN), and Logistic Regression (LR). We developed two automatic
classification systems: an offensive language detection system and an offensive language
categorization system. For offensive language detection, the best-performing technique was
MNB, which achieved 86% precision, 83% accuracy, and a micro-averaged F1-score of 85%. Similarly,
for offensive language categorization, the best-performing technique was SVM, which achieved 82%
precision, 56% accuracy, and a micro-averaged F1-score of 61%.
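
The preprocessing steps named above (normalization, special character removal, tokenization,
and stop word removal) could be sketched roughly as follows. This is a minimal illustration
under stated assumptions, not the pipeline used in this work: the function names, the tiny stop
word list, and the choice to keep apostrophes inside words are all hypothetical.

```python
import re
import unicodedata

# ASSUMPTION: a tiny illustrative stop word list, not the list used in the paper.
STOP_WORDS = {"fi", "kan", "akka", "irraa", "keessa"}


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization, lowercase, and unify apostrophe variants."""
    text = unicodedata.normalize("NFKC", text).lower()
    return text.replace("\u2019", "'").replace("`", "'")


def remove_special_characters(text: str) -> str:
    """Drop URLs, then replace everything except a-z, apostrophes, and whitespace."""
    text = re.sub(r"https?://\S+", " ", text)
    # ASSUMPTION: digits and punctuation are discarded; apostrophes are kept
    # because they can occur inside words.
    return re.sub(r"[^a-z'\s]", " ", text)


def tokenize(text: str) -> list[str]:
    """Split on whitespace and strip stray apostrophes from token edges."""
    tokens = (tok.strip("'") for tok in text.split())
    return [tok for tok in tokens if tok]


def preprocess(text: str) -> list[str]:
    """Run normalization, cleaning, tokenization, and stop word removal in order."""
    cleaned = remove_special_characters(normalize(text))
    return [tok for tok in tokenize(cleaned) if tok not in STOP_WORDS]


if __name__ == "__main__":
    # A made-up input string, used only to show the shape of the output.
    print(preprocess("Inni FI ishiin http://example.com gaarii dha!!! 123"))
```

The resulting token lists would then be converted to numeric features before being passed to
classifiers such as MNB or SVM; the feature representation is not specified here, so it is left
out of the sketch.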