Abstract:
Social media today affects a nation's social, political, and economic facets in both positive and negative ways.
Positive effects include the facilitation of digital opinion exchanges and the rapid and broad dissemination of
information. Negative effects include the spread of hate speech, which disparages individuals on the basis of
shared traits such as gender (sexism), race, religion, color, disability, and nationality. These traits are
protected characteristics, meaning that it is against the law to discriminate against someone because of them.
The use of social media platforms, such as Facebook and Twitter, to organize
hateful events and spread hate speech has become more common. The unstructured nature of social media data
makes manual tracking difficult, which motivates further work on detecting hate speech and identifying
harassment based on protected characteristics. This study aims to develop a deep learning method for detecting
and identifying harassment and hate speech on social media, based on protected characteristics, in the Afaan
Oromo language. We used an experimental research design. Data were collected with Facepager and Google Forms
and preprocessed through normalization, data cleaning, and tokenization. The experiments followed a two-step
approach.
The primary dataset was used for experimentation with the pretrained BERT model. In the first step, to identify
the best-performing deep learning technique on our dataset, we trained a convolutional neural network (CNN), a
long short-term memory (LSTM) network, a bidirectional long short-term memory (BiLSTM) network, and a gated
recurrent unit (GRU) network. However, overfitting occurred because of the limited size of our dataset. To
address it, we employed cross-validation and L2 regularization. In the second step, to address the scarcity of
training data, we applied the pretrained BERT model. We evaluated each model by its accuracy and loss. After
preprocessing and training, the CNN achieved an accuracy of 98.44% and a loss of 0.0396, and the bidirectional
encoder representations from transformers (BERT) model achieved an accuracy of 98.83% and a loss of 0.0952.
Overall, the BERT model outperformed the other algorithms with 98.83% accuracy. The study used Afaan Oromo language
features to detect harassment and hate speech on social media. Future research could use social media data to
create unique word embeddings and assess the CapsNet model's effectiveness on non-textual data.
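
The preprocessing stages named above (normalization, data cleaning, and tokenization) could be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the regular expressions, the choice to drop URLs, mentions, and digits, and the choice to keep apostrophes inside words (on the assumption that they are part of Afaan Oromo spelling) are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical patterns; the study does not specify its cleaning rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Lowercase Latin words, optionally containing internal apostrophes.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def normalize(text: str) -> str:
    """Unify Unicode forms, map curly apostrophes to ASCII, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    return text.lower()


def clean(text: str) -> str:
    """Remove URLs, user mentions, and hashtag markers; collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list:
    """Split cleaned text into word tokens, discarding digits and punctuation."""
    return TOKEN_RE.findall(text)


def preprocess(text: str) -> list:
    """Apply normalization, cleaning, and tokenization in that order."""
    return tokenize(clean(normalize(text)))


if __name__ == "__main__":
    # Illustrative input only; not taken from the study's dataset.
    print(preprocess("Akkam JIRTU? @user1 https://example.com #Nagaa ba\u2019e!!"))
```

The resulting token lists would then be mapped to embeddings for the CNN and recurrent models, or replaced by a pretrained BERT tokenizer for the second step.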