Abstract:
Named Entity Recognition (NER) is an essential task for many applications in Natural
Language Processing (NLP), which aims to make computers capable of comprehending and
processing human language. This study addresses the challenge of developing an NER system
for Kafinoonoo, a low-resource language spoken in Ethiopia, which has limited linguistic
resources and NLP infrastructure. The main objective was to develop and evaluate a deep
learning-based NER system tailored to the characteristics of the Kafinoonoo language.
We explored combinations of input representation methods,
context encoders, and tag decoders to determine the most effective architecture. The input
representations included static word embeddings such as Word2Vec and FastText, as well
as transformer-based language models such as BERT and RoBERTa, which provide richer,
context-sensitive representations. Recurrent neural networks, specifically BiLSTM and
BiGRU, were used as context encoders to capture sequential context, while Softmax and
Conditional Random Fields (CRF) served as tag decoders for assigning entity labels. The
results demonstrated that models employing transformer-based representations with recurrent
encoders and CRF decoders consistently outperformed other configurations. Notably, the
combination of RoBERTa with BiLSTM and CRF achieved the highest F1 score of 0.90,
underscoring the benefit of contextual representations and CRF decoding for tagging
accuracy. This research can support Kafinoonoo language preservation, search engine
efficiency, and information extraction, and may promote digital literacy and cultural
preservation within the Kafecho community. It addresses the unique linguistic challenges of Kafinoonoo and
offers a strong basis for developing NER systems for low-resource languages. Despite these
contributions, the work has limitations that may affect generalizability, namely a small
dataset and the lack of comprehensive annotated corpora. To improve performance, future
work should focus on expanding annotated datasets, applying semi-supervised and
unsupervised learning approaches, and investigating data augmentation strategies.
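
As an illustration of why a CRF decoder can outperform a Softmax decoder, the following minimal Python sketch contrasts per-token argmax decoding with Viterbi decoding over the same emission scores. It is not the implementation used in this study: the tag set, the emission scores, and the hand-set transition penalty are all assumptions chosen for illustration, and a trained CRF would learn its transition scores from data rather than have them fixed by a rule.

```python
# Hypothetical tag set and scores, for illustration only (not from the study).
TAGS = ["O", "B-PER", "I-PER"]
NEG = -1e4  # large penalty for transitions that violate the BIO scheme


def transition_score(prev, curr):
    """Penalize I-PER unless it follows B-PER or I-PER; every other pair scores 0."""
    if curr == "I-PER" and prev not in ("B-PER", "I-PER"):
        return NEG
    return 0.0


def argmax_decode(emissions):
    """Softmax-style decoding: pick the best tag for each token independently."""
    return [TAGS[max(range(len(TAGS)), key=lambda k: row[k])] for row in emissions]


def viterbi_decode(emissions):
    """CRF-style decoding: best tag sequence under emission plus transition scores."""
    n = len(TAGS)
    # The start of the sentence is treated like a preceding "O" tag.
    score = [emissions[0][k] + transition_score("O", TAGS[k]) for k in range(n)]
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for k in range(n):
            # Best previous tag for reaching tag k at this position.
            best_prev = max(
                range(n),
                key=lambda j: score[j] + transition_score(TAGS[j], TAGS[k]),
            )
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + transition_score(TAGS[best_prev], TAGS[k]) + row[k]
            )
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full sequence.
    best = max(range(n), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [TAGS[k] for k in path]


# Made-up emission scores for a three-token sentence (columns follow TAGS).
emissions = [
    [2.0, 1.8, 0.1],  # token 1: O narrowly preferred over B-PER
    [0.2, 0.5, 2.5],  # token 2: I-PER strongly preferred
    [3.0, 0.1, 0.2],  # token 3: O preferred
]

softmax_tags = argmax_decode(emissions)
crf_tags = viterbi_decode(emissions)
print("argmax :", softmax_tags)
print("viterbi:", crf_tags)
```

In this toy case, independent argmax decoding yields an I-PER tag directly after O, which is not a valid BIO sequence, whereas Viterbi decoding revises the first tag to B-PER so that the whole sequence is consistent.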