Abstract:
Automatic speech recognition (ASR) is the task of translating spoken language into written text.
The purpose of this study was to develop a large-vocabulary continuous speech recognition model for the Afaan Oromo language using the Speech Transformer, a state-of-the-art deep learning architecture. Most previous studies of Afaan Oromo ASR used classical machine learning models, and only one attempted a hybrid approach combining recurrent neural networks (RNNs) with convolutional neural networks (CNNs). These existing techniques had difficulty accurately transcribing the varied and complex speaking styles found in Afaan Oromo and suffered from slow training, because recurrence prevents parallelization across time steps. We addressed these limitations by taking advantage of the powerful non-recurrent sequence-to-sequence learning capabilities inherent in the Speech Transformer architecture. Unlike recurrent approaches, the Speech Transformer can process an entire input time series in parallel, enabling faster training than sequential methods. This parallelization was particularly advantageous because, even with Google Colab resources for training, computational constraints were real. The speech corpus was prepared by collecting broadcast news audio from various Afaan Oromo media sources, totaling 8,729 utterances from 100 speakers (50 male and 50 female), for a dataset of
18.04 hours. We experimented with four models, varying the number of encoder and decoder layers and the size of the feed-forward neural network (FFNN). The best-performing model, with five encoder layers, three decoder layers, and an FFNN size of 400, achieved a word error rate (WER) of 40.2%. While this
represents a promising result, we acknowledge that further improvements could be achieved by
increasing the dataset size and using high-performance GPUs to enable the construction of larger
and more complex models. In future work, we recommend conducting further studies with larger
vocabularies and better computational resources to continue advancing the state of the art in
Afaan Oromo speech recognition. Additionally, we plan to explore the use of language models to
further enhance the accuracy and robustness of the Afaan Oromo ASR system.
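The word error rate reported above is the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the system output, normalized by the reference length. A minimal sketch of this metric, assuming simple whitespace tokenization (the example sentences below are hypothetical, not taken from the study's corpus):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words (Levenshtein DP table).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of three gives a WER of 1/3.
print(wer("akkam jirta hara", "akkam jirtu hara"))  # 0.333...
```

In practice, evaluation toolkits also report the separate substitution, insertion, and deletion counts, which help diagnose whether errors come from acoustic confusions or from segmentation.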