Abstract:
Automatic Speech Recognition (ASR) takes speech audio as input and converts it to text as
output. This study attempts to design an automatic Afaan Oromo speech-to-text recognizer
using state-of-the-art deep learning algorithms. Accordingly, the study explored the possibility
of developing a continuous speech recognition system for Afaan Oromo. Previous related work on
local languages, including Afaan Oromo, was reviewed, but no work on Afaan Oromo using deep
learning algorithms was found; all previous Afaan Oromo ASR systems were based on traditional
machine learning models. In this thesis, deep bidirectional RNN and CNN/RNN hybrid models are
proposed to show the possibility of developing ASR for local languages, and Afaan Oromo in
particular, using deep learning, and to improve the performance of Afaan Oromo continuous
speech recognition systems. TensorFlow, Keras, Jupyter Notebook, PyDub, Matplotlib, and Pydot
were the tools used to train, validate, and visualize the models.
The speech corpus was prepared by collecting broadcast news audio from the Ethiopian
Broadcasting Corporation (EBC), Oromia Broadcasting Network (OBN), Oromia Media Network (OMN),
Voice of America (VOA), Fana Broadcasting Corporation (FBC), and the BBC Afaan Oromo
program. In total, about 8000 utterances from 101 speakers (80 male and 21 female), amounting
to 10:01:38 hours of speech, were collected and transcribed. The dataset was used for training,
validation, and testing.
We trained and evaluated both the RNN and CNN/RNN models with connectionist temporal
classification (CTC) to tackle the sequence problem. We also tuned the learning rate, optimizer,
number of neurons, and number of layers of the recognizer model, within the limits of the
available resources, to increase the performance of the recognizer. Multiple experiments
were conducted, and the CNN/RNN hybrid model was chosen as the best model for our case.
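
As an illustration of how CTC output is turned into a label sequence, the following is a minimal sketch of greedy (best-path) decoding. It assumes the per-frame most likely labels have already been taken from the network output and that index 0 is the blank symbol; the function name and the choice of greedy decoding are assumptions made here for illustration, not necessarily the decoding procedure used in the experiments.

```python
from itertools import groupby


def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame label sequence the way CTC best-path decoding does.

    frame_labels: iterable of integer label ids, one per audio frame
                  (hypothetical input, e.g. the argmax of the network output).
    blank:        id of the CTC blank symbol (assumed to be 0 here).
    """
    # Step 1: merge consecutive repeated labels into a single label.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop the blank symbol, which only separates real labels.
    return [label for label in collapsed if label != blank]
```

For example, the frame sequence `[0, 1, 1, 0, 1, 2, 2, 0]` collapses to `[1, 1, 2]`: the blank between the two runs of label 1 is what allows a repeated label to survive.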
Experimental results showed that the best performance achieved was a WER of 69% and a loss of
16.3, obtained by the CNN/RNN hybrid model. Although this result is promising, the experiments
indicate that more data and higher-performing GPUs for building larger models could improve the
performance of Afaan Oromo deep ASR. We therefore recommend that further studies be conducted
with a larger vocabulary and better GPUs to enhance the accuracy of ASR for the Afaan Oromo
language.
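
For reference, word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a standard edit-distance recurrence; the function name and whitespace tokenization are assumptions made for illustration and may differ from the evaluation script used in this work.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()  # assumed: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

With placeholder tokens, `word_error_rate("a b c d", "a x c")` gives 0.5: one substitution and one deletion against a four-word reference. Because insertions are counted, WER can exceed 100%.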