Afaan Oromo Continuous Speech Recognition Using Deep Learning

Degefa, Sifen Dadi

URI: https://repository.ju.edu.et//handle/123456789/6163

Date: 2021-04-21

Abstract:

Automatic Speech Recognition (ASR) takes speech audio as input and converts it to text as output. This study attempts to design an automatic Afaan Oromo speech-to-text recognizer using state-of-the-art deep learning algorithms, and explores the possibility of developing a continuous speech recognition system for Afaan Oromo. Previous related work on local languages, including Afaan Oromo, was reviewed, but no work on Afaan Oromo using deep learning algorithms was found; all previous Afaan Oromo ASR systems were based on traditional machine learning models. For this thesis, deep bidirectional RNN and CNN/RNN hybrid models are proposed to show the possibility of developing ASR for local languages and Afaan Oromo using deep learning, and to improve the performance of Afaan Oromo continuous speech recognition systems. TensorFlow, Keras, Jupyter Notebook, PyDub, Matplotlib, and Pydot were used to train, validate, and visualize the models. The speech corpus was prepared by collecting broadcast news audio from Ethiopian Broadcasting Corporation (EBC), Oromia Broadcasting Network (OBN), Oromia Media Network (OMN), Voice of America (VOA), Fana Broadcasting Corporation (FBC), and the BBC Afaan Oromo program. In total, about 8000 utterances from 101 speakers (80 male and 21 female), amounting to 10:01:38 hours of data, were collected and transcribed. The dataset was used for training, validation, and testing. We trained and evaluated both the RNN and CNN/RNN models with connectionist temporal classification (CTC) to tackle sequence problems. We also adjusted the learning rate, optimizer, number of neurons, and number of layers of the recognizer model, within the available resources, to increase the recognizer's performance. After multiple experiments, the CNN/RNN hybrid model was chosen as the best model for our case.
Experimental results showed that the best performance achieved was 69% WER and a loss of 16.3, obtained by the CNN/RNN hybrid model. Although we obtained a promising result, all experiments indicate that more data and high-performance GPUs for constructing larger models could improve the performance of Afaan Oromo deep ASR. We therefore recommend that further study be conducted with a larger vocabulary and better GPUs to enhance the accuracy of ASR for the Afaan Oromo language.
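
For readers unfamiliar with the WER figure reported above, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name `word_error_rate` and the placeholder tokens are illustrative assumptions; the abstract does not say which scoring tool the thesis used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper only; not the scoring code used in the thesis.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# Placeholder tokens stand in for transcribed words.
wer = word_error_rate("w1 w2 w3 w4", "w1 wX w3")
print(f"WER = {wer:.0%}")
```

In this toy case one word is substituted and one is deleted out of four reference words. Because insertions also count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.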
