Abstract:
Automatic speech recognition (ASR) is the task of translating spoken language into written text.
The purpose of this study was to develop a large-vocabulary continuous speech recognition model for the Afaan Oromo language using the Speech Transformer, a state-of-the-art deep learning architecture. Most previous studies of Afaan Oromo ASR used classical machine learning models, and only one attempted a hybrid approach combining recurrent neural networks (RNNs) with convolutional neural networks (CNNs). These existing techniques had difficulty accurately transcribing the varied and complex speaking styles found in Afaan Oromo and suffered from slow training, because recurrence prevents parallelization across time steps. We addressed these limitations by taking advantage of the powerful non-recurrent sequence-to-sequence learning capabilities inherent in the Speech Transformer architecture. Unlike recurrent approaches, the Speech Transformer can process an entire input time series in parallel, enabling faster training than sequential methods. This parallelization was particularly advantageous because, even with Google Colab resources for training, computational constraints were real. The speech corpus was prepared by collecting broadcast news audio from various Afaan Oromo media sources, totaling 8,729 utterances from 100 speakers (50 male and 50 female), for a dataset of
18.04 hours. We experimented with four models, varying the number of encoder and decoder layers and the size of the feed-forward neural network (FFNN). The best-performing model, with five encoder layers, three decoder layers, and an FFNN size of 400, achieved a word error rate (WER) of 40.2%. While this
represents a promising result, we acknowledge that further improvements could be achieved by
increasing the dataset size and using high-performance GPUs to enable the construction of larger
and more complex models. In future work, we recommend conducting further studies with larger
vocabularies and better computational resources to continue advancing the state of the art in
Afaan Oromo speech recognition. Additionally, we plan to explore the use of language models to
further enhance the accuracy and robustness of the Afaan Oromo ASR system.
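The word error rate reported above is the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the system output, normalized by the reference length. A minimal sketch of this metric, assuming simple whitespace tokenization (the example sentences below are hypothetical, not taken from the study's corpus):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words (Levenshtein DP table).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of three gives a WER of 1/3.
print(wer("akkam jirta hara", "akkam jirtu hara"))  # 0.333...
```

In practice, evaluation toolkits also report the separate substitution, insertion, and deletion counts, which help diagnose whether errors come from acoustic confusions or from segmentation.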