Abstract:
An estimated one to two million people in Ethiopia are deaf or hard of hearing, according to the Ethiopian National Association for the Deaf and a 2019 report from the Department of Linguistics at Addis Ababa University on empowering the Deaf in Africa. For these people, sign language is the primary means of communication. Although sign language is widely used within the hearing-impaired community, its users struggle to communicate with hearing people because of the language barrier. This communication gap causes many problems in daily life, since hearing-impaired people live among people who communicate in spoken languages and few hearing people know sign language. Interpreters can help bridge this gap, but employing a personal interpreter is expensive and inconvenient when privacy is required. Consequently, it is important to develop a system that fills the communication gap between hearing-impaired and hearing people.
To address this problem, many researchers have studied Ethiopian Sign Language recognition, but their work is largely restricted to word-level (isolated) recognition. A few researchers attempted sentence-level recognition using various techniques, but their results revealed signer dependence as well as insufficient accuracy. Therefore, this study proposes Ethiopian Sign Language recognition from video sequences using pretrained CNN and RNN models, which recognizes continuous gestures performed by different signers in a video stream. The main focus of this work is to build a vision-based continuous sign language recognition system that identifies Ethiopian Sign Language gestures from video sequences using CNN and RNN models.
The proposed model comprises three major processes: preprocessing (hand, pose, and face landmark detection with MediaPipe Holistic), feature extraction with a CNN model, and feature learning and classification with an LSTM. In the feature extraction phase, characteristic features are extracted through operations such as convolution, pooling, and activation layers, and the distinguishing features are learned in the feature learning phase. For feature learning, we applied a Bidirectional Long Short-Term Memory (BiLSTM) model. Hosanna Deaf School and Jimma Zone Disability Center provided the data for our experiment. Our dataset consists of continuous gestures, with around 300 videos belonging to 5 gesture categories performed by different signers. We extract frames from each video using the OpenCV library and pass each frame through MediaPipe Holistic for preprocessing. The proposed system is implemented with Keras and TensorFlow on Google Colab.
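As an illustration of this preprocessing step, the following is a minimal sketch (not the exact thesis code) of reading a gesture video with OpenCV and extracting pose, face, and hand landmarks per frame with MediaPipe Holistic; the frame cap and the flattened keypoint layout are assumptions made for illustration.

```python
# Minimal preprocessing sketch: OpenCV frame extraction + MediaPipe Holistic
# landmarks. Keypoint layout assumed: pose (33x4), face (468x3), hands (21x3 each).
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    """Flatten pose, face, and both hand landmarks into one feature vector."""
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, lh, rh])

def video_to_keypoint_sequence(video_path, max_frames=30):
    """Read a gesture video with OpenCV and run MediaPipe Holistic on each frame."""
    cap = cv2.VideoCapture(video_path)
    sequence = []
    with mp_holistic.Holistic(min_detection_confidence=0.5,
                              min_tracking_confidence=0.5) as holistic:
        while cap.isOpened() and len(sequence) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV reads frames in BGR order.
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            sequence.append(extract_keypoints(results))
    cap.release()
    return np.array(sequence)
```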
We evaluate continuous gesture recognition by combining two different neural network architectures. In the first, a CNN is combined with an RNN, where the CNN is retrained from the pretrained VGG16 model; in the second architecture, the CNN is followed by a GRU. The two models achieved 85% and 70% accuracy, respectively.
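For concreteness, a minimal Keras sketch of the two architectures is given below. It assumes frames resized to 224x224 are fed to a pretrained VGG16 used as a per-frame feature extractor; the sequence length, layer sizes, and hyperparameters are illustrative rather than the exact configuration used in this work.

```python
# Sketch of the CNN + RNN architectures, assuming 30 frames of 224x224x3 per
# clip and 5 sentence classes; layer sizes are illustrative assumptions.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_FRAMES, NUM_CLASSES = 30, 5

# Pretrained VGG16 acts as the per-frame spatial feature extractor.
vgg = VGG16(weights="imagenet", include_top=False, pooling="avg",
            input_shape=(224, 224, 3))
vgg.trainable = False  # optionally unfreeze top blocks for fine-tuning

def build_model(recurrent_layer):
    return models.Sequential([
        layers.Input(shape=(NUM_FRAMES, 224, 224, 3)),
        layers.TimeDistributed(vgg),          # per-frame 512-d CNN features
        recurrent_layer,                      # temporal feature learning
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

# First architecture: VGG16 CNN + BiLSTM; second: VGG16 CNN + GRU.
cnn_bilstm = build_model(layers.Bidirectional(layers.LSTM(128)))
cnn_gru = build_model(layers.GRU(128))

for m in (cnn_bilstm, cnn_gru):
    m.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```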
Our study explored the ability of these two architectures to recognize five sentences of daily use. Because some gesture signs begin with the same movement and others overlap in the middle, the model becomes confused when it attempts to recognize continuous gesture signs. To improve system accuracy, more data collected with high-resolution cameras is required, along with the application of different holistic algorithms.