Abstract:
This study presents a word sequence prediction language model for Kafi-Nono.
Text generation, and in particular next-word prediction, is convenient for users because it
helps them type faster and with fewer errors. Therefore, a personalized text prediction system is a vital research topic for all languages, and primarily for Kafi-Nono, because of the limited
support for Kafa language tools. A language model (LM) gives the probability that
a sequence of words appears in a particular order in a sentence, and such models are an
essential part of word prediction and natural language processing (NLP) systems. Language models have advanced significantly since the invention of Neural Network Language
Models (NNLMs), most notably Transformers. Transformers have become the state-of-the-art language modeling tools for many NLP tasks because of their superior performance compared
to N-gram models. Similarly, word prediction systems have improved at a considerable
pace over the past decade. Although Kafi-Nono is spoken by a significant number of people, no word prediction system has been developed so far for the language; this thesis is a
first attempt to develop a word prediction language model for Kafi-Nono. We have applied
a cross-lingual transfer technique to a recent deep learning transformer model to develop
the system. The main objective of this study is to develop a word prediction language model
that can predict the next word or phrase for Kafi-Nono. The corpus was collected from books,
news, cultural documents, and the history of the societies speaking the language, for the sake
of this study only. To develop the model, we have used an unsupervised machine
learning method called transfer learning. The transformer model used in this work is the
Generative Pre-trained Transformer (GPT-2), with 12 layers, 768 hidden
units per layer, a context length of 768 tokens, an embedding dimension of 768, and 12 attention
heads. Transfer learning helped to overcome the problem of data scarcity for this language and enabled us to harness the power of neural networks to obtain reasonable results with
less effort. The idea behind the approach is to overcome the low-resource status
of the language and to handle the rich morphological inflection and the scarcity of
training data for Kafi-Nono by using an unsupervised approach. This makes our approach
effective for prediction problems where resources for the language are lacking. For our
experiment, we divided the dataset into training and evaluation parts of 80% and 20%,
respectively. A separate test set was also prepared to evaluate
the final model. To evaluate the performance of our model, we implemented two
types of evaluation metrics: human (extrinsic) evaluation and automatic (intrinsic) evaluation, namely perplexity. The results show that our model achieves an accuracy of 89% in
human evaluation and a perplexity of 4.7. These results are encouraging. However,
we trained our model with a mix of data from all dialects, and tonal features are not
treated in our data. Handling these two problems in the data could therefore bring better results.
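
As an illustration of the intrinsic metric (a minimal sketch, not code from the thesis): perplexity is the exponential of the average per-token negative log-likelihood that the model assigns to held-out text, so lower values mean the model is less "surprised" by the data.

```python
# Illustrative sketch: compute perplexity from the probability the
# language model assigns to each token of a held-out evaluation text.
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Toy example: four tokens, each assigned probability 0.2 by the model.
print(round(perplexity([0.2, 0.2, 0.2, 0.2]), 6))  # 5.0 -- equivalent to
# guessing uniformly among 5 candidate words at each position.
```

A perplexity of 4.7, as reported above, thus corresponds to the model being roughly as uncertain as a uniform choice among about five next-word candidates.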