Abstract:
The abundance of digital data makes it challenging to understand and use the overwhelming amount
of information available. Manually reading large collections for content analysis is inefficient.
The amount of information available in digital form is growing rapidly, which is leading to
information overload in almost all languages. In machine learning, topic modeling has therefore
become established as a powerful technique for content analysis. In this study, we used Latent
Dirichlet Allocation (LDA) based topic modeling to analyze Afaan Oromo text documents and
extract appropriate topic tags from the collection. LDA relies on unsupervised machine learning
to extract topics from a document collection, generating probabilistic word-topic and
topic-document associations from the latent topics in the text using hidden random variables.
We combined the LDA algorithm with a word embedding approach, which captures how words are
semantically related to each other, to improve the quality of the extracted topics, since LDA
is limited by its bag-of-words assumption. We used a collection of 16
documents from 4 categories (Health, Education, Sport and Weather condition). A clean text
corpus and parameter settings that yield interpretable topics are important prerequisites for
obtaining a valid interpretation of the topics produced by the algorithm. The experiments cover
all the necessary steps: data collection, pre-processing, model fitting and a document
exploration application. We used Gibbs sampling to estimate the topic and word distributions.
Experiments were carried out to assess how effectively the algorithm extracts topics. Documents
were then clustered and explored with LDA on the basis of the generated topics.
Our study used three evaluation measures: perplexity, to select better LDA model parameters;
topic coherence, to evaluate the coherence of the topics; and human judgment, to assess topic
interpretability. With K = 10 topics, chosen from candidate values of K ranging from 2 to 20,
the model obtained an average perplexity score of -9.775, a PMI-based topic coherence of 52.5%,
and an overall human-judgment F-measure of 66%.
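
To illustrate the Gibbs sampling step mentioned above, the following is a minimal sketch of
collapsed Gibbs sampling for LDA. It is not the implementation used in the study: the function
name lda_gibbs, the hyperparameter values alpha and beta, the iteration count and the toy English
corpus are all assumptions chosen for illustration only.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (illustrative sketch)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables: topic counts per document, word counts per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Assign every token a random initial topic and record it in the count tables.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized full conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Resample the topic and add the token back.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed point estimates of the topic-document and word-topic distributions.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + vocab_size * beta)
         for v in range(vocab_size)]
        for k in range(num_topics)
    ]
    return vocab, theta, phi


# Hypothetical toy corpus (English placeholder tokens, not the study's data).
docs = [
    ["clinic", "doctor", "vaccine", "doctor"],
    ["school", "teacher", "student", "school"],
    ["doctor", "clinic", "vaccine"],
    ["student", "teacher", "exam"],
]
vocab, theta, phi = lda_gibbs(docs, num_topics=2)
```

Here theta holds one topic distribution per document and phi holds one word distribution per
topic. In a setting like the one described above, such a sampler would be run for each candidate
K and the resulting models compared with a held-out measure such as perplexity.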