Statistical Topic Modeling for Afaan Oromo Document Clustering using Latent Dirichlet Allocation (LDA)

Fikadu Wayesa; Million Meshesha; Kibret Zewde

dc.contributor.author	Fikadu Wayesa
dc.contributor.author	Million Meshesha
dc.contributor.author	Kibret Zewde
dc.date.accessioned	2021-02-05T12:38:41Z
dc.date.available	2021-02-05T12:38:41Z
dc.date.issued	2019
dc.identifier.uri	https://repository.ju.edu.et//handle/123456789/5398
dc.description.abstract	The abundance of digital data makes it challenging to understand and use the overwhelming amount of information available. Manually reading large collections for content analysis is inefficient. The amount of information available in digital form keeps doubling, leading to information overload in almost all languages. In machine learning, topic modeling has therefore been accepted as a powerful technique for content analysis. In this study, we used Latent Dirichlet Allocation (LDA) based topic modeling to analyze Afaan Oromo text documents and extract appropriate topic tags from the collection. LDA relies on unsupervised machine learning to extract topics from a document collection by generating probabilistic word-topic and topic-document associations from the latent topics in the text using hidden random variables. Because LDA is limited by its bag-of-words assumption, we combined it with a word embedding approach, which captures how words are semantically related to each other, to improve the quality of the extracted topics. We used a collection of 16 documents from 4 categories (Health, Education, Sport and Weather condition). A clean text corpus and parameter settings that generate interpretable topics are important prerequisites for obtaining a valid interpretation of the topics produced by the algorithm. The experiments cover all necessary steps: data collection, pre-processing, model fitting and a document exploration application. We used Gibbs sampling to estimate the topic and word distributions. Experiments were carried out to confirm how well the algorithm extracts topics. Documents were clustered and explored with LDA based on the generated topics. Our study used three evaluation metrics: perplexity to select better LDA model parameters, topic coherence to evaluate the coherence of topics, and human judgment to assess topic interpretability. 
With K ranging from a minimum of 2 to a maximum of 20 topics, an average perplexity score of -9.775 was obtained with 10 topics; the PMI topic coherence score was 52.5%, and the overall F-measure from human judgment was 66%.	en_US
dc.language.iso	en	en_US
dc.subject	Topic Modeling	en_US
dc.subject	LDA	en_US
dc.subject	Statistical Modeling	en_US
dc.subject	Topic Extraction	en_US
dc.subject	Latent Topics	en_US
dc.subject	Big Data	en_US
dc.subject	Afaan Oromo	en_US
dc.title	Statistical Topic Modeling for Afaan Oromo Document Clustering using Latent Dirichlet Allocation (LDA)	en_US
dc.type	Thesis	en_US
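
The abstract describes fitting LDA with Gibbs sampling, comparing candidate values of K by perplexity, and clustering documents by their generated topics. The following is a minimal sketch of that workflow under stated assumptions, not the thesis implementation: the function names (fit_lda_gibbs, perplexity), the hyperparameters alpha and beta, the iteration count, and the tiny English placeholder corpus are all hypothetical, and the word embedding step is omitted. The perplexity computed here is plain training-set perplexity, which may not be the same quantity as the score reported in the abstract.

```python
import math
import random


def fit_lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                           # tokens per topic
    assign = []

    def update(d, v, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][v] += delta
        topic_total[k] += delta

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            update(d, word_id[w], k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                update(d, v, assign[d][i], -1)  # take the current token out
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k_new
                update(d, v, k_new, +1)

    # Smoothed topic-document (theta) and word-topic (phi) distributions.
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    phi = [[(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
           for k in range(num_topics)]
    return vocab, theta, phi


def perplexity(docs, vocab, theta, phi):
    """Per-token perplexity of the fitted model on the given documents."""
    word_id = {w: i for i, w in enumerate(vocab)}
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            v = word_id[w]
            log_lik += math.log(sum(t * row[v] for t, row in zip(theta[d], phi)))
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Hypothetical placeholder corpus (not the thesis data).
docs = [
    ["clinic", "doctor", "vaccine", "clinic"],
    ["doctor", "vaccine", "patient"],
    ["school", "teacher", "exam", "school"],
    ["teacher", "exam", "student"],
]

# Compare a few candidate K values; the abstract reports a range of 2 to 20.
scores = {}
for k in range(2, 5):
    vocab, theta, phi = fit_lda_gibbs(docs, num_topics=k)
    scores[k] = perplexity(docs, vocab, theta, phi)

best_k = min(scores, key=scores.get)
vocab, theta, phi = fit_lda_gibbs(docs, num_topics=best_k)
# Cluster each document under its most probable topic.
clusters = [row.index(max(row)) for row in theta]
print(scores, best_k, clusters)
```

Lower perplexity indicates a better fit to the documents it is computed on. Training-set perplexity tends to favor larger K, so a held-out split is usually the safer basis for choosing the number of topics.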