Jimma University Open access Institutional Repository

Statistical Topic Modeling for Afaan Oromo Document Clustering using Latent Dirichlet Allocation (LDA)


dc.contributor.author Fikadu Wayesa
dc.contributor.author Million Meshesha
dc.contributor.author Kibret Zewde
dc.date.accessioned 2021-02-05T12:38:41Z
dc.date.available 2021-02-05T12:38:41Z
dc.date.issued 2019
dc.identifier.uri https://repository.ju.edu.et//handle/123456789/5398
dc.description.abstract The abundance of digital data makes it challenging to understand and use the overwhelming amount of information available. Manually reading large collections for content analysis is inefficient. The amount of information available in digital form keeps doubling, which is leading to information overload in almost all languages. In machine learning, topic modeling has therefore been accepted as a powerful technique for content analysis. In this study, we used Latent Dirichlet Allocation (LDA) based topic modeling to analyze Afaan Oromo text documents and extract appropriate topic tags from the collection. LDA relies on unsupervised machine learning to extract topics from a document collection, generating probabilistic word-topic and topic-document associations from the latent topics in the text using hidden random variables. Because LDA is limited by its bag-of-words assumption, we combined it with a word embedding approach, which captures how words are semantically related to each other, to improve the quality of the extracted topics. We used a collection of 16 documents from 4 different categories (Health, Education, Sport and Weather condition). A clean text corpus and parameter settings that generate interpretable topics are important prerequisites for obtaining a valid interpretation of topics from the algorithm. The experiments include all necessary steps: data collection, pre-processing, model fitting and an application to document exploration. We used Gibbs sampling to estimate the topic and word distributions. Experiments were carried out to confirm the topic extraction effect of the algorithm. Documents were clustered and explored with LDA based on the generated topics. Our study used three evaluation metrics: perplexity to select better LDA model parameters, topic coherence to evaluate the coherence of topics, and human judgment to assess topic interpretability. An average perplexity score of -9.775 was obtained with 10 topics, chosen from values of K ranging from a minimum of 2 to a maximum of 20; topic coherence measured by PMI was 52.5%, and overall human judgment gave an F-measure of 66%. en_US
dc.language.iso en en_US
dc.subject Topic Modeling en_US
dc.subject LDA en_US
dc.subject Statistical Modeling en_US
dc.subject Topic Extraction en_US
dc.subject Latent Topics en_US
dc.subject Big Data en_US
dc.subject Afaan Oromo en_US
dc.title Statistical Topic Modeling for Afaan Oromo Document Clustering using Latent Dirichlet Allocation (LDA) en_US
dc.type Thesis en_US
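
The abstract mentions that Gibbs sampling was used to estimate the topic and word distributions and that documents were clustered by their generated topics. The following is only a minimal sketch of a collapsed Gibbs sampler for LDA on a toy corpus, not the thesis implementation. The function name gibbs_lda, the hyperparameters alpha and beta, the iteration count, and the English placeholder tokens are all assumptions made for illustration. The word embedding step described in the abstract is not covered.

```python
import random


def gibbs_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch, hypothetical API).

    docs is a list of tokenized documents (lists of strings).
    Returns (vocab, theta, phi): theta[d][k] is the estimated topic share of
    document d, phi[k][v] is the estimated probability of word v in topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + vocab_size * beta)
         for v in range(vocab_size)]
        for t in range(num_topics)
    ]
    return vocab, theta, phi


# Toy corpus with English placeholder tokens (an assumption; the thesis
# corpus is Afaan Oromo text and is not reproduced here).
toy_docs = [
    ["health", "doctor", "hospital", "health"],
    ["school", "teacher", "student", "school"],
    ["doctor", "hospital", "patient"],
    ["student", "teacher", "exam"],
]
vocab, theta, phi = gibbs_lda(toy_docs, num_topics=2)

# Cluster each document by its highest-probability topic.
clusters = [max(range(len(row)), key=row.__getitem__) for row in theta]
```

In this sketch, theta plays the role of the topic-document associations and phi the word-topic associations named in the abstract. Taking the most probable topic per document is one simple way to turn the fitted model into clusters; the thesis may have used a different rule.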

