Jimma University Open access Institutional Repository

Short Amharic Text Clustering Using Topic Modeling

dc.contributor.author Kebede Assefa
dc.contributor.author Melkamu Beyene
dc.contributor.author Ephrem Tadesse
dc.date.accessioned 2021-02-11T07:00:20Z
dc.date.available 2021-02-11T07:00:20Z
dc.date.issued 2020-09
dc.identifier.uri https://repository.ju.edu.et//handle/123456789/5526
dc.description.abstract Text clustering groups texts by features defined on them that measure the similarity between two texts. Keyword-based models such as TF-IDF have been used as features in recent works. The keyword-based approach is not feasible for short texts because they contain only a few words; it also lacks semantic structure, which limits further analysis of the texts. Topic models were developed to discover probabilistic distributions of topics over a fixed vocabulary of keywords. Unlike TF-IDF, a topic model captures the semantic structure of texts, so it can label clusters not only with ids but also with the topic of each cluster. In this thesis work, we used topic modeling to discover latent/hidden topics from a collection of short texts through machine learning. Latent Dirichlet Allocation (LDA) is currently a popular and widely used topic modeling approach. We implemented the proposed model in Python with an LDA library tool. After LDA finds the hidden/latent topics in the given texts, we save the identified topics as features. The similarity between the saved features and each test-set text is then calculated to identify the cluster id of that text. We investigated the LDA approach to clustering short Amharic texts with and without word embedding as feature extraction. To evaluate the results, we collected several short Amharic texts covering different categories from the websites of different local news agencies. The experimental results show that LDA without word embedding achieves 90% accuracy, while LDA with word embedding as feature extraction achieves 97.17% accuracy. en_US
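The pipeline the abstract describes (fit LDA on a corpus, keep each document's topic distribution as a feature vector, then assign a new text to the cluster of the most similar training document) could be sketched as follows. This is a minimal illustration using scikit-learn; the toy English corpus, the cluster labels, and the `predict_cluster` helper are assumptions for demonstration, not the thesis's actual Amharic data, preprocessing, or code, and it omits the word-embedding variant.

```python
# Hypothetical sketch of LDA-based short-text clustering, per the abstract:
# topic distributions serve as saved features; a test text gets the cluster
# id of its most similar training document by cosine similarity.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative stand-in corpus (the thesis used short Amharic news texts).
train_texts = [
    "striker scores goal in league match",
    "team wins football cup final",
    "bank raises interest rates on loans",
    "stock market falls amid inflation fears",
]
train_labels = ["sport", "sport", "economy", "economy"]

# Bag-of-words counts, the usual input representation for LDA.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

# Discover latent topics; each row of train_topics is a document's
# probability distribution over topics, saved here as the feature set.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
train_topics = lda.fit_transform(X)

def predict_cluster(text):
    """Return the label of the training document with the most similar topic profile."""
    topics = lda.transform(vectorizer.transform([text]))
    sims = cosine_similarity(topics, train_topics)[0]
    return train_labels[int(np.argmax(sims))]

print(predict_cluster("football team scores in cup match"))
```

On a real corpus, the number of topics, the tokenizer (Amharic needs its own normalization and stopword handling), and the similarity measure would all be tuned; this sketch only shows the shape of the method.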
dc.language.iso en en_US
dc.subject Topic Modeling en_US
dc.subject Text Clustering en_US
dc.subject Latent Dirichlet Allocation (LDA) en_US
dc.title Short Amharic Text Clustering Using Topic Modeling en_US
dc.type Thesis en_US
