Abstract:
Text clustering groups texts according to a feature defined on them that measures the similarity between two texts. Keyword-based models such as TF-IDF have been used as this feature in recent works. A keyword-based approach is not feasible for short texts, because such texts contain only a few words; it also lacks semantic structure, which limits further analysis of the texts. Topic models have been developed to discover probabilistic distributions of topics over a fixed set of keywords (the vocabulary). Unlike TF-IDF, a topic model captures the semantic structure of texts and can describe a cluster not only by its id but also by its topic.
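A minimal sketch of this contrast, assuming gensim and a toy English corpus (the documents, number of topics, and passes are placeholder assumptions): TF-IDF yields per-keyword weights with no notion of topics, while LDA yields a probability distribution over latent topics.

    # Toy comparison of a TF-IDF keyword representation and an LDA topic
    # representation; the tiny corpus below is only a placeholder.
    from gensim import corpora, models

    docs = [["game", "match", "score"], ["vote", "party", "election"], ["match", "vote"]]
    dictionary = corpora.Dictionary(docs)
    bows = [dictionary.doc2bow(d) for d in docs]

    tfidf = models.TfidfModel(bows)   # per-keyword weights, no topics
    lda = models.LdaModel(bows, id2word=dictionary, num_topics=2, passes=10)

    print(tfidf[bows[0]])   # [(word_id, weight), ...]  -> keyword weights
    print(lda[bows[0]])     # [(topic_id, prob), ...]   -> distribution over topics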
In this thesis work, we have used topic modeling to discover latent (hidden) topics from a collection of short texts through machine learning. Latent Dirichlet Allocation (LDA) is currently a popular and widely used topic modeling approach. We have implemented the proposed model in Python with an LDA library. After LDA finds the latent topics in the training texts, we save the identified topic distributions as features. The similarity between the saved features and a test text is then calculated to assign the test text a cluster id. We have investigated this LDA approach to cluster short Amharic texts with and without word embedding as feature extraction.
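A minimal sketch of this pipeline, assuming gensim as the LDA library and cosine similarity for matching a test text against the saved topic features (the tokenized toy corpus, number of topics, and training passes are placeholder assumptions; real inputs would be tokenized Amharic news texts):

    from gensim import corpora, models, similarities

    train_docs = [["word1", "word2"], ["word3", "word4", "word5"]]  # tokenized short texts (placeholder)
    test_doc = ["word4", "word5"]                                   # tokenized test text (placeholder)

    dictionary = corpora.Dictionary(train_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in train_docs]

    # Train LDA to discover latent topics; num_topics is an assumed parameter.
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)

    # Save each training text's topic distribution as its feature vector.
    train_features = [lda[bow] for bow in bow_corpus]

    # Compare the test text's topic distribution with the saved features and
    # assign the cluster id of the most similar training text.
    index = similarities.MatrixSimilarity(train_features, num_features=lda.num_topics)
    test_feature = lda[dictionary.doc2bow(test_doc)]
    cluster_id = int(index[test_feature].argmax())
    print("assigned cluster id:", cluster_id)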
To evaluate the approach, we collected short Amharic texts from the websites of different local news agencies, covering several categories. The experimental results show that LDA without word embedding achieves an accuracy of 90%, while LDA with word embedding as feature extraction achieves an accuracy of 97.17%.