Abstract:
Text clustering groups texts according to a feature defined on them that measures the similarity between two texts. Keyword-based models such as TF-IDF have been used as this feature in recent works. A keyword-based approach is not feasible for short texts, because such texts contain only a few words; it also lacks semantic structure, which limits further analysis of the texts. Topic models have been developed to discover probabilistic distributions of topics over a fixed set of keywords (the vocabulary). Unlike TF-IDF, a topic model captures the semantic structure of texts and can describe a cluster not only by its id but also by its topic.
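A minimal sketch of this contrast, assuming gensim and a toy English corpus (the documents, number of topics, and passes are placeholder assumptions): TF-IDF yields per-keyword weights with no notion of topics, while LDA yields a probability distribution over latent topics.

    # Toy comparison of a TF-IDF keyword representation and an LDA topic
    # representation; the tiny corpus below is only a placeholder.
    from gensim import corpora, models

    docs = [["game", "match", "score"], ["vote", "party", "election"], ["match", "vote"]]
    dictionary = corpora.Dictionary(docs)
    bows = [dictionary.doc2bow(d) for d in docs]

    tfidf = models.TfidfModel(bows)   # per-keyword weights, no topics
    lda = models.LdaModel(bows, id2word=dictionary, num_topics=2, passes=10)

    print(tfidf[bows[0]])   # [(word_id, weight), ...]  -> keyword weights
    print(lda[bows[0]])     # [(topic_id, prob), ...]   -> distribution over topics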
In this thesis work, we have used topic modeling to discover latent (hidden) topics from a collection of short texts through machine learning. Latent Dirichlet Allocation (LDA) is currently a popular and widely used topic modeling approach. We have implemented the proposed model in Python with an LDA library. After LDA finds the latent topics in the training texts, we save the identified topic distributions as features. The similarity between the saved features and a test text is then calculated to assign the test text a cluster id. We have investigated this LDA approach to cluster short Amharic texts with and without word embedding as feature extraction.
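A minimal sketch of this pipeline, assuming gensim as the LDA library and cosine similarity for matching a test text against the saved topic features (the tokenized toy corpus, number of topics, and training passes are placeholder assumptions; real inputs would be tokenized Amharic news texts):

    from gensim import corpora, models, similarities

    train_docs = [["word1", "word2"], ["word3", "word4", "word5"]]  # tokenized short texts (placeholder)
    test_doc = ["word4", "word5"]                                   # tokenized test text (placeholder)

    dictionary = corpora.Dictionary(train_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in train_docs]

    # Train LDA to discover latent topics; num_topics is an assumed parameter.
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)

    # Save each training text's topic distribution as its feature vector.
    train_features = [lda[bow] for bow in bow_corpus]

    # Compare the test text's topic distribution with the saved features and
    # assign the cluster id of the most similar training text.
    index = similarities.MatrixSimilarity(train_features, num_features=lda.num_topics)
    test_feature = lda[dictionary.doc2bow(test_doc)]
    cluster_id = int(index[test_feature].argmax())
    print("assigned cluster id:", cluster_id)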
To evaluate the approach, we collected short Amharic texts from the websites of different local news agencies, covering several categories. The experimental results show that LDA without word embedding achieves an accuracy of 90%, while LDA with word embedding as feature extraction achieves an accuracy of 97.17%.