Abstract:
Text classification is the process of categorizing documents based on their content
into a predefined set of categories. Text classification algorithms typically represent
documents as collections of words and therefore deal with a large number of features. The
selection of appropriate features becomes important when the initial feature set is
quite large. In this paper, we present a hybrid feature selection method based on
document frequency (DF) and a genetic algorithm (GA) for Amharic text
classification. We evaluate this method on Amharic news documents obtained from
the Ethiopian News Agency (ENA), spanning 13 categories. Our experimental results showed that the proposed
feature selection method outperformed other feature selection methods used for
Amharic news document classification. Combined with the Extra Tree Classifier
(ETC), it achieved classification accuracy up to 1% higher than a hybrid of DF,
information gain (IG), chi-square (CHI), and principal component analysis (PCA),
2.47% higher than GA alone, and 3.86% higher than a hybrid of DF, IG, and CHI.
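
The general shape of a DF-plus-GA hybrid can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the abstract does not specify the DF threshold, the GA operators, or the fitness function, so the threshold, the one-point crossover, the bit-flip mutation, and the toy fitness below are all assumptions, and every function name is hypothetical. In practice the fitness would more plausibly be the validation accuracy of a classifier trained on the selected terms.

```python
import random
from collections import Counter


def df_filter(docs, min_df=2):
    """Stage 1 (DF): keep terms that occur in at least `min_df` documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return sorted(term for term, count in df.items() if count >= min_df)


def ga_select(terms, fitness, pop_size=20, generations=30,
              mutation_rate=0.05, seed=0):
    """Stage 2 (GA): search over boolean masks of `terms`, maximizing `fitness`."""
    rng = random.Random(seed)
    n = len(terms)
    population = [[rng.random() < 0.5 for _ in range(n)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0][:]]              # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1) if n > 1 else 0
            child = a[:cut] + b[cut:]          # one-point crossover
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]           # bit-flip mutation
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return [term for term, keep in zip(terms, best) if keep]


def make_toy_fitness(terms, informative, penalty=0.25):
    """Hypothetical stand-in for classifier accuracy: reward informative
    terms and penalize the size of the selected subset."""
    def fitness(mask):
        hits = sum(1 for t, keep in zip(terms, mask) if keep and t in informative)
        return hits - penalty * sum(mask)
    return fitness


# Tiny made-up corpus of tokenized documents (illustrative only).
docs = [
    ["sport", "goal", "team"],
    ["sport", "team", "win"],
    ["economy", "market", "bank"],
    ["economy", "bank", "rate"],
    ["goal", "rare"],
]
candidates = df_filter(docs, min_df=2)
toy_fitness = make_toy_fitness(candidates, informative={"bank", "goal", "team"})
selected = ga_select(candidates, toy_fitness)
print(candidates)
print(selected)
```

The DF stage cheaply shrinks the vocabulary before the GA runs, which matters because the GA search space grows exponentially with the number of candidate terms.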