Abstract:
Automatic document classification is an important task due to the rapid growth of the number of
electronic documents. Classification aims to assign a document to a predefined category
automatically based on its contents. In general, text classification plays an important role in
information extraction and summarization, text retrieval, question answering, e-mail spam detection,
web page content filtering, and automatic message routing. Most existing methods and techniques in
the field of document classification are keyword-based and use few additional features. Because
these techniques do not take semantics into account, they are outperformed by ontology-based text
categorization approaches. However, building an ontology for an under-resourced language is very
challenging, so ontology-based classification is limited to English language support. Hence, documents
written in under-resourced languages do not benefit from ontology-based text classification.
In this research, we propose an approach that classifies documents written in under-resourced
languages on top of an ontology of a well-resourced language. In addition, the proposed approach can
classify multilingual documents (i.e. Amharic, Afaan Oromo and Tigrinya textual documents)
on top of an English ontology. Furthermore, to show the practicality of the proposed approach, a
prototype was developed using a Java framework. To evaluate the performance of the proposed
approach, 20 test documents for Amharic and Tigrinya and 15 test documents for Afaan Oromo in
each news category were used. To observe the effect of the incorporated features (lemma-based
index term selection; pre-processing strategies, i.e. stopword removal and stemming, during concept
mapping; and semantic-based concept mapping) on the proposed document classifier, four
experiments were conducted. The experiments were evaluated using Recall, Precision
and F-measure to observe the impact of the proposed approach in improving the
document classification process. The experimental results show that the proposed document classifier,
with all features and components incorporated, achieved an average F-measure of 92.37%,
86.07% and 88.12% for Amharic, Afaan Oromo and Tigrinya documents respectively. These results
show that the proposed approach contributes effectively to classifying documents written in
under-resourced languages (i.e. Amharic, Afaan Oromo and Tigrinya documents) on top of
an ontology of a well-resourced language (i.e. an English ontology). To enhance the effectiveness of
the proposed approach, we recommend increasing the size and quality of the bilingual dictionary and
improving the performance of the part-of-speech tagger and the morphological analyzer.
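
As a rough illustration of how the evaluation measures named above are typically computed, the following Python sketch derives Precision, Recall and F-measure for each category from gold and predicted labels, then averages the F-measure over categories. The category names, the sample labels, the function names and the choice of an unweighted (macro) average are all assumptions made for illustration. They are not taken from the prototype or from the experiments reported here.

```python
from collections import Counter


def per_category_scores(gold, predicted):
    """Return {category: (precision, recall, f_measure)} for single-label data."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # document correctly assigned to its category
        else:
            fp[p] += 1          # wrongly assigned to category p
            fn[g] += 1          # missed for its true category g

    scores = {}
    for cat in sorted(set(gold) | set(predicted)):
        assigned = tp[cat] + fp[cat]   # documents the classifier put in cat
        relevant = tp[cat] + fn[cat]   # documents that truly belong to cat
        precision = tp[cat] / assigned if assigned else 0.0
        recall = tp[cat] / relevant if relevant else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        scores[cat] = (precision, recall, f_measure)
    return scores


def average_f_measure(scores):
    """Unweighted mean of the per-category F-measures (an assumed averaging scheme)."""
    if not scores:
        return 0.0
    return sum(f for _, _, f in scores.values()) / len(scores)


# Hypothetical labels, invented purely for this example.
gold = ["sport", "sport", "politics", "politics", "health", "health"]
predicted = ["sport", "politics", "politics", "politics", "health", "sport"]

scores = per_category_scores(gold, predicted)
for category, (p, r, f) in scores.items():
    print(f"{category}: P={p:.2f} R={r:.2f} F={f:.2f}")
print(f"average F-measure: {average_f_measure(scores):.4f}")
```

An unweighted average treats every news category equally regardless of how many test documents it contains. When each category holds the same number of test documents, the unweighted and document-weighted averages of recall coincide.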