Classification of Multilingual Under-resourced Language Documents using English Ontology

Tsegay Mullu

URI: https://repository.ju.edu.et//handle/123456789/4530

Date: 2017-11

Abstract:

Automatic document classification is an important task because of the rapid growth in the number of electronic documents. Classification aims to assign a document to a predefined category automatically, based on its contents. In general, text classification plays an important role in information extraction and summarization, text retrieval, question answering, e-mail spam detection, web page content filtering, and automatic message routing. Most existing document classification methods are keyword based and use few additional features. Because they do not consider semantics, they are outperformed by ontology-based text categorization. However, building an ontology for an under-resourced language is very challenging, so ontology-based classification is limited to English. Hence, documents written in under-resourced languages do not benefit from ontology-based text classification. In this research, we propose an approach that classifies documents written in under-resourced languages on top of a resourced-language ontology. In addition, the proposed approach can classify multilingual documents (i.e., Amharic, Afaan Oromo and Tigrinya textual documents) on top of an English ontology. To show the practicality of the proposed approach, a prototype was developed using a Java framework. To evaluate its performance, 20 test documents each for Amharic and Tigrinya and 15 test documents for Afaan Oromo were used in each news category. To observe the effect of the incorporated features (i.e., lemma-based index term selection, pre-processing strategies during concept mapping, namely stopword removal and stemming, and semantic-based concept mapping) on the proposed document classifier, four experiments were conducted.
The experiments were evaluated using recall, precision and F-measure to observe the impact of the proposed approach on the document classification process. The experimental results show that the proposed document classifier, with all features and components incorporated, achieved average F-measures of 92.37%, 86.07% and 88.12% for Amharic, Afaan Oromo and Tigrinya documents, respectively. These results show that the proposed approach contributes effectively to classifying documents written in under-resourced languages (i.e., Amharic, Afaan Oromo and Tigrinya) on top of a resourced-language ontology (i.e., an English ontology). To further improve the approach, the researcher recommends increasing the size and quality of the bilingual dictionary and improving the performance of the part-of-speech tagger and the morphological analyzer.
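
The evaluation measures named above can be illustrated with a minimal Python sketch. It is not the thesis prototype, which was built with a Java framework. The abstract does not state how the average F-measure was computed, so the sketch assumes an unweighted mean over news categories; the function names and the category labels ("sport", "politics") are hypothetical.

```python
from collections import Counter


def per_category_scores(gold, predicted):
    """Return {category: (precision, recall, f_measure)} for paired label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    gold_count = Counter(gold)        # documents that truly belong to each category
    pred_count = Counter(predicted)   # documents the classifier assigned to each category
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)

    scores = {}
    for category in sorted(set(gold) | set(predicted)):
        tp = true_pos[category]
        precision = tp / pred_count[category] if pred_count[category] else 0.0
        recall = tp / gold_count[category] if gold_count[category] else 0.0
        total = precision + recall
        # F-measure is the harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / total if total else 0.0
        scores[category] = (precision, recall, f_measure)
    return scores


def average_f_measure(scores):
    """Unweighted mean of per-category F-measures (an assumed averaging scheme)."""
    return sum(f for _, _, f in scores.values()) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels for four test documents.
    gold = ["sport", "sport", "politics", "politics"]
    predicted = ["sport", "politics", "politics", "politics"]
    scores = per_category_scores(gold, predicted)
    for category, (p, r, f) in scores.items():
        print(f"{category}: precision={p:.2f} recall={r:.2f} f_measure={f:.2f}")
    print(f"average f_measure={average_f_measure(scores):.4f}")
```

In this toy case one "sport" document is misclassified as "politics", which lowers recall for "sport" and precision for "politics"; the F-measure reflects both effects in a single number per category.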
