Jimma University Open access Institutional Repository

Classification of Multilingual Under-resourced Language Documents using English Ontology

dc.contributor.author Tsegay Mullu
dc.date.accessioned 2020-12-30T07:47:40Z
dc.date.available 2020-12-30T07:47:40Z
dc.date.issued 2017-11
dc.identifier.uri https://repository.ju.edu.et//handle/123456789/4530
dc.description.abstract Automatic document classification is an important task because the number of electronic documents is growing rapidly. Classification aims to assign a document to a predefined category automatically, based on its contents. In general, text classification plays an important role in information extraction and summarization, text retrieval, question answering, e-mail spam detection, web page content filtering, and automatic message routing. Most existing methods and techniques for document classification are keyword based and use few features. Because they do not take semantics into account, they are outperformed by ontology-based text categorization. However, building an ontology for an under-resourced language is very challenging, so ontology-based classification is limited to English. Hence, documents written in under-resourced languages do not benefit from ontology-based text classification. In this research, we propose an approach that can classify documents written in under-resourced languages on top of a resourced-language ontology. The proposed approach is also capable of classifying multilingual documents (i.e. Amharic, Afaan Oromo and Tigrinya textual documents) on top of an English ontology. Furthermore, to show the practicality of the proposed approach, a prototype was developed using a Java framework. To evaluate the performance of the proposed approach, 20 test documents for Amharic and Tigrinya and 15 test documents for Afaan Oromo in each news category were used. To observe the effect of the incorporated features (i.e. lemma-based index term selection, pre-processing strategies (i.e. stopword removal and stemming) during concept mapping, and semantic-based concept mapping) on the proposed document classifier, four experiments were conducted.
The experiments were evaluated using recall, precision and F-measure to observe the impact of the proposed approach on the document classification process. The experimental results show that the proposed document classifier, with all features and components incorporated, achieved average F-measures of 92.37%, 86.07% and 88.12% for Amharic, Afaan Oromo and Tigrinya documents respectively. These results show that the proposed approach contributes effectively to classifying documents written in under-resourced languages (i.e. Amharic, Afaan Oromo and Tigrinya documents) on top of a resourced-language ontology (i.e. an English ontology). To make the proposed approach more effective, the researcher recommends increasing the size and quality of the bilingual dictionary and improving the performance of the part-of-speech tagger and the morphological analyzer. en_US
dc.language.iso en en_US
dc.subject Multilingual en_US
dc.subject Text Mining en_US
dc.subject Documents or text Classification en_US
dc.subject News Ontology en_US
dc.title Classification of Multilingual Under-resourced Language Documents using English Ontology en_US
dc.type Thesis en_US
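
The abstract reports precision, recall and an average F-measure per language. As a minimal sketch of how such figures are commonly computed, the Python snippet below derives the three metrics from per-category counts and then averages the F-measure across categories. It assumes the balanced F-measure (F1) and an unweighted (macro) average; the abstract does not state which variant the thesis used. The function names, the category names and the counts are all hypothetical and are not taken from the thesis; the counts only echo the 20-document test-set figure mentioned in the abstract.

```python
from typing import Dict, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one category.

    tp: documents correctly assigned to the category
    fp: documents wrongly assigned to the category
    fn: documents of the category that were assigned elsewhere
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of the per-category F1 scores (an assumed averaging scheme)."""
    scores = [precision_recall_f1(*c)[2] for c in counts.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical (tp, fp, fn) counts for two made-up news categories,
# each with 20 test documents (tp + fn = 20). Not results from the thesis.
example_counts = {
    "sport": (18, 2, 2),
    "politics": (16, 4, 4),
}

if __name__ == "__main__":
    for name, c in example_counts.items():
        p, r, f = precision_recall_f1(*c)
        print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
    print(f"macro-averaged F1: {macro_f1(example_counts):.2f}")
```

A macro average weights every category equally, which suits a test set with the same number of documents per category; a micro average would instead pool the counts before computing the metrics.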

