Corpus Based Word Sense Disambiguation For Geez Language

Aschale, Amlakie; Anlay, Kinde; Abdurahman, Fetulhak

URI: https://repository.ju.edu.et//handle/123456789/6924

Date: 2020-03-30

Abstract:

In the 21st century, improving natural language applications can make communication between human beings and computers easier. Words in a language may have two or more meanings depending on the context in which they are used. Such words make communication between computers and humans difficult because computers cannot identify the proper meanings of ambiguous words. Word Sense Disambiguation (WSD) helps computers identify the intended meaning of an ambiguous word from its surrounding context. In this study, we built a WSD prototype model for the Geez language because WSD is an intermediate task for other NLP tasks such as information extraction, machine translation, speech recognition, and information retrieval. There are three types of WSD approaches: hybrid, knowledge-based, and corpus-based. Of these, we used the corpus-based approach, which can be further divided into supervised, semi-supervised, and unsupervised machine learning methods. We conducted our experiments on six ambiguous words of the Geez language, collecting a total of 2119 sentences (instances). The six ambiguous words are ሀለፈ (Halafe), ቆመ (ḱome), ባረከ (bareke), አስተርዓየ (astaraya), ገብረ (gebira), and ሰዓለ (Se’ale). We applied four clustering algorithms (EM, Simple K-Means, Farthest First, and Hierarchical Clusterer) and five classification algorithms (ADTree, AdaBoostM1, SMO, Bagging, and Naïve Bayes) to cluster and classify the sentences. We compared the corpus-based machine learning approaches and found that the semi-supervised approach achieved the best performance. Using the ADTree algorithm, the proposed method achieved average precision, recall, F1-score, and accuracy of 92.1%, 91.3%, 91%, and 91.1%, respectively. A window size of 4-4 was the optimal window size for identifying the meaning of the selected ambiguous words of the Geez language using the ADTree algorithm.
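
The abstract does not describe how the 4-4 context window was extracted. As a minimal sketch, assuming that "4-4" means up to four tokens on each side of the ambiguous word and that sentences are already tokenized, the idea could look like the following. The function name `context_window` and the placeholder tokens `w1` to `w10` are hypothetical and not taken from the study.

```python
from typing import List, Tuple


def context_window(
    tokens: List[str], target: str, left: int = 4, right: int = 4
) -> List[Tuple[List[str], List[str]]]:
    """Return (left_context, right_context) for every occurrence of target.

    Assumption: a "4-4" window means up to 4 tokens before and up to
    4 tokens after the ambiguous word; windows are truncated at the
    sentence boundaries.
    """
    windows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - left)  # do not run past the sentence start
            windows.append((tokens[start:i], tokens[i + 1:i + 1 + right]))
    return windows


# Placeholder sentence: w1..w10 stand in for real Geez tokens.
tokens = ["w1", "w2", "w3", "w4", "w5", "ቆመ", "w6", "w7", "w8", "w9", "w10"]
windows = context_window(tokens, "ቆመ")
```

The surrounding tokens in each window would then serve as features for the clustering or classification step; how the study encoded them is not stated in the abstract.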
