Jimma University Open access Institutional Repository

General Purpose Language Identification for Ethiopia Semitic Language using Hybrid Approach

dc.contributor.author Kidest Erigetie
dc.contributor.author Yaregal Assabi
dc.date.accessioned 2021-02-04T11:02:13Z
dc.date.available 2021-02-04T11:02:13Z
dc.date.issued 2017
dc.identifier.uri https://repository.ju.edu.et//handle/123456789/5372
dc.description.abstract Sophisticated technologies such as the Internet have turned the world into a single village. A vast amount of digitized information is generated, propagated, exchanged, stored and accessed every day across the world through the Internet and other media such as mobile networks. This accumulation of digital data makes information acquisition increasingly difficult, and natural language has become a critical obstacle. Natural Language Processing (NLP) is the step towards tackling this obstacle, and language identification is the first of the many steps used for information acquisition and other advanced NLP applications. It is a technique of labeling each word in a text or sentence with its corresponding language category. In past decades a number of research works have been done in the area of language identification. However, several issues remain unsolved: multilingual language identification, discriminating between documents written in very closely related languages, and labelling very short texts such as words or phrases. In addition, to the best of the researcher's knowledge, no language identifier has been developed for Ethiopian Semitic languages, although many have been developed using different approaches for European and other well-resourced languages. In this investigation, we propose a hybrid approach that combines character ngrams and word ngrams with a rule-based approach, and that can address these unsolved issues for Ethiopian Semitic languages (i.e. Amharic, Geez, Guragigna and Tigrigna). The proposed general purpose language identifier can identify the language of a text at any level (i.e. word, phrase, sentence and document) in both monolingual and multilingual settings.
This capability comes from word-level language identification, in which every word is classified with regard to its language category, one word at a time. Text first passes through preprocessing steps. Words that can be handled by rules then pass through the rule-based component. Afterwards, the word ngram model, which uses the language of the previous word, is consulted; if the word does not exist there, character ngrams (infinite ngram) with location are extracted, the ngram probabilities are calculated, and the resulting ngram probability of the word is used to assign a language label to that word. Finally, sentence and document reformation is done for all texts. The system was developed in Java, and its performance was evaluated using 10-fold cross-validation. For training and testing, 27 MB of data from different sources (news, the Bible and books) were used. In addition, the effectiveness of the proposed language identifier was evaluated using the precision, recall and F-measure evaluation metrics. Different experiments were conducted on monolingual texts for hybrids of the character ngram, rule-based and word ngram approaches. The hybrid of fixed-size character ngram with location, word ngram and rule-based approach shows an average F-measure of 70.39%, 76.95%, 73.69% and 78.98% for Amharic, Geez, Guragigna and Tigrigna respectively. The hybrid of infinite ngram with location, word ngram and rule-based approach shows an average F-measure of 83.57%, 84.53%, 86.67% and 87.44% for Amharic, Geez, Guragigna and Tigrigna respectively. Adding sentence reformation to the hybrid model improves the accuracy to 99.85%, 99.74%, 100% and 99.93% for Amharic, Geez, Guragigna and Tigrigna respectively. Adding both sentence and document reformation improves the performance to 100% at word, phrase, sentence and document level in a monolingual setting.
In the multilingual setting, the system likewise attains an average F-measure of 100% at both sentence level and document level, but at phrase level it achieves an average F-measure of 82.64%, 86.38%, 87.19% and 86.81% for Amharic, Geez, Guragigna and Tigrigna respectively. Hence, adding sentence-level and document-level reformation to the hybrid of infinity ngram with location feature set is found to be the best combination for the proposed general purpose language identifier. en_US
dc.language.iso en en_US
dc.subject language identification en_US
dc.subject multilingual en_US
dc.subject monolingual en_US
dc.subject Naïve Bayes en_US
dc.subject ngram en_US
dc.subject closely related language en_US
dc.subject ngram location en_US
dc.subject word level en_US
dc.subject infinity ngram en_US
dc.subject fixed length character ngram en_US
dc.title General Purpose Language Identification for Ethiopia Semitic Language using Hybrid Approach en_US
dc.type Thesis en_US
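
The word-level pipeline described in the abstract (rules first, then a word lookup, then character ngrams with location scored probabilistically) can be illustrated with a minimal sketch. This is not the thesis's Java implementation: it is written in Python for brevity, and the class name, the exact-match rule table, the single-language lexicon lookup (a simplification of the word ngram step), the add-one smoothing and the uniform language priors are all assumptions made for illustration. The training words are toy placeholders with hypothetical labels "L1" and "L2", not real Amharic, Geez, Guragigna or Tigrigna data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams_with_location(word, n_max=None):
    """Return every character n-gram of the word paired with its start index.

    With n_max=None the n-gram length runs up to the full word length,
    which is one reading of the abstract's "infinite ngram" (an assumption).
    """
    n_max = n_max or len(word)
    return [(word[i:i + n], i)
            for n in range(1, min(n_max, len(word)) + 1)
            for i in range(len(word) - n + 1)]


class WordLevelIdentifier:
    """Toy hybrid identifier: rules, then lexicon, then Naive Bayes fallback."""

    def __init__(self, rules=None):
        self.rules = rules or {}                  # hypothetical: exact word -> language
        self.lexicon = defaultdict(Counter)       # word -> counts per language
        self.ngram_counts = defaultdict(Counter)  # language -> (ngram, index) counts
        self.totals = Counter()                   # language -> total feature count
        self.vocab = set()                        # all (ngram, index) features seen

    def train(self, labelled_words):
        for word, lang in labelled_words:
            self.lexicon[word][lang] += 1
            for feat in char_ngrams_with_location(word):
                self.ngram_counts[lang][feat] += 1
                self.totals[lang] += 1
                self.vocab.add(feat)

    def identify(self, word):
        # Step 1: rule-based lookup.
        if word in self.rules:
            return self.rules[word]
        # Step 2: known word seen in exactly one language.
        seen = self.lexicon.get(word)
        if seen and len(seen) == 1:
            return next(iter(seen))
        # Step 3: character n-grams with location, add-one smoothed
        # log-probabilities, uniform prior over languages.
        v = len(self.vocab)

        def score(lang):
            return sum(
                math.log((self.ngram_counts[lang][f] + 1) / (self.totals[lang] + v))
                for f in char_ngrams_with_location(word))

        return max(self.totals, key=score)


identifier = WordLevelIdentifier(rules={"xyz": "L2"})
identifier.train([("abab", "L1"), ("abba", "L1"),
                  ("cdcd", "L2"), ("dccd", "L2")])
```

Calling `identifier.identify(word)` on each token of a preprocessed text yields one label per word; the sentence and document reformation step mentioned in the abstract is not modelled here.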

