Abstract:
Advanced technologies such as the Internet have turned the world into a single village. Every
day, a vast amount of digitized information is generated, propagated, exchanged, stored and
accessed across the world through the Internet and other media such as mobile
networks. The accumulation of digital data is making information
acquisition increasingly difficult, with natural language becoming a critical obstacle. Natural
Language Processing (NLP) is the means of tackling this obstacle, and language identification is
the first of the many steps used for information acquisition and other advanced NLP
applications. It is a technique of labelling each word in a text or sentence with its corresponding
language category. In past decades, a considerable amount of research has been done in the area of
language identification. However, several issues remain unsolved: multilingual
language identification, discriminating between the languages of very closely related
documents, and labelling the language of very short texts such as words or phrases. In
addition, to the best of our knowledge, no language identifier has been
developed for the Ethiopian Semitic languages, although many language identifiers have been developed
using different approaches for European and other well-resourced languages.
In this investigation, we propose a hybrid approach that combines character n-grams and word
n-grams with a rule-based approach. It is able to address the unsolved issues of language
identification mentioned above for the Ethiopian Semitic languages (i.e., Amharic, Geeze, Guragigna and
Tigrigna). The proposed general-purpose language identifier is capable of identifying
the language of a text at any level (i.e., word, phrase, sentence and document) in both
monolingual and multilingual settings. This capability follows from the
word-level nature of the identification, in which every word is
classified by its language category, one word at a time. Text is first passed through preprocessing
steps. Words that can be handled by rules are then labelled by the rule-based component. Afterwards,
a word n-gram lookup that takes the language of the previous word into account is performed. If the
word does not exist there, character n-grams (infinite n-grams) with their locations are extracted,
their n-gram probabilities are calculated, and these are combined into an n-gram probability for
the word, which is used to assign a language label to that word. Finally,
sentences and documents are re-formed for all texts.
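
The word-level pipeline described above can be illustrated with a minimal Python sketch. It is
not the implementation evaluated in this work. It rests on several assumptions of ours: "infinite
n-gram" is read as all character n-grams from length 1 up to the word length, "location" is reduced
to a coarse start/mid/end/whole tag, probabilities use add-one smoothing, and the language of the
previous word is used only to choose among languages in which a known word was seen. The names
HybridIdentifier, char_ngrams and digit_rule, the digit rule itself, and the toy languages "X" and
"Y" are hypothetical placeholders rather than real Ethiopian Semitic data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word):
    """Every character n-gram of `word` (n = 1..len(word)), tagged with a
    coarse location: 'whole', 'start', 'end' or 'mid'."""
    feats = []
    for n in range(1, len(word) + 1):
        for i in range(len(word) - n + 1):
            if n == len(word):
                loc = "whole"
            elif i == 0:
                loc = "start"
            elif i + n == len(word):
                loc = "end"
            else:
                loc = "mid"
            feats.append((word[i:i + n], loc))
    return feats


def digit_rule(word):
    """Hypothetical rule: digit-only tokens carry no language."""
    return "OTHER" if word.isdigit() else None


class HybridIdentifier:
    def __init__(self, rules=(digit_rule,)):
        self.rules = list(rules)
        self.word_langs = defaultdict(Counter)    # word -> language counts
        self.ngram_counts = defaultdict(Counter)  # language -> feature counts
        self.vocab = set()                        # all features seen in training

    def train(self, labelled_words):
        for word, lang in labelled_words:
            self.word_langs[word][lang] += 1
            for feat in char_ngrams(word):
                self.ngram_counts[lang][feat] += 1
                self.vocab.add(feat)

    def _score(self, word, lang):
        """Add-one smoothed log-probability of the word's n-grams under `lang`."""
        counts = self.ngram_counts[lang]
        denom = sum(counts.values()) + len(self.vocab)
        return sum(math.log((counts[f] + 1) / denom) for f in char_ngrams(word))

    def label_word(self, word, prev_lang=None):
        for rule in self.rules:                   # 1. rule-based stage
            label = rule(word)
            if label is not None:
                return label
        if word in self.word_langs:               # 2. word-level lookup
            seen = self.word_langs[word]
            if prev_lang in seen:                 # prefer the previous word's language
                return prev_lang
            return seen.most_common(1)[0][0]
        # 3. character n-gram fallback for unseen words
        return max(self.ngram_counts, key=lambda lang: self._score(word, lang))

    def identify(self, text):
        """Label every whitespace-separated word; returns (word, label) pairs."""
        labelled, prev = [], None
        for word in text.split():
            label = self.label_word(word, prev)
            labelled.append((word, label))
            if label in self.ngram_counts:        # only real languages carry over
                prev = label
        return labelled


# Toy usage with made-up words for two placeholder languages.
ident = HybridIdentifier()
ident.train([("kala", "X"), ("kalo", "X"), ("bala", "X"), ("na", "X"),
             ("trin", "Y"), ("strin", "Y"), ("trint", "Y"), ("na", "Y")])
print(ident.identify("kala na 42 trik"))
```

Because every word receives its own label, the same routine applies to a single word, a phrase, a
sentence or a mixed-language document. Re-forming sentences and documents then amounts to grouping
the returned (word, label) pairs.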