Abstract:
This research addresses the end-to-end Named Entity Recognition and Disambiguation problem
using a supervised machine learning approach. A feature extraction algorithm was developed
that removes the dependency of named entity classification on other natural language
processing tasks for its features. Feature information, represented as word vectors, is
generated from unlabeled Afaan Oromo text. These word vectors are used as features for
Afaan Oromo named entity classification and for the similarity measurements in named entity
disambiguation.
A corpus of 10,000 sentences was collected and annotated for named entity recognition.
Word embeddings were trained on 4 million sentences. A knowledge base of 1,000 unique
entities with their contexts was developed for named entity disambiguation. A Conditional
Random Field (CRF) was trained using word embeddings as features for named entity
recognition. Context-based similarity measurement was implemented for named entity
disambiguation. Cosine similarity, Euclidean distance, and the Jaccard coefficient were
tested for measuring context similarity between the target entity context and the candidate
entity contexts.
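The three similarity measures named above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes that a context is available either as a numeric vector (for example, an average of word vectors) or as a list of tokens, and the helper best_candidate and the names candidate_a and candidate_b are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors (0.0 = identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_coefficient(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # assumption: two empty contexts share nothing
    return len(a & b) / len(a | b)


def best_candidate(target_vec, candidates):
    """Hypothetical helper: name of the candidate whose context vector is most
    similar to the target context vector under cosine similarity."""
    return max(candidates, key=lambda name: cosine_similarity(target_vec, candidates[name]))


# Toy data, for illustration only.
candidates = {"candidate_a": [0.9, 0.1], "candidate_b": [0.0, 1.0]}
chosen = best_candidate([1.0, 0.0], candidates)
```

Cosine similarity and the Jaccard coefficient grow with similarity, whereas Euclidean distance shrinks, so a ranking based on Euclidean distance would select the minimum rather than the maximum.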
In the experiments, the highest F-score achieved for named entity recognition was 82.3%
using the CRF classifier. This result is similar to the state of the art; however, the
feature extractor is unsupervised and does not depend on other NLP applications. The highest
named entity disambiguation accuracy was 62.93% when using the output of the named entity
recognizer and 74.21% when using a data set annotated by humans.