Abstract:
Digital technology now enables people around the world to communicate and exchange information
easily. As a result, we live in an era of information overload, in which many types of information
are collected from different sources. As the amount of available digital information increases, it
becomes harder to access information from these sources efficiently. Machine learning based NLP
contributes substantially to addressing this problem. In this work we focus on a semantic similarity
measure for plagiarism detection in Afaan Oromo documents. To support the semantic approach, we built
a sample dictionary that represents synonymous terms. The study used the LSI (latent semantic
indexing) approach to decompose sentences into a term matrix for similarity calculation. We collected
3 documents containing 15, 14 and 11 sentences respectively: two documents from published Afaan Oromo
fiction and one document of personal bibliography from Afaan Oromo FBC. Text preprocessing was
applied to the dataset. A prototype of the proposed model was developed in Java, and SQL was used to
build the sample dictionary.
The prototype was tested on a suspicious query of 10 sentences against the 3 source documents, which
contain 275 key terms. The accuracy achieved in detecting plagiarism in the suspicious query was 53.02%.
The accuracy was not high because the dataset was small. In addition, stemming and POS tagging were
not applied in this work. As future steps, we recommend using a larger dataset and applying stemming
and POS tagging to improve the accuracy.
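
The idea of combining a synonym dictionary with term-vector similarity can be illustrated with a
minimal Java sketch. It is not the prototype described above: the class name, method names and the
English placeholder words are assumptions made for illustration, the dictionary is an in-memory map
rather than an SQL table, and the LSI decomposition (SVD) step is omitted, so only plain cosine
similarity over term counts is shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: synonym normalization followed by cosine similarity.
// The SVD step of LSI is deliberately left out to keep the example short.
public class SemanticSimilaritySketch {

    // Lowercase and tokenize a sentence, then replace each term with its
    // canonical synonym when the dictionary has an entry for it.
    public static List<String> normalize(String sentence, Map<String, String> synonyms) {
        List<String> tokens = new ArrayList<>();
        for (String raw : sentence.toLowerCase().split("[^\\p{L}']+")) {
            if (raw.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            tokens.add(synonyms.getOrDefault(raw, raw));
        }
        return tokens;
    }

    // Count how often each term occurs (one column of a term-sentence matrix).
    public static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two sparse term-count vectors.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty sentence has no similarity to anything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Similarity of two sentences after synonym normalization.
    public static double similarity(String s1, String s2, Map<String, String> synonyms) {
        return cosine(termCounts(normalize(s1, synonyms)),
                      termCounts(normalize(s2, synonyms)));
    }

    public static void main(String[] args) {
        // Placeholder dictionary entry, not taken from the study's dictionary.
        Map<String, String> synonyms = new HashMap<>();
        synonyms.put("large", "big");

        String source = "the big house";
        String suspicious = "the large house";

        System.out.println("with dictionary:    " + similarity(source, suspicious, synonyms));
        System.out.println("without dictionary: "
                + similarity(source, suspicious, new HashMap<String, String>()));
    }
}
```

In this sketch the synonym substitution makes the two placeholder sentences map to the same term
vector, so their similarity is higher with the dictionary than without it. A full LSI pipeline would
additionally reduce the term-sentence matrix with a truncated SVD before comparing vectors.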