Abstract:
In NLP, sentence identification and simplification are important for machine translation, parsing,
question generation, information extraction, summarization, semantic role labeling, opinion
mining, and similar tasks. Most of these applications use sentence simplification as a preprocessing
step to improve their performance. Sentence simplification also benefits a wide range of readers
who have language difficulties, such as people with aphasia, children, and adults learning the
language (non-native speakers).
This study presents an automatic, syntax-based approach to Afan Oromo sentence identification and
simplification using a rule-based method that operates on POS tags. The work is divided into
two tasks. The first task is the identification and separation
of Afan Oromo declarative sentences into simple, compound, complex, and compound-complex
sentences. The second task is the simplification of compound sentences into simple, self-contained sentences while preserving the original meaning as much as possible.
Sentence identification and separation were performed to improve the performance of sentence
simplification.
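
As a rough illustration of how a rule-based identifier can operate on POS tags, the sketch below labels a tagged sentence by checking for conjunction tags. It is a minimal example under stated assumptions, not the rules used in this study: the tag names `CC` and `SC` and the function `classify_sentence` are hypothetical, the tokens are placeholders, and a real system would also need to check that a conjunction joins clauses rather than phrases.

```python
from typing import List, Tuple

# Hypothetical tag names, assumed for illustration only;
# they are not taken from the study's Afan Oromo tagset.
COORD_TAG = "CC"   # coordinating conjunction
SUBORD_TAG = "SC"  # subordinating conjunction


def classify_sentence(tagged: List[Tuple[str, str]]) -> str:
    """Label a POS-tagged sentence by the conjunction tags it contains."""
    tags = [tag for _, tag in tagged]
    has_coord = COORD_TAG in tags
    has_subord = SUBORD_TAG in tags
    if has_coord and has_subord:
        return "compound-complex"
    if has_coord:
        return "compound"
    if has_subord:
        return "complex"
    return "simple"


# Placeholder tokens (w1, w2, ...) stand in for real words.
example = [("w1", "NN"), ("w2", "VB"), ("w3", "CC"), ("w4", "NN"), ("w5", "VB")]
label = classify_sentence(example)
```

Routing only the sentences labeled as compound to the simplification step reflects the idea that identification is performed first to support simplification.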
A recursive algorithm is developed for both sentence identification and simplification,
based on the syntactic structure of the sentences. To determine the syntactic structure of a
sentence, a POS tagger is used as a preprocessor, and then the sentence type indicators and sentence
simplification features are processed. To evaluate the algorithms, a dataset containing 480
sentences was collected from the Afan Oromo textbook and annotated with the help of an expert.
The performance of the sentence identification and compound sentence simplification
algorithms is evaluated separately in terms of precision and recall, based on expert
judgments. The expert classifies the identified and simplified sentences as correct or
incorrect by comparing the system's output with the gold standard produced by the language
expert. The sentence simplification evaluation criteria include the grammaticality and fluency of the
simplified sentence and the preservation of the original meaning. Sentence identification and
compound sentence simplification achieve F-scores of 90% and 84.4%, respectively. The evaluation results suggest that the proposed algorithm is promising,
given that this is an initial study conducted with limited resources.
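
For readers unfamiliar with the metrics, the following minimal sketch shows how precision, recall, and the F-score relate, assuming the balanced F1 variant (the abstract does not state which F-score weighting was used). The function name and the counts in the usage lines are hypothetical and are not the study's data.

```python
from typing import Tuple


def precision_recall_f1(correct: int, spurious: int, missed: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and balanced F1 from raw counts.

    correct  -- outputs the expert judged correct (true positives)
    spurious -- outputs the expert judged incorrect (false positives)
    missed   -- gold-standard items the system failed to produce (false negatives)
    """
    produced = correct + spurious
    expected = correct + missed
    precision = correct / produced if produced else 0.0
    recall = correct / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f1(correct=8, spurious=2, missed=2)
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two values, which is why it is often reported as a single summary figure.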