Abstract:
Natural language processing (NLP) plays an important role in daily life by enabling computers
to understand human languages. Text chunking is an essential task in NLP applications. The
information generated by this task is useful for many purposes, including automatic text
summarization, question answering, and information extraction. Text chunking (TC), or shallow
parsing, is the process of grouping an input text or sentence into syntactically related,
non-overlapping groups of words, or chunks, such as noun phrases, verb phrases, prepositional
phrases, and adjective phrases. The central notion of TC is that each word or chunk should be
a member of exactly one syntactic structure; a chunk cannot be a member of two or more
syntactic structures.
The objective of this research is to develop an Amharic text chunker using a machine learning
approach, specifically by adopting conditional random fields (CRFs) and memory-based learning
(MBL). To find the optimal feature set for the chunker, the researchers conducted experiments
under different scenarios until a promising result was obtained.
In this study, sentences were collected from Amharic grammar books, news articles, magazines,
and news from the Walta Information Center (WIC) for the training and testing datasets.
Unlike the data collected from WIC, the data collected from Amharic grammar books, news
articles, and magazines were not tagged at all. These datasets were therefore analyzed and
tagged manually and used as a corpus for model training and testing. In addition, the entire
dataset was chunk-tagged manually for training and approved by linguistic professionals. To
identify phrase boundaries, the IOB2 chunk specification was selected and used in this study.
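As an illustration of the IOB2 scheme, in which the first token of each chunk is tagged B-X,
the remaining tokens of the same chunk are tagged I-X, and tokens outside any chunk are tagged
O, the following minimal Python sketch converts a chunked sentence into per-token tags. The
function name chunks_to_iob2, the English example sentence, and its chunk labels are
assumptions made purely for illustration; they are not taken from the corpus or the
implementation used in this study.

```python
from typing import List, Tuple

# Hypothetical helper (not from the study): turn labelled chunks into IOB2 tags.
def chunks_to_iob2(chunks: List[Tuple[str, List[str]]]) -> List[Tuple[str, str]]:
    tagged = []
    for label, tokens in chunks:
        for i, token in enumerate(tokens):
            if label == "O":
                tag = "O"              # token lies outside every chunk
            elif i == 0:
                tag = "B-" + label     # first token of a chunk
            else:
                tag = "I-" + label     # token inside the same chunk
            tagged.append((token, tag))
    return tagged

# Illustrative English sentence; each word belongs to exactly one chunk.
example = [
    ("NP", ["the", "small", "boy"]),
    ("VP", ["is", "reading"]),
    ("NP", ["a", "book"]),
    ("O", ["."]),
]

for token, tag in chunks_to_iob2(example):
    print(f"{token}\t{tag}")
```

Because every chunk opens with a B- tag, two adjacent chunks of the same type remain
distinguishable, and each token receives exactly one tag, which reflects the non-overlapping
property of chunks described above.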
Experiments were conducted on the training and testing datasets under different scenarios to
improve the accuracy of the chunker. The experiments on Amharic text chunking achieved
highest accuracies of 97.26% and 82.08% using CRFs and MBL, respectively.