Amharic text chunker using machine learning approach

Birhan Hailu

URI: https://repository.ju.edu.et//handle/123456789/5391

Date: 2019

Abstract:

Natural language processing (NLP) plays an important role in daily life by enabling computers to understand human languages. Text chunking is one of the essential tasks in NLP applications. The information generated by this task is useful for many purposes, including automatic text summarization, question answering, and information extraction. Text chunking (TC), or shallow parsing, is the process of grouping an input text or sentence into syntactically related, non-overlapping groups of words, or chunks, such as noun phrases, verb phrases, prepositional phrases, and adjective phrases. The underlying notion of TC is that each word or chunk belongs to exactly one syntactic structure; it cannot be a member of two or more. The objective of this research is to develop an Amharic text chunker using a machine learning approach, specifically by adopting conditional random fields (CRFs) and memory-based learning (MBL). To find the optimal feature set for the chunker, the researcher conducted experiments under different scenarios until a promising result was obtained. Sentences for the training and testing datasets were collected from Amharic grammar books, news articles, magazines, and the news of Walta Information Center (WIC). Unlike the data collected from WIC, the data collected from Amharic grammar books, news articles, and magazines were not tagged at all. These data were therefore analyzed and tagged manually before being used as a corpus for model training and testing. The entire dataset was then chunk-tagged manually for training and approved by linguistic professionals. The IOB2 chunk specification was selected to identify phrase boundaries. Experiments were conducted on the training and testing datasets under different scenarios to improve the accuracy of the chunker.
The highest accuracy achieved in the experiments on Amharic text chunking was 97.26% using CRFs and 82.08% using MBL.
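
As an illustration of the IOB2 chunk specification mentioned in the abstract, the following is a minimal Python sketch and not the thesis's own code. In IOB2, the first token of every chunk is tagged `B-<label>`, the remaining tokens of that chunk are tagged `I-<label>`, and tokens outside any chunk are tagged `O`. The function names `chunks_to_iob2` and `iob2_to_chunks`, the `EXAMPLE` sentence (English placeholder tokens rather than Amharic data), and the labels `NP` and `VP` are assumptions made for this example.

```python
from typing import List, Tuple

# A chunk is (label, tokens); the label "O" marks a token outside any phrase.
Chunk = Tuple[str, List[str]]


def chunks_to_iob2(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Flatten labelled chunks into (token, IOB2 tag) pairs."""
    tagged = []
    for label, tokens in chunks:
        for i, token in enumerate(tokens):
            if label == "O":
                tag = "O"
            else:
                # IOB2: every chunk starts with B-, the rest is I-.
                tag = ("B-" if i == 0 else "I-") + label
            tagged.append((token, tag))
    return tagged


def iob2_to_chunks(tagged: List[Tuple[str, str]]) -> List[Chunk]:
    """Group (token, IOB2 tag) pairs back into labelled chunks."""
    chunks: List[Chunk] = []
    for token, tag in tagged:
        if tag == "O":
            # Each outside token is kept as its own one-token "O" chunk.
            chunks.append(("O", [token]))
        elif tag.startswith("B-") or not chunks or chunks[-1][0] != tag[2:]:
            # A B- tag (or a stray I- tag) opens a new chunk.
            chunks.append((tag[2:], [token]))
        else:
            # An I- tag continues the current chunk of the same label.
            chunks[-1][1].append(token)
    return chunks


# Hypothetical placeholder sentence, not taken from the thesis corpus.
EXAMPLE: List[Chunk] = [
    ("NP", ["the", "little", "boy"]),
    ("VP", ["kicked"]),
    ("NP", ["the", "ball"]),
    ("O", ["."]),
]

if __name__ == "__main__":
    for token, tag in chunks_to_iob2(EXAMPLE):
        print(f"{token}\t{tag}")
```

Because every chunk begins with a `B-` tag, two adjacent chunks of the same type remain distinguishable, which keeps the chunks non-overlapping and makes the boundaries recoverable from the tag sequence alone.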
