Abstract:
Evaluating students’ capacity to construct a sustained argument through subjective questions allows mentors to assess learners’ implicit understanding. However, manual evaluation of subjective questions is a challenging process and results in grading inconsistency. Since the early 1960s, several approaches have been proposed to automate subjective question marking, with most attention given to essays. Recently, with the advent of deep learning techniques, automatic essay assessment has shown improved results that approach human raters without the need for handcrafted features.
The aim of this study was to build a model that can evaluate both essay and short answer questions without handcrafted features using deep learning techniques. Given an essay or short answer word sequence, our model first embeds word-level context using FastText word vectors together with subword embeddings built by a character-based convolutional neural network.
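To illustrate this embedding step, the following is a minimal PyTorch sketch; the class name WordEmbedder and all hyperparameters are illustrative assumptions, not the exact configuration used in this work:

```python
import torch
import torch.nn as nn

class WordEmbedder(nn.Module):
    """Sketch: concatenate pretrained FastText word vectors with subword
    features from a character-level CNN with max-over-time pooling."""

    def __init__(self, fasttext_weights, n_chars, char_dim=16, n_filters=50):
        super().__init__()
        # fasttext_weights: (vocab_size, d_word) tensor of pretrained vectors.
        self.word_emb = nn.Embedding.from_pretrained(fasttext_weights, freeze=True)
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, n_filters, kernel_size=3, padding=1)

    def forward(self, word_ids, char_ids):
        # word_ids: (B, T); char_ids: (B, T, L) character ids per word.
        w = self.word_emb(word_ids)                  # (B, T, d_word)
        B, T, L = char_ids.shape
        c = self.char_emb(char_ids.view(B * T, L))   # (B*T, L, char_dim)
        c = self.char_cnn(c.transpose(1, 2))         # (B*T, n_filters, L)
        c = c.max(dim=2).values.view(B, T, -1)       # max-over-time pooling
        return torch.cat([w, c], dim=-1)             # (B, T, d_word + n_filters)
```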
For essays, the model encodes the embedded essay vectors hierarchically by applying a two-level bidirectional recurrent neural network. We applied hierarchical word- and sentence-level attention to extract the most salient words within each sentence and the most salient sentences within the essay, respectively.
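A minimal sketch of this hierarchical encoder follows, continuing the assumptions above; the BiGRU cell and the additive attention scoring are illustrative choices, and the thesis may differ in the exact recurrent unit and attention form:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Additive attention pooling: score each timestep, softmax the scores,
    and return the attention-weighted sum of the hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, h):                                    # h: (B, T, dim)
        a = torch.softmax(self.score(h).squeeze(-1), dim=1)  # (B, T)
        return torch.bmm(a.unsqueeze(1), h).squeeze(1)       # (B, dim)

class HierarchicalEncoder(nn.Module):
    """Two-level encoder: a word-level BiGRU + attention builds sentence
    vectors; a sentence-level BiGRU + attention builds the essay vector."""
    def __init__(self, emb_dim, hidden=100):
        super().__init__()
        self.word_rnn = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.word_attn = AttentionPool(2 * hidden)
        self.sent_rnn = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.sent_attn = AttentionPool(2 * hidden)

    def forward(self, x):                            # x: (B, n_sents, n_words, emb_dim)
        B, S, W, D = x.shape
        h, _ = self.word_rnn(x.view(B * S, W, D))
        sents = self.word_attn(h).view(B, S, -1)     # attended sentence vectors
        h, _ = self.sent_rnn(sents)
        return self.sent_attn(h)                     # essay representation (B, 2*hidden)
```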
For short answers, we used the same encoder as for essays on both the model (reference) answer and the student answer vectors. Then, we applied reference attention on the encoded student answer vectors, using the model answer vector as the attention weight. Finally, answer-to-answer attention is applied in both directions, model-to-student and student-to-model, to measure the relatedness between the resulting vector and the encoded model answer.
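The two attention steps for short answers can be sketched as below; the dot-product scoring and the function names reference_attention and answer_to_answer are illustrative assumptions rather than the thesis’s exact formulation:

```python
import torch
import torch.nn.functional as F

def reference_attention(student_h, model_vec):
    """Weight each encoded student-answer position by its similarity to the
    pooled model-answer vector, then pool (dot-product scoring for brevity)."""
    # student_h: (B, T, d); model_vec: (B, d)
    scores = torch.bmm(student_h, model_vec.unsqueeze(-1)).squeeze(-1)  # (B, T)
    alpha = F.softmax(scores, dim=1)
    return torch.bmm(alpha.unsqueeze(1), student_h).squeeze(1)          # (B, d)

def answer_to_answer(student_h, model_h):
    """Bidirectional attention between the encoded answers: the student answer
    attends to the model answer (s2m) and vice versa (m2s)."""
    # student_h: (B, Ts, d); model_h: (B, Tm, d)
    sim = torch.bmm(student_h, model_h.transpose(1, 2))                 # (B, Ts, Tm)
    s2m = torch.bmm(F.softmax(sim, dim=2), model_h)                     # (B, Ts, d)
    m2s = torch.bmm(F.softmax(sim, dim=1).transpose(1, 2), student_h)   # (B, Tm, d)
    return s2m, m2s
```

The matched vectors would then be pooled and fed to a final scoring layer; the exact combination is left to the body of the thesis.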
We evaluated our model on three datasets: the Kaggle essay and short answer English datasets and an Amharic short answer dataset prepared for this thesis work. Experimental results on the Kaggle datasets show that our model achieves state-of-the-art performance for both essays and short answers, improving weighted Kappa by +2 and +4 points, respectively. The experiment on the Amharic dataset shows promising results, achieving 66% Pearson correlation and 62% Kappa on a small-sized dataset. This suggests that our model is capable of evaluating both short answer and essay questions from any domain in a very human-like way if trained on enough data. Our work did not consider subjective questions containing formulas and diagrams; we leave these for future work. We also recommend adding feedback that shows how the model scored an answer and which points the student’s answer missed.