Abstract:
Character identification is an entity linking task that finds the global entity of each
personal mention in multiparty dialogue. In this work, we combine coreference resolution
and entity linking to accomplish a more complicated task: identifying the characters in multiparty dialogue. The personal mentions are detected from nominals referring
to certain characters in a show, and the entities are collected from the list of all characters
in the series of the show. To tackle this task, we introduce a novel coreference resolution
algorithm that selectively creates clusters to handle both singular and plural mentions, and
a convolutional neural network-based entity linking model that jointly handles both
types of mentions through multitask learning.
Our approach models this task as coreference resolution followed by entity linking, which assigns character labels to clusters of named entity
mentions. An agglomerative convolutional neural network that takes groups of features and learns mention and mention-pair embeddings substantially improves the cluster purity
scores for coreference resolution. By integrating these two basic tasks, a deep learning model
was designed to identify the global personal mentions that refer to human characters.
Adjusted evaluation metrics are also proposed for these tasks to handle the uniqueness
of mentions. Three standard evaluation metrics, B-cubed, BLANC, and CEAFe, are applied, and each experiment shows that the new coreference resolution and entity linking
models significantly outperform the model developed earlier. To the best of our knowledge,
this is the first time that dialogue mentions are thoroughly analyzed for resolution tasks.
Transcripts of TV shows are collected as the corpus and manually annotated with mentions following
linguistically motivated rules. These mentions are manually linked to their referents. The
dataset used in this work follows the format of [10] and [15] and consists of dialogue from
two Amharic TV shows, Gemena and Sewlesew, in transcribed text form. In total, 25
episodes of the shows are annotated, comprising 164 dialogues, 155 scenes,
1840 mentions, and 146 entities. We evaluate our
models on this transcribed dataset with common evaluation metrics and achieve a character identification accuracy of
80.65% and an F1-score of 77.2% on the held-out episodes of the annotated test dataset,
and an accuracy of 87.2% and an F1-score of 63.2% on the overall dataset used in this research
work.
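
For readers unfamiliar with the B-cubed metric named above, the following is a minimal sketch of its standard definition, not the scorer used in this work. It assumes gold and system clusterings cover the same set of mentions; the function name b_cubed and the mention and cluster identifiers are hypothetical.

```python
from collections import defaultdict


def b_cubed(gold, system):
    """Return (precision, recall, f1) for two mention -> cluster-id mappings.

    Assumption: both mappings are defined over exactly the same mentions.
    """
    if set(gold) != set(system):
        raise ValueError("gold and system must cover the same mentions")
    if not gold:
        return 0.0, 0.0, 0.0

    def group(assignment):
        # Invert mention -> cluster id into cluster id -> set of mentions.
        clusters = defaultdict(set)
        for mention, cluster_id in assignment.items():
            clusters[cluster_id].add(mention)
        return clusters

    gold_clusters = group(gold)
    sys_clusters = group(system)

    precision_sum = 0.0
    recall_sum = 0.0
    for mention in gold:
        g = gold_clusters[gold[mention]]
        s = sys_clusters[system[mention]]
        overlap = len(g & s)
        precision_sum += overlap / len(s)  # share of the system cluster that is correct
        recall_sum += overlap / len(g)     # share of the gold cluster that was recovered

    n = len(gold)
    precision = precision_sum / n
    recall = recall_sum / n
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: four mentions, two gold entities (A, B).
gold = {"m1": "A", "m2": "A", "m3": "B", "m4": "B"}
system = {"m1": "X", "m2": "X", "m3": "X", "m4": "Y"}
print(b_cubed(gold, system))
```

In this toy example the system wrongly merges m3 into the first cluster, which lowers precision for m1, m2, and m3 and lowers recall for m3 and m4.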