Abstract:
This thesis addresses the problem of interpretable semantic textual similarity for English.
Given a pair of sentences, the system identifies the chunks in each sentence according to the
standard gold chunks, aligns corresponding chunks, assigns each aligned pair a similarity score,
and predicts the reason for their similarity or dissimilarity. For this computation, an approach
that blends the distributional hypothesis with knowledge-based methods was selected. Latent
semantic analysis (LSA) is a purely statistical technique that leverages word co-occurrence
information from a large unlabeled text corpus. It relies on the distributional hypothesis that
words occurring in similar contexts tend to have similar meanings. LSA word similarity is
computed from a statistical analysis of a preprocessed Wikipedia corpus and is boosted by
WordNet and string similarity.
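The idea of blending a distributional word similarity with a string similarity can be illustrated with a minimal sketch. It is not the system described in this thesis: the toy corpus, the function names, and the weight `alpha` are hypothetical, the SVD dimensionality reduction that LSA applies to the co-occurrence matrix is omitted, and the WordNet component is left out so that the example depends only on the Python standard library.

```python
import math
from collections import Counter, defaultdict
from difflib import SequenceMatcher

# Hypothetical toy corpus; the thesis uses a preprocessed Wikipedia corpus.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased a mouse",
    "a dog chased a ball",
    "stocks fell on the market",
]


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def string_similarity(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()


def blended_similarity(a, b, vectors, alpha=0.7):
    """Weighted mix of distributional and string similarity.

    `alpha` is an assumed weight, not a value taken from the thesis.
    """
    if a == b:
        return 1.0
    distributional = cosine(vectors.get(a, Counter()), vectors.get(b, Counter()))
    return alpha * distributional + (1 - alpha) * string_similarity(a, b)


VECTORS = cooccurrence_vectors(CORPUS)
```

In this toy corpus, `blended_similarity("cat", "dog", VECTORS)` comes out higher than `blended_similarity("cat", "market", VECTORS)`, because "cat" and "dog" occur in the same contexts even though they share no characters.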
Furthermore, semantic similarity measures between corresponding chunks are introduced in the
theoretical part. We selected and implemented 10 similarity measures. In the experimental part,
we propose five chunk similarity measures inspired by the state-of-the-art measures described in
Chapter 3. The evaluation reports results for two runs (Run1 and Run2) on two data sets (Images
and Headlines).
We conclude that the performance of the system is promising, and that the best result is
obtained with Run1, which depends on 𝑃𝑂𝑆𝑖𝑚.