Abstract:
Following the progress of general-purpose speaker recognition technology, application-specific systems based on voice biometrics are emerging. Forensic speaker recognition is one core application area of speaker recognition. Its chief application is
identifying the actual criminal among a given set of suspects, relying on
traced voice evidence. This thesis aims to adopt text-independent speaker identification for forensic speaker recognition, and to examine how the level of utterance of the
training and testing speech corpora affects proving the identity of the actual criminal among
the suspects.
The proposed system is built on two indispensable, consecutive stages: front-end feature extraction and back-end feature classification. The
front end extracts speaker-specific features using digital signal processing,
specifically Mel-Frequency Cepstral Coefficients (MFCCs). The back end performs feature
classification (suspected criminal speaker modeling and actual criminal identification).
A state-of-the-art Machine Learning (ML) based Gaussian Mixture Model (GMM), trained with the Expectation-Maximization (EM) algorithm, is used to build a reference model for
each suspected criminal speaker, and the Maximum Log-Likelihood (MLL) score is used to identify the actual criminal. In addition, to enhance the quality of
the speech corpora and to reduce the computational complexity of the feature extraction
and feature classification stages, preprocessing techniques (spectral noise gate based
background noise removal and short-time energy based voice activity detection (VAD) silence truncation) are
applied before the feature extraction stage.
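
The maximum log-likelihood decision described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes diagonal-covariance GMMs whose parameters have already been estimated (for example with EM), and the names `gmm_log_likelihood`, `identify`, `speaker_models`, and the toy two-dimensional "features" are hypothetical stand-ins for real MFCC frames and trained suspect models.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of `frames` under a diagonal-covariance GMM.

    frames:    list of feature vectors (e.g. MFCC frames)
    weights:   mixture weights, one per component (should sum to 1)
    means:     one mean vector per component
    variances: one per-dimension variance vector per component
    """
    total = 0.0
    for x in frames:
        component_logs = []
        for w, mu, var in zip(weights, means, variances):
            log_p = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
            component_logs.append(log_p)
        # Log-sum-exp over components for numerical stability.
        m = max(component_logs)
        total += m + math.log(sum(math.exp(c - m) for c in component_logs))
    return total / len(frames)


def identify(frames, speaker_models):
    """Return the speaker whose model gives the highest log-likelihood, plus all scores."""
    scores = {
        name: gmm_log_likelihood(frames, *params)
        for name, params in speaker_models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy models: (weights, means, variances) per suspect.
speaker_models = {
    "suspect_A": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "suspect_B": ([1.0], [[5.0, 5.0]], [[1.0, 1.0]]),
}
# Hypothetical test frames lying close to suspect_A's components.
test_frames = [[0.1, -0.2], [0.9, 1.1]]
best_speaker, all_scores = identify(test_frames, speaker_models)
```

Averaging the log-likelihood over frames, rather than summing it, is a design choice made here so that scores stay comparable across test utterances of different lengths; it does not change which model wins for a given utterance.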
To evaluate the performance of the proposed system, we carried out a simulation-based implementation in Python using the PyCharm environment. A self-collected and prepared Amharic speech corpus was used for the implementation. The
experimental evaluation was conducted on 20 speakers (acting on behalf of suspects), recorded during ongoing mobile phone conversations at
the callee side using a smartphone, and in an interview room using a recorder microphone,
in the form of rehearsed read speech. The system was trained and tested using the speech
corpora at three levels of utterance: word (WLU), sentence (SLU), and paragraph (PLU). The system achieved
identification rates (IDRs) of 84.29%, 95.00%, and 97.50% for WLU, SLU, and PLU, respectively, on the mobile phone
recorded speech corpora, and 85.00%, 96.25%, and 97.50% on the microphone recorded
speech corpora. These results indicate that, apart from selecting fitting feature extraction and modeling approaches, the level of utterance of a corpus also plays a significant role in
determining recognition performance, and that corpora with longer utterances
yield better performance. However, the proposed system still performs poorly
in crossed-level-of-utterance and multi-modal recording training-testing scenarios. Hence, improving performance in these scenarios is the next research direction
of this study.
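
As a rough illustration of how an identification rate can be computed, the sketch below assumes that the IDR is the percentage of test utterances whose predicted speaker matches the true speaker; the abstract does not spell out the formula, so this definition is an assumption. The function name `identification_rate` and the speaker labels are hypothetical and are not taken from the thesis data.

```python
def identification_rate(true_ids, predicted_ids):
    """Percentage of test utterances assigned to the correct speaker (assumed IDR definition)."""
    if not true_ids or len(true_ids) != len(predicted_ids):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_ids, predicted_ids))
    return 100.0 * correct / len(true_ids)


# Hypothetical labels for four test utterances; one is misidentified.
true_ids = ["spk01", "spk02", "spk03", "spk04"]
predicted_ids = ["spk01", "spk02", "spk04", "spk04"]
rate = identification_rate(true_ids, predicted_ids)
```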