Elsevier

Digital Signal Processing

Volume 50, March 2016, Pages 1-11

Local spectral variability features for speaker verification

https://doi.org/10.1016/j.dsp.2015.10.011

Abstract

Conventional speaker verification techniques neglect the short-time variation in the feature space, even though it contains speaker-related attributes. We propose a simple method to capture and characterize this spectral variation through the eigenstructure of the sample covariance matrix. This covariance is computed using a sliding window over spectral features. The newly formulated feature vectors representing local spectral variations are used with classical and state-of-the-art speaker recognition systems. Results on multiple speaker recognition evaluation corpora reveal that eigenvectors weighted with their normalized singular values are useful for representing local covariance information. We also show that local variability features can be extracted using mel-frequency cepstral coefficients (MFCCs) as well as three recently developed features: frequency domain linear prediction (FDLP), mean Hilbert envelope coefficients (MHECs) and power-normalized cepstral coefficients (PNCCs). Since the information conveyed in the proposed feature is complementary to the standard short-term features, we apply different fusion techniques. We observe considerable relative improvements in speaker verification accuracy in combined mode on text-independent (NIST SRE) and text-dependent (RSR2015) speech corpora: up to 12.28% relative improvement in recognition accuracy on the text-independent corpora, and up to 40% relative reduction in EER on the text-dependent corpus. To sum up, combining local covariance information with traditional cepstral features holds promise as an additional speaker cue in both text-independent and text-dependent recognition.

Introduction

Speaker verification systems use speech features extracted from the short-term power spectrum [1]. Commonly used short-term spectral features, such as mel-frequency cepstral coefficients (MFCCs) [2] and perceptual linear prediction (PLP) [3] features, are extracted from speech segments of 20–30 ms duration and represent the spectral characteristics of that segment [4]. However, the temporal variation of the spectrum also contains useful information about the dynamics of the speech production system. A common way to incorporate this information is to augment the static features with delta and double-delta coefficients computed over a temporal window of 50–100 ms [5], [6]. MFCCs along with deltas and double-deltas remain the primary features in state-of-the-art speaker verification, due to their reasonably high recognition accuracy and straightforward computation. This has sparked considerable research interest in further refinements such as feature post-processing. For example, cepstral mean and variance normalization (CMVN) [7] and feature warping [8] help to suppress channel and session variations. Different computational blocks of the MFCC algorithm have also been explored. For instance, [9] used a multiple-windowing (multitaper) technique in place of the conventional Hamming window, while [10] used regularized linear prediction (LP) analysis for power spectrum estimation. The classical triangular filter bank in MFCC extraction can be replaced with Gaussian-shaped filters [11], gammatone filters [12] or cochlear filters [13]. Root compression has been prescribed for reducing the dynamic range of mel filter energies in place of logarithmic compression [14]. An improved transformation of the filter bank log-energies was proposed in [15] and reported to yield higher recognition accuracy than the conventional discrete cosine transform (DCT) in both clean and noisy conditions.
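The delta and double-delta augmentation described above can be sketched with the standard regression formula over a symmetric context window. This is a common textbook formulation, not necessarily the exact configuration used in the paper; the function name and window half-width are illustrative.

```python
import numpy as np

def delta(features, N=2):
    """Delta coefficients via the standard regression formula.

    features: (num_frames, num_coeffs) array of static cepstra.
    N: half-width of the regression window (N=2 spans 5 frames).
    """
    num_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Pad by repeating edge frames so every frame has a full context.
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    out = np.empty_like(features, dtype=float)
    for t in range(num_frames):
        out[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                     for n in range(1, N + 1)) / denom
    return out

# Usage: 19 static MFCCs + deltas + double-deltas -> 57-dimensional vectors
mfcc = np.random.randn(100, 19)
d = delta(mfcc)
dd = delta(d)
full = np.hstack([mfcc, d, dd])   # shape (100, 57)
```

Applying the same regression twice (deltas of deltas) yields the double-delta stream, which is why the baseline feature dimensionality is three times the static one.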

Recently, further investigations have been carried out into new features for speaker recognition [16], [14] that internally utilize some form of long-term processing before extracting the short-term features. For instance, in frequency domain linear prediction (FDLP) [17], [18], the speech signal is first transformed into the frequency domain by applying the DCT directly to the signal. Subband Hilbert envelopes are then computed, followed by short-term energy computation in each band. In more recent work, short-term features called mean Hilbert envelope coefficients (MHECs) were proposed, derived from the subband Hilbert envelopes of auditory filter outputs [14]; here, gammatone filters are employed to simulate the effect of the auditory nerve. Both the FDLP and MHEC features were reported to give high accuracy in clean and noisy conditions alike. Another feature set, power-normalized cepstral coefficients (PNCCs), was recently proposed for robust speech recognition [19] and subsequently applied to speaker recognition with success [20]. A common characteristic of these long-term processing ideas, from a practical point of view, is that they have a large number of user-definable parameters that must be chosen carefully, and the settings vary widely across environmental effects and conditions [16], [21], [22], [14], [19]. This makes the end-user's task of finding the best feature configuration for a given environment difficult. In this paper, we introduce a new feature extraction technique which models local feature-space variability and, like delta features, can be computed from any spectral features. The variability is calculated directly from the covariances of the pre-computed cepstral features.

The use of covariance information has a long history in speech processing and speaker verification is no exception. Since the speech signal varies a lot depending on spoken content, channel, background noise and various other situational parameters, the acoustic features computed from the signal for the same speaker are never exact replicas across training and test utterances. To compensate for such nuisance variations, the speaker and language community has put considerable effort into (co)variance modeling of features and speaker models [23], [24]. In the classic techniques, uncertainty of speaker means is captured by covariance matrices in a Gaussian mixture model (GMM) [25]. In state-of-the-art systems, covariance modeling plays a major role at the later stages of the recognizer pipeline. For instance, nuisance attribute projection (NAP) [26] and within-class covariance normalization (WCCN) [27] utilize, respectively, the estimated channel and within-speaker covariance matrices to suppress the respective effects from GMM supervectors [26] or i-vectors [28]. Similarly, taking into account the uncertainty propagation at the PLDA model [29] helps to improve speaker recognition score with the use of posterior covariance estimation.

In most of the above-cited studies, covariance information has been used as a secondary tool for suppressing nuisance variations from the primary acoustic features (such as MFCCs) or from higher-level compact representations derived from them (such as i-vectors). In contrast to these prior studies, the new viewpoint of our work is the study of covariance features for speaker characterization. The proposed features are obtained by a low-cost procedure from the time-localized covariance of arbitrary acoustic features; accordingly, our input acoustic features include not only standard MFCCs but also the recently studied alternative parameterizations FDLP, MHEC and PNCC. Our method is inspired by the successful use of covariance-based features in applications outside of speech technology, such as movement detection and image classification [30], [31], [32], blind source separation [33], network anomaly detection [34], similarity analysis of multivariate time series [35] and brain–computer interfacing [36]. The intention of the present study is thus to provide a feasibility study of such features for speaker characterization. We first motivate and detail the proposed approach in Sections 2 and 3. We describe the experimental setup in Section 4, followed by Section 5, which provides extensive experimentation on three standard NIST speaker recognition evaluation (SRE) corpora (2001, 2008 and 2010) and the recently released RSR2015. Section 6 summarizes our findings. Finally, for reproducibility and to spark further research in this direction, we provide an open-source implementation of the proposed method.1

Section snippets

Local variability features: motivation

In speaker recognition, the total variation in feature space is captured by the covariances computed over all the features. But this neglects the variations of the features for a short time duration during the articulation of various speech segments. A previous study has suggested that these variations might be more related to the spoken text [37]. But as each individual has his or her own unique articulatory behavior even for the same spoken content, we argue that measuring that variation

Local spectral variation from short-term covariance

The short-term spectral features of a speech utterance for different frames can be viewed as a multivariate time-series where one spectral frame represents a “snapshot” of the speech production system. The variations in this multivariate data can be measured by computing its covariance matrix [39]. In conventional Gaussian mixture modeling (GMM) that underlies in both classic [25], [40] and modern recognizers [28], covariance of each mixture component represents spectral variability within the
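The idea of measuring local variability through a sliding-window covariance and its eigenstructure can be sketched as follows. This is a minimal illustration under assumed details: the function name, edge padding, and the exact eigenvalue weighting are our own choices, not necessarily the paper's precise recipe (note also that eigenvector signs are inherently ambiguous and would need a fixing convention in practice).

```python
import numpy as np

def local_variability(cepstra, context=4, n_eig=3):
    """For each frame, compute the sample covariance over a
    (2*context + 1)-frame window and keep the leading eigenvectors,
    each weighted by its normalized eigenvalue."""
    T, D = cepstra.shape
    # Repeat edge frames so every frame has a full context window.
    padded = np.pad(cepstra, ((context, context), (0, 0)), mode="edge")
    out = np.empty((T, n_eig * D))
    for t in range(T):
        window = padded[t : t + 2 * context + 1]   # local frames
        cov = np.cov(window, rowvar=False)          # D x D sample covariance
        vals, vecs = np.linalg.eigh(cov)            # ascending eigenvalues
        vals, vecs = vals[::-1], vecs[:, ::-1]      # sort descending
        w = vals[:n_eig] / max(vals.sum(), 1e-12)   # normalized eigenvalues
        out[t] = (vecs[:, :n_eig] * w).T.ravel()    # weighted eigenvectors
    return out

# Usage: 19-dim cepstra, 9-frame window -> 3 x 19 = 57-dim vectors per frame
mfcc = np.random.randn(100, 19)
lv = local_variability(mfcc)   # shape (100, 57)
```

Because only a few leading eigenvectors are retained, the resulting feature stream has the same frame rate as the input cepstra and a dimensionality chosen by the user, which makes it easy to fuse with standard short-term features.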

Database description

We evaluate speaker verification accuracy on three NIST corpora. First, we perform extensive experiments on NIST SRE 2001 to find the optimal parameter configuration. Then we apply it to the telephone sub-conditions of NIST SRE 2008 and 2010. We have selected the C6 sub-condition from NIST SRE 2008, containing all the telephone speech trials. From NIST

Results

We first study the newly proposed eigenstructure features for an arbitrarily chosen temporal window length (here 100 ms). Since, in our present setup, the speech frame size is 20 ms with 10 ms overlap, we consider nine frames (i.e., a context of four frames in each direction) for the covariance computation. In order to keep the dimensionality of the proposed feature the same as that of the baseline MFCCs (i.e., 57), only the first three eigenvectors corresponding to the highest eigenvalues are considered
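The frame-count and dimensionality arithmetic stated above can be sanity-checked directly (the 19-coefficient static dimension is an assumption consistent with a 57-dimensional MFCC + delta + double-delta baseline):

```python
frame_ms, shift_ms, window_ms = 20, 10, 100

# Frames whose 20 ms span fits inside the 100 ms analysis window:
n_frames = (window_ms - frame_ms) // shift_ms + 1
print(n_frames)        # 9: a center frame plus 4 frames of context each side

n_eig, cepstral_dim = 3, 19
proposed_dim = n_eig * cepstral_dim
print(proposed_dim)    # 57, matching the baseline MFCC+delta+double-delta dim
```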

Conclusion

Most speaker verification methods rely on speaker means, for instance, in the form of GMM supervectors or i-vectors, while the use of (co)variance features has been much less explored. To this end, the main intention of this study was to investigate feasibility of local covariance features for speaker characterization. We have proposed a new straightforward speech parameterization from short-term covariance matrix based on eigenstructure analysis. Similar to delta features, the proposed

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions which have greatly helped in improving the content of this paper. This work was funded from Academy of Finland (proj. nos. 253120 and 283256).

References (58)

  • F. Soong et al.

    On the use of instantaneous and transitional spectral information in speaker recognition

    IEEE Trans. Acoust. Speech Signal Process.

    (1988)
  • H. Hermansky

Mel cepstrum, deltas, double-deltas, … what else is new?

  • J. Pelecanos et al.

    Feature warping for robust speaker verification

  • T. Kinnunen et al.

    Low-variance multitaper MFCC features: a case study in robust speaker verification

    IEEE Trans. Audio Speech Lang. Process.

    (2012)
  • C. Hanilci et al.

    Regularized all-pole models for speaker verification under noisy environments

    IEEE Signal Process. Lett.

    (2012)
  • S. Chakroborty

    Some studies on acoustic feature extraction, feature selection and multi-level fusion strategies for robust text-independent speaker identification

    (2008)
  • L. Qi et al.

    An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions

    IEEE Trans. Audio Speech Lang. Process.

    (2011)
  • X. Zhao et al.

    CASA-based robust speaker identification

    IEEE Trans. Audio Speech Lang. Process.

    (2012)
  • S. Ganapathy et al.

    Feature normalization for speaker verification in room reverberation

  • M. Athineos et al.

    Autoregressive modeling of temporal envelopes

    IEEE Trans. Signal Process.

    (2007)
  • S. Ganapathy

    Signal analysis using autoregressive models of amplitude modulation

    (January 2012)
  • C. Kim et al.

    Power-normalized cepstral coefficients (PNCC) for robust speech recognition

  • M. McLaren et al.

    Improving speaker identification robustness to highly channel-degraded speech through multiple system fusion

  • S.H.R. Mallidi et al.

    Robust speaker recognition using spectro-temporal autoregressive models

  • S.O. Sadjadi et al.

    Mean Hilbert envelope coefficients (MHEC) for robust speaker recognition

  • J.P. Campbell

    Speaker recognition: a tutorial

    Proc. IEEE

    (1997)
  • R.D. Zilca

    Text-independent speaker verification using covariance modeling

    IEEE Signal Process. Lett.

    (2001)
  • D. Reynolds et al.

    Robust text-independent speaker identification using Gaussian mixture speaker models

    IEEE Trans. Speech Audio Process.

    (1995)
  • W. Campbell et al.

    SVM based speaker verification using a GMM supervector kernel and NAP variability compensation


Md Sahidullah received the Ph.D. degree in the area of speech processing from the Department of Electronics and Electrical Communication Engineering of the Indian Institute of Technology Kharagpur in 2015. Prior to that, he obtained the Bachelor of Engineering degree in Electronics and Communication Engineering from Vidyasagar University in 2004 and the Master of Engineering degree in Computer Science and Engineering (with specialization in Embedded Systems) from the West Bengal University of Technology in 2006. In 2007–2008, he was with Cognizant Technology Solutions India Pvt. Ltd. Since 2014, he has been working as a post-doctoral researcher in the School of Computing, University of Eastern Finland. His research interests include speaker recognition and voice activity detection.

Tomi Kinnunen received the Ph.D. degree in computer science from the University of Eastern Finland (UEF, formerly Univ. of Joensuu) in 2005. From 2005 to 2007, he was an associate scientist at the Institute for Infocomm Research (I2R) in Singapore. Since 2007, he has been with UEF. In 2010–2012, his research was funded by the Academy of Finland in a post-doctoral project focusing on speaker recognition. In 2014, he chaired Odyssey 2014: The Speaker and Language Recognition Workshop. He also serves as an associate editor for two journals, IEEE/ACM Transactions on Audio, Speech and Language Processing and Digital Signal Processing. His primary research interests are in the broad area of speaker and language recognition, where he has authored about 100 peer-reviewed scientific publications.
