Learning Speaker Representations with Mutual Information


  • Mirco Ravanelli
  • Yoshua Bengio


Deep Learning, Speaker Recognition, Mutual Information, Unsupervised Learning, SincNet


Learning good representations is of crucial importance in deep learning. Mutual Information (MI) and similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high-dimensional spaces, some recent studies have shown that an implicit optimization of MI can be achieved with an encoder-discriminator architecture similar to that of Generative Adversarial Networks (GANs). In this work, we learn representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence. The proposed encoder relies on the SincNet architecture and transforms the raw speech waveform into a compact feature vector. The discriminator is fed with either positive samples (drawn from the joint distribution of encoded chunks) or negative samples (drawn from the product of the marginals) and is trained to separate them. We report experiments showing that this approach effectively learns useful speaker representations, leading to promising results on speaker identification and verification tasks. Our experiments consider both unsupervised and semi-supervised settings and compare the performance achieved with different objective functions.
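The sampling scheme described above can be sketched in a few lines. The snippet below is a minimal, self-contained illustration, not the paper's implementation: the encoder is a toy linear projection standing in for SincNet, and the discriminator is a simple logistic scorer. Positive pairs are two chunks drawn from the same sentence (a sample from the joint distribution of encoded chunks); negative pairs mix chunks from different sentences (approximating the product of the marginals). All names (`encode`, `discriminator_score`, `sampling_and_loss`) and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(chunk, W):
    # Toy stand-in for the SincNet encoder: a linear projection
    # followed by a tanh nonlinearity (for illustration only).
    return np.tanh(chunk @ W)

def discriminator_score(z1, z2, v):
    # Toy discriminator: logistic score on the concatenated pair.
    pair = np.concatenate([z1, z2])
    return 1.0 / (1.0 + np.exp(-pair @ v))

def sampling_and_loss(sentences, W, v):
    """Draw one positive and one negative pair and return the
    GAN-style binary cross-entropy the discriminator minimizes."""
    # Positive pair: two random chunks from the SAME sentence
    # (a sample from the joint distribution of encoded chunks).
    s = sentences[rng.integers(len(sentences))]
    i, j = rng.integers(len(s), size=2)
    z_pos1, z_pos2 = encode(s[i], W), encode(s[j], W)

    # Negative pair: one chunk from each of two DIFFERENT sentences
    # (approximating a sample from the product of the marginals).
    a, b = rng.choice(len(sentences), size=2, replace=False)
    z_neg1 = encode(sentences[a][rng.integers(len(sentences[a]))], W)
    z_neg2 = encode(sentences[b][rng.integers(len(sentences[b]))], W)

    p_pos = discriminator_score(z_pos1, z_pos2, v)
    p_neg = discriminator_score(z_neg1, z_neg2, v)
    # Minimizing this loss while the encoder is trained adversarially
    # implicitly maximizes a lower bound on the mutual information
    # between the two encoded chunks.
    return -(np.log(p_pos + 1e-12) + np.log(1.0 - p_neg + 1e-12))

# Tiny synthetic "dataset": 3 sentences, each a list of raw chunks.
sentences = [[rng.standard_normal(40) for _ in range(5)] for _ in range(3)]
W = 0.1 * rng.standard_normal((40, 8))   # toy encoder weights
v = 0.1 * rng.standard_normal(16)        # toy discriminator weights
loss = sampling_and_loss(sentences, W, v)
```

In a full training loop, the discriminator parameters would be updated to decrease this loss while the encoder is updated so that same-sentence pairs remain easy to tell apart from cross-sentence pairs, which pushes the encoded chunks to share speaker information.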

