Voice Recognition (with ML?), not Speech Recognition

I'm looking for sample code for voice recognition (not to be confused with speech recognition); that is, I need to build a model which can detect a certain person's voice.

I will probably end up trying to tweak the TensorFlow "Simple Audio Recognition" tutorial with my own data... is this the best course of action? Any other suggestions?

A lot depends on the specific scenario. How many training samples will you have? How many people do you intend to recognise? What is the signal-to-noise ratio? How much time would the system have to identify a person? How strict should it be?

Still, I can already tell you that starting with neural networks is a poor course of action, as you immediately forsake understanding of the domain. Troubleshooting a misbehaving neural network is far more cumbersome than with most other learning systems.

I would recommend building your own features rather than relying on an ANN from the start. I will assume for the moment that you're OK with Python (as the majority of TF users are); there are several audio feature-extraction modules available for it.

As one way to proceed, you could compute MFCCs with any of these modules and build a baseline system on them. Typically you compute 40 or more coefficients per window, and these can be visualised as spectrograms. The latter can be interpreted as images and, if you feel like it, you can apply deep learning to them (a popular choice).

Mind that "speaker recognition" is a whole field in biometric identification, and there is a plethora of papers discussing good approaches.

Speaker recognition has its own specifics compared to speech recognition. I would recommend that you start with some dedicated toolkits.

SPEAR is one such project, and it comes with ready-to-use examples.

There is also ALIZE, but it is a bit old and, in my view, more complicated to use.

HTK is speech recognition software, but it can be used for your task as well: htk-speaker-recognition. There is even a master's thesis published on this: Speaker Recognition System Using HTK.

I was building a simple speaker recognition system and indeed found that a very simple GMM-UBM model built with HTK gave the best results.
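To make the GMM-UBM idea concrete, here is a toy sketch using scikit-learn rather than HTK. The data, component counts, and the refit-instead-of-MAP-adaptation shortcut are all illustrative assumptions, not the setup described above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in feature frames (think MFCC vectors): a large pooled
# background set and a smaller set from the target speaker.
background = rng.normal(0.0, 1.0, size=(2000, 13))
target = rng.normal(1.5, 1.0, size=(500, 13))

# Universal Background Model: a GMM fit on the pooled background data.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(background)

# Speaker model: in a real GMM-UBM system this is MAP-adapted from the
# UBM; refitting on the target frames is a rough stand-in for a sketch.
spk = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
spk.fit(target)

def llr_score(frames):
    """Average log-likelihood ratio: speaker model vs. UBM."""
    return spk.score(frames) - ubm.score(frames)
```

A test utterance from the target speaker should then score higher than one drawn from the background population, and a threshold on `llr_score` gives an accept/reject decision.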

Update:

I completely forgot about SIDEKIT. It is a nice toolkit, the successor of ALIZE. I also have a working example for it: https://www.dropbox.com/sh/iwbog5oiqhi2wo3/AACnj1Uhazqb-LQY_ztX66PDa?dl=0

For a modern NN implementation that is relatively easy to use, you can try

https://github.com/mravanelli/SincNet

You can train it on the public VoxCeleb database to get the best separation.
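Whichever system produces the speaker representations (SincNet embeddings, i-vectors, or similar), verification usually comes down to comparing two embedding vectors against a threshold, commonly via cosine similarity. A small sketch, where the threshold value is an illustrative assumption to be tuned on held-out trials:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb1, emb2, threshold=0.7):
    # 0.7 is a placeholder; in practice the threshold is tuned on a
    # development set, e.g. to a target equal-error rate.
    return cosine_similarity(emb1, emb2) >= threshold
```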
