
Voice Recognition (with ML?), not Speech Recognition

I'm looking for sample code for voice recognition (not to be confused with speech recognition); that is, I need to build a model which can detect a certain person's voice.

I will probably end up trying to tweak the TensorFlow "Simple Audio Recognition" tutorial with my own data... is this the best course of action? Any other suggestions?

A lot depends on the specific scenario. How many training samples will you have? How many people do you intend to recognise? What's the signal-to-noise ratio? How much time would the system have to identify a person? How strict should it be?

Still, I can already tell you that starting with neural networks is a poor course of action, as you immediately forsake understanding of the domain. Troubleshooting a misbehaving neural network is far more cumbersome than with the majority of other learning systems.

I would recommend building your own features rather than relying on an ANN from the start. I will assume for the moment that you're OK with Python (like the majority of TF users) and suggest modules like:

As one way to go, you could compute MFCCs with any of the three and build a baseline system on them. Typically you compute 40 or more coefficients per window, and these can be visualised as spectrograms. The latter can be interpreted as images and, if you feel like it, you can apply deep learning to them (it's a popular choice).
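To make the MFCC step concrete, here is a minimal from-scratch sketch using only NumPy and SciPy (the dedicated modules mentioned above wrap this up for you; the frame sizes and a synthetic sine tone standing in for real speech are assumptions for illustration):

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    # Slice the signal into overlapping, Hamming-windowed frames.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)

    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular filterbank spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log mel energies, then a DCT to decorrelate them -> MFCCs.
    feats = np.log(power @ fbank.T + 1e-10)
    return dct(feats, type=2, axis=1, norm="ortho")[:, :n_ceps]

# One second of a 440 Hz tone at 16 kHz stands in for real speech here.
sr = 16000
t = np.arange(sr) / sr
coeffs = mfcc(np.sin(2 * np.pi * 440 * t), sr)
print(coeffs.shape)  # → (97, 13): one row of coefficients per 10 ms frame
```

Stacking such frames over time is exactly what gives you the spectrogram-like image that deep models are then trained on.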

Mind that "speaker recognition" is a whole field in biometric identification, and there is a plethora of papers that discuss good approaches.

Speaker recognition has its own specifics compared to speech recognition. I would recommend you start with some dedicated toolkits.

SPEAR is one such project, supplied with ready-to-use examples.

There is also ALIZE, but it is a bit old and, from my point of view, more complicated to use.

HTK is speech recognition software, but it can be used for your task as well: htk-speaker-recognition. There is even a master's thesis published on this: Speaker Recognition System Using HTK.

I was building a simple speaker recognition system and found that a very simple GMM-UBM model built with HTK indeed gave the best results.
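The GMM-UBM idea can be sketched with scikit-learn on synthetic features (an assumption for illustration, not the HTK setup above; also note a full system would MAP-adapt the UBM means to the target speaker, whereas this sketch simply retrains):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic "MFCC" frames: pooled background speakers and one target speaker.
background = rng.normal(0.0, 1.0, size=(2000, 13))
target_train = rng.normal(2.0, 1.0, size=(500, 13))

# Universal Background Model: a GMM over everyone's speech.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(background)

# Target-speaker model (a real GMM-UBM system would MAP-adapt the UBM here).
spk = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
spk.fit(target_train)

def llr_score(frames):
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    return spk.score(frames) - ubm.score(frames)

# Accept if the utterance fits the speaker model better than the background.
target_test = rng.normal(2.0, 1.0, size=(200, 13))
impostor_test = rng.normal(0.0, 1.0, size=(200, 13))
print(llr_score(target_test) > llr_score(impostor_test))  # → True
```

Thresholding the log-likelihood ratio is then what trades off false accepts against false rejects, which is where the "how strict should it be?" question from above comes in.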

Update:

I completely forgot about SIDEKIT. It is a cool toolkit, a successor to ALIZE. I also have a working example for it: https://www.dropbox.com/sh/iwbog5oiqhi2wo3/AACnj1Uhazqb-LQY_ztX66PDa?dl=0

For a modern NN implementation which is relatively easy to use, you can try

https://github.com/mravanelli/SincNet

You can train it on the public VoxCeleb database to get the best separation.
