
Does Kaldi return any recognition confidence parameter, similar to Google Speech-To-Text API?

I am working on a speech recognition task. So far, I have been using the Google Cloud Speech Recognition API (in Python) with good results. The API returns a confidence value along with every chunk of transcribed text. According to the docs, the confidence is a number between 0 and 1, but I did not find any deeper explanation of how Google's API derives it, so I assume it somehow comes from the neural network that does the recognition.

The next step I want to take is to build my own (offline) automatic speech recognition program, and pyKaldi looks like it should be up to the task. I have not started programming yet, but I want to know beforehand (for research purposes): can Kaldi return a similar confidence value, as the Google Speech-to-Text API does? And what really is this "confidence", and how is it computed?

Yes, pyKaldi supports confidence values (word confidence scores), calculated with Minimum Bayes Risk (MBR) decoding. You will find all the necessary information in the documentation. Here is the link to the description of the module:

https://pykaldi.github.io/api/kaldi.lat.html?highlight=mbr#module-kaldi.lat.sausages
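To give a rough idea of what such a word confidence is: as I understand it, MBR decoding turns the lattice into a "sausage" (a sequence of word slots, each with competing candidates and their posterior probabilities), and the confidence of an output word is the posterior of the winning candidate in its slot. The following is a minimal pure-Python sketch of that last read-out step only. It does not call pyKaldi; the function name and the example numbers are made up for illustration, and the real decoder additionally refines the alignment by minimizing expected edit distance.

```python
from typing import List, Tuple

EPSILON = 0  # Kaldi reserves word id 0 for "no word" (epsilon)

# One inner list per word slot; each entry is (word_id, posterior).
SausageStats = List[List[Tuple[int, float]]]


def one_best_with_confidence(sausage: SausageStats) -> List[Tuple[int, float]]:
    """Pick the most probable candidate in each slot.

    Returns (word_id, confidence) pairs, where the confidence is simply
    the posterior of the chosen word. Slots won by epsilon are skipped,
    because they correspond to "no word here".
    """
    result = []
    for slot in sausage:
        word_id, posterior = max(slot, key=lambda pair: pair[1])
        if word_id != EPSILON:
            result.append((word_id, posterior))
    return result


# Hypothetical sausage with three slots (word ids and posteriors invented).
example_sausage = [
    [(17, 0.92), (42, 0.05), (0, 0.03)],  # clear winner: word 17
    [(0, 0.60), (8, 0.40)],               # epsilon wins: no word emitted
    [(23, 0.55), (31, 0.45)],             # close call between 23 and 31
]

for word_id, confidence in one_best_with_confidence(example_sausage):
    print(f"word {word_id}: confidence {confidence:.2f}")
```

The third slot shows why the number needs care: word 23 is emitted with confidence 0.55, but its competitor is nearly as likely.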

As the name says, it is a confidence value, but it does not express how "probable" it is that the text output for a word, derived (or given, in a probabilistic setting) from a sequence of audio chunks, is correct. In my opinion, its meaning is a bit fuzzy and depends on the quality of the model and the training data (noise, reverb, etc.). It is meaningful when comparing alternatives: the one with the higher value is more likely to be the correct one. This in turn raises the question of how large a gap counts as a significant difference. A single confidence value on its own does not tell you anything, nor can you compare two different recognizer models on the basis of their confidence values alone. Microsoft puts it this way: "Instead, confidence scores provide a mechanism for comparing the relative accuracy of multiple recognition alternates for a given input. This facilitates returning the most accurate recognition result."
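A small sketch of using confidence this way, i.e. only to rank alternatives produced by the same recognizer for the same input, and to report the gap to the runner-up. The function name, the example hypotheses, and their scores are hypothetical; what gap counts as "significant" is left open, as discussed above.

```python
from typing import List, Optional, Tuple


def pick_best(alternatives: List[Tuple[str, float]]) -> Tuple[str, Optional[float]]:
    """Return the highest-confidence transcript and its margin over the runner-up.

    `alternatives` holds (transcript, confidence) pairs from ONE recognizer
    for ONE utterance; scores from different models are not comparable.
    The margin is None when there is nothing to compare against.
    """
    if not alternatives:
        raise ValueError("need at least one alternative")
    ranked = sorted(alternatives, key=lambda alt: alt[1], reverse=True)
    best_text, best_conf = ranked[0]
    margin = best_conf - ranked[1][1] if len(ranked) > 1 else None
    return best_text, margin


# Hypothetical n-best list for a single utterance.
best, margin = pick_best([
    ("wreck a nice beach", 0.78),
    ("recognize speech", 0.81),
])
print(best, margin)
```

Here the winner leads by only about 0.03, which is exactly the situation where the ranking is usable but the absolute value says little.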


 