
How to determine the length of the observation sequence for an HMM in speech recognition

I'm re-learning how to use Hidden Markov Models for speech recognition and I have a question. It seems that most, if not all, discussions of HMMs consider the case of a known observation sequence [O1, O2, O3, ..., OT], where T is a known number. However, if we were to use a trained HMM on speech in real time, or on a WAV file in which someone speaks one sentence after another, how exactly does one select the value of T? In other words, how does one know when the speaker has ended one sentence and started another? Does a practical HMM for speech recognition just use a fixed value for T and periodically recompute the optimal state sequence up to the current observation, using a fixed-size window of length T into the past? Or is there some better way to select T dynamically at any instant in time?

Does a practical HMM for speech recognition just use a fixed value for T and periodically recompute the optimal state sequence up to the current observation, using a fixed-size window of length T into the past?

The Viterbi decoding algorithm works frame by frame, so you simply iterate over the frames. You can keep iterating indefinitely, until the backtracking matrix fills all available memory.
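To make the frame-by-frame point concrete, here is a minimal sketch of a frame-synchronous Viterbi decoder in plain Python. It is an illustration, not how any particular recognizer is implemented. The class name `StreamingViterbi`, the two-state silence/speech model, and all probabilities are assumptions made up for the example. Note that T never appears as a parameter: the back-pointer list simply grows by one entry per frame.

```python
import math


class StreamingViterbi:
    """Frame-synchronous Viterbi decoder (illustrative sketch, hypothetical API)."""

    def __init__(self, log_init, log_trans):
        self.log_init = log_init    # log_init[j]: log P(first state is j)
        self.log_trans = log_trans  # log_trans[i][j]: log P(next state j | state i)
        self.scores = None          # best log score per state at the current frame
        self.backptrs = []          # one list of predecessors per frame; grows without bound

    def step(self, log_obs):
        """Consume one frame; log_obs[j] = log P(observation | state j)."""
        n = len(self.log_init)
        if self.scores is None:
            # First frame: no predecessors yet.
            self.scores = [self.log_init[j] + log_obs[j] for j in range(n)]
            self.backptrs.append([-1] * n)
            return
        new_scores, ptrs = [], []
        for j in range(n):
            # Best predecessor for state j at this frame.
            best_i = max(range(n), key=lambda i: self.scores[i] + self.log_trans[i][j])
            new_scores.append(self.scores[best_i] + self.log_trans[best_i][j] + log_obs[j])
            ptrs.append(best_i)
        self.scores = new_scores
        self.backptrs.append(ptrs)

    def backtrace(self):
        """Best state sequence over all frames consumed so far."""
        state = max(range(len(self.scores)), key=lambda j: self.scores[j])
        path = [state]
        for ptrs in reversed(self.backptrs[1:]):
            state = ptrs[state]
            path.append(state)
        path.reverse()
        return path


# Toy model (assumed values): state 0 = silence, state 1 = speech.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.8), math.log(0.2)],
             [math.log(0.2), math.log(0.8)]]
silence_like = [math.log(0.9), math.log(0.1)]  # frame that looks like silence
speech_like = [math.log(0.1), math.log(0.9)]   # frame that looks like speech

decoder = StreamingViterbi(log_init, log_trans)
for frame in [silence_like, silence_like, speech_like, speech_like, speech_like]:
    decoder.step(frame)  # could go on for as long as frames keep arriving
print(decoder.backtrace())  # prints [0, 0, 1, 1, 1]
```

Because `backtrace()` can be called after any frame, the decoder can emit a partial result at any time. The only cost of never stopping is the ever-growing `backptrs` list, which is the memory limit mentioned above.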

The training algorithm works on audio recordings that are prepared before training, usually 1-30 seconds long. For training, the audio length is therefore already known.

How does one know when the speaker has ended one sentence and started another?

There are different strategies here. Decoders search for silence to decide where to wrap up decoding. Silence doesn't necessarily mean a break between sentences: there could be no break between sentences at all, and there could be a break in the middle of a sentence too.

So, to find silence, the decoder can use a standalone voice activity detection (VAD) algorithm and break when the VAD detects silence, or the decoder can analyze the backtrace information to decide whether silence has appeared. The second method is a bit more reliable.
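The two approaches can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the endpointing logic of any real decoder. The function names, the energy threshold, and the minimum silence length are all hypothetical. The VAD here is a bare energy threshold; production VADs are considerably more elaborate.

```python
def frame_energy(samples):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def vad_endpoint(frames, energy_threshold, min_silence_frames):
    """Method 1: standalone energy-based VAD (assumed, very simplified).

    Returns the index of the first frame of the first silence run that
    follows some speech and lasts at least min_silence_frames, else None.
    """
    seen_speech = False
    run = 0  # length of the current run of low-energy frames
    for t, frame in enumerate(frames):
        if frame_energy(frame) < energy_threshold:
            run += 1
            if seen_speech and run >= min_silence_frames:
                return t - run + 1
        else:
            seen_speech = True
            run = 0
    return None


def backtrace_endpoint(path, silence_state, min_silence_frames):
    """Method 2: inspect the decoder's current best state path.

    Returns the frame index where trailing silence begins if the path
    contains speech and ends in enough silence frames, else None.
    """
    run = 0
    for state in reversed(path):
        if state != silence_state:
            break
        run += 1
    has_speech = run < len(path)
    if has_speech and run >= min_silence_frames:
        return len(path) - run
    return None


# Toy input: 2 silent frames, 3 loud frames, 4 near-silent frames.
frames = [[0.0, 0.0]] * 2 + [[0.5, -0.5]] * 3 + [[0.01, 0.0]] * 4
print(vad_endpoint(frames, energy_threshold=0.01, min_silence_frames=3))
print(backtrace_endpoint([0, 1, 1, 1, 0, 0, 0], silence_state=0, min_silence_frames=3))
```

In both cases the detected endpoint is where the decoder would stop, emit the result for the segment, and start over. That is how T ends up being chosen dynamically. The backtrace variant relies on the acoustic model's own notion of silence rather than raw energy, which is one plausible reason it is the more reliable of the two.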
