How to detect filler sounds like um, uh, etc. using CMUSphinx/Mozilla DeepSpeech/Google STT, etc.?
I am working on a speech recognition project, and the task is to detect filler sounds like um, uh, eh, etc. in audio clips of children/students speaking English. Their spoken English is not that great.
How can this be done using CMUSphinx/Mozilla DeepSpeech/Google Cloud Speech/Kaldi? Or do I need to start from scratch?
I also went through other posts and papers on how to build an ASR system, but since this is not a long-term project, I do not have the time to build one from scratch and evaluate the results. I am also okay with lower accuracy for now, which I can improve later on.
Have you tried just adding the filler words to your lexicon? E.g., the CMU pronunciation dictionary includes these words as entries in its published lexicon ( LINK TO COMPLETE DICTIONARY )
For example, the CMU pronunciation dictionary has the following entries that correspond to filler sounds:
AH AA1
HM HH AH0 M
HMM HH AH0 M
UH AH1
UHH AH1
UM AH1 M
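
Once the recognizer emits these words, flagging them is mostly a lookup. Below is a minimal sketch in plain Python, assuming your decoder can give you word-level output as `(word, start_sec, end_sec)` tuples. The `hypothesis` list and the helper names `parse_lexicon` and `find_fillers` are made up for illustration and are not part of any toolkit's API.

```python
# Filler entries copied from the list above (format: WORD PHONE PHONE ...).
FILLER_ENTRIES = """\
AH AA1
HM HH AH0 M
HMM HH AH0 M
UH AH1
UHH AH1
UM AH1 M
"""


def parse_lexicon(text):
    """Map each lowercased word to its list of phones."""
    lexicon = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:  # skip blank lines
            lexicon[parts[0].lower()] = parts[1:]
    return lexicon


def find_fillers(words, lexicon):
    """Return the (word, start, end) tuples whose word is a filler entry."""
    return [w for w in words if w[0].lower() in lexicon]


# Hypothetical recognizer output; real output depends on the toolkit you use.
hypothesis = [
    ("i", 0.0, 0.2),
    ("um", 0.3, 0.7),
    ("like", 0.8, 1.0),
    ("uh", 1.1, 1.4),
    ("apples", 1.5, 2.0),
]

lexicon = parse_lexicon(FILLER_ENTRIES)
fillers = find_fillers(hypothesis, lexicon)

for word, start, end in fillers:
    print(f"{word}: {start:.1f}s to {end:.1f}s")
```

Note that this only helps if the fillers are also covered by the language model or grammar the decoder uses; otherwise the decoder may never hypothesize them even though they are in the lexicon.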