Efficiently generating time index of pre-transcribed speech using its audio source and open source tools

On TED.com they have transcriptions, and clicking a part of the transcription jumps to the appropriate section of the video.

I want to do this for the 80 hours of audio and transcriptions I have, on Linux with OSS.

This is the approach I'm thinking of:

  1. Start small with a 30 minute sample
  2. Split the audio up into 2-minute WAV chunks, even if it breaks words up (see the sketch after this list)
  3. Run the phrase spotter from CMU Sphinx's long-audio-aligner on each chunk, with the transcript
  4. Take the time index for identified words/phrases found in each chunk and calculate the actual estimated time of the ngrams in the original audio file.
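
Steps 2 and 4 can be prototyped with a short script. This is only a minimal sketch, assuming ffmpeg is installed; the source filename and chunk naming pattern are placeholders, and the aligner itself still has to be run on each chunk separately.

```python
import subprocess

CHUNK_SECONDS = 120  # 2-minute chunks, as in step 2

def split_into_chunks(src_wav, out_pattern="chunk_%03d.wav"):
    """Split src_wav into fixed-length WAV chunks using ffmpeg's segment muxer."""
    subprocess.run(
        ["ffmpeg", "-i", src_wav,
         "-f", "segment", "-segment_time", str(CHUNK_SECONDS),
         "-c", "copy", out_pattern],
        check=True,
    )

def original_time(chunk_index, time_in_chunk):
    """Map a timestamp reported inside chunk N back to the original recording (step 4)."""
    return chunk_index * CHUNK_SECONDS + time_in_chunk
```

The offset arithmetic in original_time only holds if every chunk (except the last) is cut at exactly 120-second boundaries, which stream-copied PCM WAV allows.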

Does this seem like an efficient approach? Has anyone actually done this?

Are there alternate approaches worth trying, like dumb word counting, that may be accurate enough?

You can just feed all your audio and text into a long audio aligner and it will give you the timestamps of the words. Using these timestamps you can jump to the specific word in a file.

I'm not sure why you would want to split your audio or do anything else.
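
To illustrate how those word timestamps would be used on the playback side, here is a minimal sketch. The tab-separated alignment file format is an assumption for illustration; the real output of CMU Sphinx's long-audio-aligner will need its own parser.

```python
def load_alignment(path):
    """Return (word, start_seconds) pairs in transcript order.

    Assumes a hypothetical 'word<TAB>start_seconds' line format; adapt the
    parsing to whatever the aligner actually emits.
    """
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, start = line.rstrip("\n").split("\t")
            entries.append((word, float(start)))
    return entries

def seek_time(entries, word_index):
    """Seconds to seek the player to when the user clicks the Nth transcript word."""
    return entries[word_index][1]
```

On the page, each transcript word can then carry its index, and a click handler seeks the audio element to seek_time(entries, index).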
