简体   繁体   中英

define pronunciation starting time for each word in script

I have a text script that is used to create podcasts. So the words in podcast audio are exactly the same as in my text. Now what I want to have is the following:

Word in text | Pronounciation started at
Hello          0:0:0.000
my             0:0:1.125
friends        0:0:2.750

Is that possible to do at all? Thanks in advance!

One of the key words you could start with to approach the complexity of the problem is "forced alignment". This site also covers questions regarding this topic eg here which leads you to questions and answers concerning HTK (the Hidden Markov Model Toolkit) via the releated threads.

You can find a more hands-on style description of how to use forced alignment in automated audio segmentation here .

So the answer is: yes, it is possible, but it is algorithmically very complex and even in its best implementations it is not error-free.

PS.: I found you a really simple tool

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM