I have a text script that is used to create podcasts, so the words in the podcast audio are exactly the same as in my text. What I would like to produce is the following:
Word in text | Pronunciation started at
Hello | 0:0:0.000
my | 0:0:1.125
friends | 0:0:2.750
Is that possible to do at all? Thanks in advance!
A good keyword to start with when approaching this problem is "forced alignment". This site also covers questions on the topic, e.g. here, which leads you to questions and answers about HTK (the Hidden Markov Model Toolkit) via the related threads.
You can find a more hands-on description of how to use forced alignment in automated audio segmentation here.
So the answer is: yes, it is possible, but it is algorithmically very complex and even in its best implementations it is not error-free.
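To give a feel for what forced alignment does, here is a toy sketch in Python. It is not how HTK or any real aligner is implemented: real tools score short audio frames against phoneme-level acoustic models, whereas this sketch assumes you already have a made-up table `frame_scores[t][w]` saying how well frame `t` matches word `w`. The function names `force_align` and `format_timestamp`, the score values, and the 0.5 s frame length are all hypothetical. The core idea it illustrates is real, though: because the word order is known from the script, the aligner only has to find the best monotonic path (stay on the current word or advance to the next one) through the frames.

```python
import math


def force_align(frame_scores, frame_seconds):
    """Toy forced alignment by dynamic programming (Viterbi-style).

    frame_scores[t][w] is an assumed log-score that audio frame t belongs
    to word w of the known transcript. Words must appear in order and each
    word must cover at least one frame. Returns each word's start time in
    seconds.
    """
    n_frames = len(frame_scores)
    n_words = len(frame_scores[0])
    if n_frames < n_words:
        raise ValueError("need at least one frame per word")

    neg_inf = -math.inf
    best = [[neg_inf] * n_words for _ in range(n_frames)]
    # advanced[t][w] is True if the best path enters word w exactly at frame t.
    advanced = [[False] * n_words for _ in range(n_frames)]
    best[0][0] = frame_scores[0][0]  # the path must start in the first word

    for t in range(1, n_frames):
        for w in range(n_words):
            stay = best[t - 1][w]
            advance = best[t - 1][w - 1] if w > 0 else neg_inf
            if advance > stay:
                best[t][w] = advance + frame_scores[t][w]
                advanced[t][w] = True
            else:
                best[t][w] = stay + frame_scores[t][w]

    # Backtrack from the last frame of the last word to recover word starts.
    start_frames = [0] * n_words
    w = n_words - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][w]:
            start_frames[w] = t
            w -= 1
    return [frame * frame_seconds for frame in start_frames]


def format_timestamp(seconds):
    """Format seconds as h:m:s.mmm, matching the layout in the question."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours}:{minutes}:{secs}.{ms:03d}"


# Made-up scores: 6 frames of 0.5 s each, 0.0 = good match, -5.0 = poor match.
words = ["Hello", "my", "friends"]
frame_scores = [
    [0.0, -5.0, -5.0],
    [0.0, -5.0, -5.0],
    [-5.0, 0.0, -5.0],
    [-5.0, 0.0, -5.0],
    [-5.0, -5.0, 0.0],
    [-5.0, -5.0, 0.0],
]
for word, start in zip(words, force_align(frame_scores, 0.5)):
    print(word, "|", format_timestamp(start))
```

In practice the hard part is producing good frame scores from real audio, which is what the acoustic models in a dedicated aligner are for, and it is also where the alignment errors come from.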
PS.: I found you a really simple tool