
How to delay synthesis in meSpeak.js and display words while the wav plays

I'm working on editing meSpeak.js to help out a friend with visual tracking problems.

I've been looking through meSpeak.js ( http://www.masswerk.at/mespeak/ ) and trying to figure out how to grab each word as it is spoken, and then display it on the screen while the wav file is playing.

I'm thinking this has to do with returning the data as an array, and then displaying the array as the wav plays. I'm not even sure this is possible (or what the raw data looks like).

Here's what I have

<div id="display">
    <span>Here.</span>
</div>

<script type="text/javascript">
var timeoutID
var texttosend = prompt('Text to Split');
var res = texttosend.split(" ")
var arrayLength = res.length;
function refresh(word) {
    meSpeak.speak(res[i], {speed: 100});
    console.log(res[i]);
    $( "#display span" ).text(word);
    };

console.log('here');
for (var i = 0; i <= arrayLength; i++) {
        timoutID = window.setTimeout(refresh(res[i]), 50000+(i*50000));
};
</script>

There are two problems here, and I think they are both related to the delay. No matter what I set the timeout delay to, the text is synthesized all at once and the only word displayed is the last one. I've tried variations of setTimeout and I've tried jQuery's delay. Any ideas on how to fix this? The console.log outputs each word separately, so I know splitting the text into an array works and the loop works; I think it's now just a timing issue.

Sorry if this doesn't make a ton of sense - I guess some clarity would help me start to dismantle this problem.

Background: meSpeak.js sends the input text to the embedded eSpeak with options to render a wav file. This wav file is then played back using either the Web Audio API or an Audio element. Therefore there is no way to tell which part of a continuous utterance is currently playing (since we do not know at which point of the audio stream a single word starts or ends). But, on the other hand, there is something we do know, namely when the playback of the audio stream has finished. Maybe we can use that?

To provide a solution for this problem, meSpeak.speak() takes a callback function as an optional 3rd argument, which will be called after the playback of the utterance has finished. (See the JS-rap demo, http://www.masswerk.at/mespeak/rap/ , for a complex example.) Please mind that if you do this with single words, each word loses its context within the sentence, so you lose any melodic modulation of the utterance. There will also be a noticeable delay between words.

Example:

function speakWords(txt) {
  var words = txt.split(/\s+/);

  function speakNext() {
    if (words.length) {
      // take the next word off the queue
      var word = words.shift();
      console.log('speaking: ' + word);
      // speak it; speakNext is passed as the callback and will be
      // invoked again once playback of this word has finished
      meSpeak.speak(word, {}, speakNext);
    }
    else {
      console.log('done.');
    }
  }

  speakNext();
}

Here, the inner function "speakNext()" shifts the next word from the queue, logs it, and calls meSpeak.speak() with itself as the callback (3rd argument). So, when the audio has finished, "speakNext()" is called again to process the next word. Once the queue is empty, we finally hit the else-clause. (You would probably want to replace the simple logging with a more sophisticated display.)
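For instance, here is a minimal sketch of how you could hook your display update into that callback loop. It assumes the #display span element and jQuery from your snippet; the wrapper function speakAndShow() is just a hypothetical name for illustration:

function speakAndShow(txt) {
  var words = txt.split(/\s+/);

  function speakNext() {
    if (words.length) {
      var word = words.shift();
      // show the word just before it is spoken
      $("#display span").text(word);
      // speak it and come back here once playback has finished
      meSpeak.speak(word, {speed: 100}, speakNext);
    }
    else {
      // clear the display when everything has been spoken
      $("#display span").text("");
    }
  }

  speakNext();
}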

In a further step of optimization you could first render the partial streams (using the option "rawdata") and then play them back (using meSpeak.play()), like:

function speakWords2(txt) {
  var i, words, streams = [];

  function playNext() {
    if (i < streams.length) {
      console.log('speaking: ' + words[i]);
      // play the pre-rendered stream; playNext is the callback,
      // called again once playback of this word has finished
      meSpeak.play(streams[i], 1, playNext);
      i++;
    }
    else {
      console.log('done.');
    }
  }

  // split utterance and pre-render single words to stream-data
  words = txt.split(/\s+/);
  for (i=0; i < words.length; i++)
      streams.push( meSpeak.speak(words[i], {rawdata: true}) );
  // now play the partial streams (words) in a callback-loop
  i=0;
  playNext();
}

This way, the delay caused by rendering the audio streams occurs in a single block when the function is called, and the pre-rendered audio streams (one for each word) are then played back without any further load in the background. On the downside, this increases the memory footprint of your application, since all the high-res audio streams for the individual words are held in the array "streams" at once.
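Whichever variant you use, remember that meSpeak must have loaded its config and a voice before any call to meSpeak.speak(). A typical setup could look like the sketch below; the file paths are the defaults shipped with the meSpeak distribution and may differ in your setup:

// load the engine configuration and an English voice first
meSpeak.loadConfig("mespeak_config.json");
meSpeak.loadVoice("voices/en/en.json", function(success, message) {
  if (success) {
    // the voice is ready, start the word-by-word playback
    speakWords2("Here is the text to speak word by word.");
  }
  else {
    // message contains the reason for the failure (e.g. "network error")
    console.log("could not load voice: " + message);
  }
});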
