Streaming audio from a mic to the IBM Watson SpeechToText Web service using the Java SDK

Question

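I am trying to send a continuous audio stream from a microphone directly to the IBM Watson SpeechToText Web service using the Java SDK. One of the examples provided with the distribution (RecognizeUsingWebSocketsExample) shows how to stream a file in .WAV format to the service. However, .WAV files require their length to be specified ahead of time, so the naive approach of appending to the file one buffer at a time is not feasible.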
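
To illustrate why a WAV file cannot simply grow, here is a minimal sketch of the canonical 44-byte PCM WAV header. The class and method names (WavHeaderSketch, build, dataLength) are made up for illustration and are not part of the Watson SDK. Two fields in the header, the RIFF chunk size and the data chunk size, both depend on the total amount of audio, which is unknown while the microphone is still recording.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class WavHeaderSketch {

    /** Builds the canonical 44-byte PCM WAV header for dataLength bytes of audio. */
    static byte[] build(int sampleRate, int bitsPerSample, int channels, int dataLength) {
        int blockAlign = channels * bitsPerSample / 8;
        ByteBuffer b = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        b.putInt(36 + dataLength);            // RIFF chunk size: depends on the audio length
        b.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        b.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        b.putInt(16);                         // size of the fmt chunk for plain PCM
        b.putShort((short) 1);                // audio format 1 = PCM
        b.putShort((short) channels);
        b.putInt(sampleRate);
        b.putInt(sampleRate * blockAlign);    // byte rate
        b.putShort((short) blockAlign);
        b.putShort((short) bitsPerSample);
        b.put("data".getBytes(StandardCharsets.US_ASCII));
        b.putInt(dataLength);                 // data chunk size: also depends on the audio length
        return b.array();
    }

    /** Reads back the data chunk size stored at byte offset 40. */
    static int dataLength(byte[] header) {
        return ByteBuffer.wrap(header, 40, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }
}
```

Appending another buffer to such a file would leave both size fields stale unless the header is rewritten each time.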

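SpeechToText.recognizeUsingWebSocket appears to accept a stream, but passing it an instance of AudioInputStream does not seem to work: the connection appears to be established, yet no transcripts are returned, even with RecognizeOptions.interimResults(true).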

public class RecognizeUsingWebSocketsExample {
  private static CountDownLatch lock = new CountDownLatch(1);

  public static void main(String[] args) throws FileNotFoundException, InterruptedException {
    SpeechToText service = new SpeechToText();
    service.setUsernameAndPassword("<username>", "<password>");

    AudioInputStream audio = null;

    try {
      final AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
      DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
      TargetDataLine line;
      line = (TargetDataLine) AudioSystem.getLine(info);
      line.open(format);
      line.start();
      audio = new AudioInputStream(line);
    } catch (LineUnavailableException e) {
      e.printStackTrace();
    }

    RecognizeOptions options = new RecognizeOptions.Builder()
      .continuous(true)
      .interimResults(true)
      .contentType(HttpMediaType.AUDIO_WAV)
      .build();

    service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
      @Override
      public void onTranscription(SpeechResults speechResults) {
        System.out.println(speechResults);
        if (speechResults.isFinal())
          lock.countDown();
      }
    });

    lock.await(1, TimeUnit.MINUTES);
  }
}

Any help would be greatly appreciated.

-rg

Here is an update based on German's comment (thank you for that).

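I was able to use javaFlacEncode to convert the WAV stream arriving from the mic into a FLAC stream and save it to a temporary file. Unlike a WAV audio file, whose size is fixed at creation, a FLAC file can easily be appended to.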

    WAV_audioInputStream = new AudioInputStream(line);
    FileInputStream FLAC_audioInputStream = new FileInputStream(tempFile);

    StreamConfiguration streamConfiguration = new StreamConfiguration();
    streamConfiguration.setSampleRate(16000);
    streamConfiguration.setBitsPerSample(8);
    streamConfiguration.setChannelCount(1);

    flacEncoder = new FLACEncoder();
    flacOutputStream = new FLACFileOutputStream(tempFile);  // write to temp disk file

    flacEncoder.setStreamConfiguration(streamConfiguration);
    flacEncoder.setOutputStream(flacOutputStream);

    flacEncoder.openFLACStream();

    ...
    // convert data
    int frameLength = 16000;
    int[] intBuffer = new int[frameLength];
    byte[] byteBuffer = new byte[frameLength];

    while (true) {
        int count = WAV_audioInputStream.read(byteBuffer, 0, frameLength);
        for (int j1=0;j1<count;j1++)
            intBuffer[j1] = byteBuffer[j1];

        flacEncoder.addSamples(intBuffer, count);
        flacEncoder.encodeSamples(count, false);  // 'false' means non-final frame
    }

    flacEncoder.encodeSamples(flacEncoder.samplesAvailableToEncode(), true);  // final frame
    WAV_audioInputStream.close();
    flacOutputStream.close();
    FLAC_audioInputStream.close();
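
A side note on the conversion loop above: it copies one byte into one int sample, which matches an 8-bit capture. If the line were opened with 16-bit samples instead, each sample would span two bytes. The following is a minimal sketch of that conversion, assuming signed little-endian PCM (as in new AudioFormat(16000, 16, 1, true, false)). The class name Pcm16Sketch and the method toSamples are hypothetical.

```java
// Hypothetical helper, for illustration only.
class Pcm16Sketch {

    /**
     * Converts the first byteCount bytes of signed 16-bit little-endian PCM
     * into one int per sample. A trailing odd byte, if any, is ignored.
     */
    static int[] toSamples(byte[] pcm, int byteCount) {
        int sampleCount = byteCount / 2;
        int[] samples = new int[sampleCount];
        for (int i = 0; i < sampleCount; i++) {
            int low = pcm[2 * i] & 0xFF;   // low byte, treated as unsigned
            int high = pcm[2 * i + 1];     // high byte, keeps its sign
            samples[i] = (high << 8) | low;
        }
        return samples;
    }
}
```

The resulting array holds half as many entries as bytes read, so the sample count handed to the encoder would be byteCount / 2 rather than byteCount.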

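The resulting file can be analyzed (using curl or recognizeUsingWebSocket()) without any problems after an arbitrary number of frames has been added. However, recognizeUsingWebSocket() returns the final result as soon as it reaches the end of the FLAC file, even though the last frame in the file may not be final (i.e., it was written with encodeSamples(count, false)).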

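I would expect recognizeUsingWebSocket() to block until the final frame is written to the file. In practical terms, this means that the analysis stops after the first frame: analyzing the first frame takes less time than collecting the second, so by the time the results come back, the end of the file has been reached.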
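
The behavior described here follows from how file reads work: a FileInputStream reports end-of-stream (-1) as soon as it reaches the current end of the file, even if another thread is still appending. A pipe behaves differently, because its reader blocks until more data arrives or the writer closes. The sketch below shows that blocking behavior with only JDK classes. The class name PipeBlockingSketch and its parameters are made up for illustration, and the sleeping producer stands in for a microphone.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo, for illustration only.
class PipeBlockingSketch {

    /**
     * A producer thread writes `chunks` buffers with a pause before each one,
     * then closes the pipe. Returns the total bytes the reader saw before EOF.
     */
    static int readUntilWriterCloses(int chunks, int chunkSize, long pauseMillis) {
        try {
            PipedOutputStream out = new PipedOutputStream();
            PipedInputStream in = new PipedInputStream(out);
            Thread producer = new Thread(() -> {
                try (PipedOutputStream o = out) {      // closing signals end-of-stream
                    byte[] chunk = new byte[chunkSize];
                    for (int i = 0; i < chunks; i++) {
                        Thread.sleep(pauseMillis);     // simulates waiting for the next frame
                        o.write(chunk);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producer.start();

            int total = 0;
            byte[] buffer = new byte[4096];
            int read;
            // read() blocks while the pipe is empty and returns -1 only after close()
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
            in.close();
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling PipeBlockingSketch.readUntilWriterCloses(3, 1000, 50) returns only after the producer has closed the pipe, with every byte accounted for.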

Is this the right way to implement streaming audio from a mic in Java? It seems like a common use case.

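Here is a modification of RecognizeUsingWebSocketsExample that incorporates some of Daniel's suggestions. It uses the PCM content type (passed as a String that also carries the sampling rate) and attempts to signal the end of the audio stream, albeit not very successfully.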

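As before, the connection is made, but the recognition callback is never called. Closing the stream does not seem to be interpreted as the end of audio either. I must be misunderstanding something here...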

    public static void main(String[] args) throws IOException, LineUnavailableException, InterruptedException {

      final PipedOutputStream output = new PipedOutputStream();
      final PipedInputStream input = new PipedInputStream(output);

      final AudioFormat format = new AudioFormat(16000, 8, 1, true, false);
      DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
      final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
      line.open(format);
      line.start();

      Thread thread1 = new Thread(new Runnable() {
        @Override
        public void run() {
          try {
            final int MAX_FRAMES = 2;
            byte buffer[] = new byte[16000];
            for (int j1 = 0; j1 < MAX_FRAMES; j1++) {  // read two frames from microphone
              int count = line.read(buffer, 0, buffer.length);
              System.out.println("Read audio frame from line: " + count);
              output.write(buffer, 0, buffer.length);
              System.out.println("Written audio frame to pipe: " + count);
            }
            /** no need to fake end-of-audio;  StopMessage will be sent
             * automatically by SDK once the pipe is drained (see WebSocketManager)
            // signal end of audio; based on WebSocketUploader.stop() source
            byte[] stopData = new byte[0];
            output.write(stopData);
            **/
          } catch (IOException e) {
          }
        }
      });
      thread1.start();

      final CountDownLatch lock = new CountDownLatch(1);

      SpeechToText service = new SpeechToText();
      service.setUsernameAndPassword("<username>", "<password>");

      RecognizeOptions options = new RecognizeOptions.Builder()
        .continuous(true)
        .interimResults(false)
        .contentType("audio/pcm; rate=16000")
        .build();

      service.recognizeUsingWebSocket(input, options, new BaseRecognizeCallback() {
        @Override
        public void onConnected() {
          System.out.println("Connected.");
        }

        @Override
        public void onTranscription(SpeechResults speechResults) {
          System.out.println("Received results.");
          System.out.println(speechResults);
          if (speechResults.isFinal())
            lock.countDown();
        }
      });

      System.out.println("Waiting for STT callback ... ");

      lock.await(5, TimeUnit.SECONDS);

      line.stop();

      System.out.println("Done waiting for STT callback.");
    }

Dani, I instrumented the source of WebSocketManager (which comes with the SDK) and replaced a call to sendMessage() with an explicit StopMessage payload, as follows:

        /**
     * Send input steam.
     *
     * @param inputStream the input stream
     * @throws IOException Signals that an I/O exception has occurred.
     */
    private void sendInputSteam(InputStream inputStream) throws IOException {
      int cumulative = 0;
      byte[] buffer = new byte[FOUR_KB];
      int read;
      while ((read = inputStream.read(buffer)) > 0) {
        cumulative += read;
        if (read == FOUR_KB) {
          socket.sendMessage(RequestBody.create(WebSocket.BINARY, buffer));
        } else {
          System.out.println("completed sending " + cumulative/16000 + " frames over socket");
          socket.sendMessage(RequestBody.create(WebSocket.BINARY, Arrays.copyOfRange(buffer, 0, read)));  // partial buffer write
          System.out.println("signaling end of audio");
          socket.sendMessage(RequestBody.create(WebSocket.TEXT, buildStopMessage().toString()));  // end of audio signal

        }

      }
      inputStream.close();
    }

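Neither of the sendMessage() options (sending 0-length binary content or sending the stop text message) seems to work. The calling code is unchanged from above. The resulting output is: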

Waiting for STT callback ... 
Connected.
Read audio frame from line: 16000
Written audio frame to pipe: 16000
Read audio frame from line: 16000
Written audio frame to pipe: 16000
completed sending 2 frames over socket
onFailure: java.net.SocketException: Software caused connection abort: socket write error

REVISED: actually, the end-of-audio call is never reached. The exception is thrown while writing the last (partial) buffer to the socket.

Why is the connection aborted? That typically happens when the peer closes the connection.

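As for point 2): would either of these matter at this stage? It appears that the recognition process is not being started at all... The audio is valid (I wrote the stream out to disk and was able to have it recognized by streaming it from a file, as I pointed out above).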

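Also, on further review of the WebSocketManager source code, onMessage() already sends the StopMessage immediately upon return from sendInputSteam() (i.e., when the audio stream, or the pipe in the example above, drains), so there is no need to send it explicitly. The problem definitely occurs before the audio data transmission completes. The behavior is the same regardless of whether a PipedInputStream or an AudioInputStream is passed as input. In both cases, the exception is thrown while sending binary data.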
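
For reference, the "send chunks, then signal end of audio once the stream drains" pattern described above can be sketched independently of the SDK. In the sketch below, FrameSink is a hypothetical stand-in for the socket calls, and ChunkPumpSketch is not SDK code. One detail that may be worth noting: InputStream.read may return fewer bytes than requested before the stream ends, so a short read alone does not prove the audio is finished. The sketch therefore signals the end only after read returns -1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, for illustration only; not the SDK's WebSocketManager.
class ChunkPumpSketch {

    /** Stand-in for the websocket: binary frames plus one end-of-audio signal. */
    interface FrameSink {
        void binary(byte[] payload);
        void endOfAudio();
    }

    /** Forwards the stream in chunks and signals end of audio exactly once, at EOF. */
    static void pump(InputStream in, FrameSink sink, int chunkSize) throws IOException {
        byte[] buffer = new byte[chunkSize];
        int read;
        while ((read = in.read(buffer)) != -1) {
            if (read > 0) {
                sink.binary(Arrays.copyOf(buffer, read));  // full or partial chunk
            }
        }
        sink.endOfAudio();                                  // only after the stream drained
        in.close();
    }

    /** Records chunk sizes, with -1 marking the end-of-audio signal. */
    static List<Integer> demo(int totalBytes, int chunkSize) {
        List<Integer> events = new ArrayList<>();
        FrameSink sink = new FrameSink() {
            @Override public void binary(byte[] payload) { events.add(payload.length); }
            @Override public void endOfAudio() { events.add(-1); }
        };
        try {
            pump(new ByteArrayInputStream(new byte[totalBytes]), sink, chunkSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }
}
```

With 10000 bytes and a 4096-byte chunk size, the demo records two full chunks, one partial chunk of 1808 bytes, and a single end marker.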

Answer 1

The Java SDK supports this and includes an example.

Update your pom.xml with:

 <dependency>
   <groupId>com.ibm.watson.developer_cloud</groupId>
   <artifactId>java-sdk</artifactId>
   <version>3.3.1</version>
 </dependency>

Here is an example of how to listen to your microphone.

SpeechToText service = new SpeechToText();
service.setUsernameAndPassword("<username>", "<password>");

// Signed PCM AudioFormat with 16kHz, 16 bit sample size, mono
int sampleRate = 16000;
AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);

if (!AudioSystem.isLineSupported(info)) {
  System.out.println("Line not supported");
  System.exit(0);
}

TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
line.open(format);
line.start();

AudioInputStream audio = new AudioInputStream(line);

RecognizeOptions options = new RecognizeOptions.Builder()
  .continuous(true)
  .interimResults(true)
  .timestamps(true)
  .wordConfidence(true)
  //.inactivityTimeout(5) // use this to stop listening when the speaker pauses, i.e. for 5s
  .contentType(HttpMediaType.AUDIO_RAW + "; rate=" + sampleRate)
  .build();

service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
  @Override
  public void onTranscription(SpeechResults speechResults) {
    System.out.println(speechResults);
  }
});

System.out.println("Listening to your voice for the next 30s...");
Thread.sleep(30 * 1000);

// closing the WebSockets underlying InputStream will close the WebSocket itself.
line.stop();
line.close();

System.out.println("Fin.");

Answer 2

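What you need to do is feed the audio to the STT service not as a file but as a headerless stream of audio samples. You simply send the samples captured from the microphone over a WebSocket. You need to set the content type to "audio/pcm; rate=16000", where 16000 is the sampling rate in Hz. If your sampling rate is different, which depends on how the microphone encodes the audio, replace 16000 with your value, for example 44100, 48000, etc.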
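
To keep the rate in the content type from drifting away from the rate the line was actually opened with, the string can be derived from the AudioFormat itself. This is a small sketch. ContentTypeSketch and withRate are hypothetical names, and the base media type is passed in as a parameter so that no particular value is assumed here.

```java
import javax.sound.sampled.AudioFormat;

// Hypothetical helper, for illustration only.
class ContentTypeSketch {

    /** Appends the format's sampling rate (rounded to whole Hz) to a base media type. */
    static String withRate(String baseType, AudioFormat format) {
        return baseType + "; rate=" + Math.round(format.getSampleRate());
    }
}
```

For a format opened at 44100 Hz, withRate("audio/pcm", format) yields "audio/pcm; rate=44100".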

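When it is fed PCM audio, the STT service will not stop recognizing until you signal the end of audio by sending an empty binary message through the websocket.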

Dani

Looking at the new version of your code, I see some issues:

1) Signaling the end of audio is done by sending an empty binary message through the websocket, which is not what you are doing. The lines

 // signal end of audio; based on WebSocketUploader.stop() source
 byte[] stopData = new byte[0];
 output.write(stopData);

do nothing, because they do not result in an empty websocket message being sent. Could you call the method "WebSocketUploader.stop()" instead?
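
The point about the quoted lines can be checked with JDK classes alone: writing a zero-length array to a PipedOutputStream puts nothing into the pipe, so the reading side never sees any "empty message". Only close() is visible to the reader, as end-of-stream. EmptyWriteSketch below is a hypothetical demo class, not SDK code.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo, for illustration only.
class EmptyWriteSketch {

    /** Returns {bytes available after an empty write, result of read() after close()}. */
    static int[] probe() {
        try {
            PipedOutputStream output = new PipedOutputStream();
            PipedInputStream input = new PipedInputStream(output);

            output.write(new byte[0]);           // zero-length write: nothing enters the pipe
            int available = input.available();

            output.close();                      // this is what the reader can observe
            int afterClose = input.read();       // end-of-stream marker

            input.close();
            return new int[] {available, afterClose};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The probe reports 0 bytes available after the empty write and -1 from read() after close().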

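2) You are capturing audio at 8 bits per sample; you should use 16 bits to get adequate quality. Also, you are only feeding a couple of seconds of audio, which is not ideal for testing. Could you write whatever audio you push to STT to a file and then open it with Audacity (using the import feature)? That way you can make sure that what you are feeding to STT is good audio.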
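
One possible way to run that check, sketched with the standard javax.sound.sampled API: wrap the captured bytes in an AudioInputStream of known length and let AudioSystem.write produce a WAV file, which can then be opened in an audio editor. PcmDumpSketch and the temp-file prefix are made-up names, and the sketch assumes the byte array holds exactly what was pushed to the service.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper, for illustration only.
class PcmDumpSketch {

    /** Writes already-captured PCM bytes to a temporary WAV file and returns it. */
    static File dump(byte[] pcm, AudioFormat format) {
        try {
            File wav = File.createTempFile("stt-capture", ".wav");
            // The length is known here because capture has finished.
            long frames = pcm.length / format.getFrameSize();
            try (AudioInputStream audio =
                     new AudioInputStream(new ByteArrayInputStream(pcm), format, frames)) {
                AudioSystem.write(audio, AudioFileFormat.Type.WAVE, wav);
            }
            return wav;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The format passed in should be the same AudioFormat the TargetDataLine was opened with; otherwise the file will play back at the wrong speed or as noise.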
