Converting raw PCM data to RIFF WAV

Question

I am trying to convert raw audio data from one format to another for speech recognition.

Audio arrives from the Discord server in 20 ms chunks, in this format: 48 kHz, 16-bit, stereo, signed, big-endian PCM.
I am using CMU Sphinx for speech recognition, which takes its audio as an InputStream in this format: RIFF (little-endian) WAVE audio, 16-bit, mono, 16,000 Hz.

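The audio data arrives in a byte[] of length 3840. This array holds 20 ms of audio in the first format above. One second of audio is therefore 3840 * 50 = 192,000 bytes. That matches 48 kHz: 48,000 samples per second, times 2 because each 16-bit sample takes two 8-bit bytes, times 2 again for stereo, so 48,000 * 2 * 2 = 192,000.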
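
The arithmetic above can be checked with a few lines of Java. This is only an illustrative sketch; the class name PcmFrameMath and its methods are hypothetical and not part of Discord's or Sphinx's API.

```java
// Hypothetical helper for the byte-count arithmetic described above.
final class PcmFrameMath {
    private PcmFrameMath() {}

    /** Bytes of PCM produced per second of audio. */
    static int bytesPerSecond(int sampleRateHz, int bitsPerSample, int channels) {
        return sampleRateHz * (bitsPerSample / 8) * channels;
    }

    /** Bytes of PCM in one chunk of the given duration in milliseconds. */
    static int bytesPerChunk(int sampleRateHz, int bitsPerSample, int channels, int millis) {
        return bytesPerSecond(sampleRateHz, bitsPerSample, channels) * millis / 1000;
    }

    public static void main(String[] args) {
        System.out.println(bytesPerSecond(48_000, 16, 2));      // 192000
        System.out.println(bytesPerChunk(48_000, 16, 2, 20));   // 3840
        System.out.println(bytesPerSecond(16_000, 16, 1) * 3);  // 96000 (3 s of target audio)
    }
}
```

The last line shows that 3 seconds of 16 kHz, 16-bit mono audio should occupy 96,000 bytes, which is useful when sizing output buffers.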

So, whenever an audio packet is received, I first call this method:

private void addToPacket(byte[] toAdd) {
    if(packet.length >= 576000 && !done) {
        System.out.println("Processing needs to occur...");
        getResult(convertAudio());
        packet = null; // reset the packet
        return;
    }

    byte[] newPacket = new byte[packet.length + 3840];
    // copy old packet onto new temp array
    System.arraycopy(packet, 0, newPacket, 0, packet.length);
    // copy toAdd packet onto new temp array
    System.arraycopy(toAdd, 0, newPacket, 3840, toAdd.length);
    // overwrite the old packet with the newly resized packet
    packet = newPacket;
}
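
A note on the method as posted: after `packet = null`, the next call dereferences `packet.length` and throws a NullPointerException; the second `System.arraycopy` writes at the fixed offset 3840 rather than at the end of the existing data; and the chunk that triggers processing is discarded. A minimal sketch of one way to accumulate chunks without these problems follows. PacketAccumulator is a hypothetical name, and the threshold is assumed to be the 576,000 bytes used in the question.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: collects fixed-size chunks until a byte threshold is reached.
final class PacketAccumulator {
    private final int thresholdBytes;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    PacketAccumulator(int thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    /**
     * Appends one chunk. Returns everything accumulated so far once the
     * threshold is reached (and resets the buffer); otherwise returns null.
     */
    synchronized byte[] add(byte[] chunk) {
        buffer.write(chunk, 0, chunk.length);
        if (buffer.size() < thresholdBytes) {
            return null;
        }
        byte[] full = buffer.toByteArray();
        buffer.reset();
        return full;
    }
}
```

With a threshold of 576,000, the 150th call with a 3840-byte chunk returns the full 3-second buffer, which can then be handed to the conversion step.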

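The addToPacket method just keeps appending new packets to one large byte[] until it holds 3 seconds of audio (576,000 bytes, i.e. 192,000 * 3). Three seconds should be enough time (just a guess) to detect whether the user said the bot's activation hotword, e.g. "hey computer". Here is my method for converting the audio data: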

    private byte[] convertAudio() {
        // STEP 1 - DROP EVERY OTHER PACKET TO REMOVE STEREO FROM THE AUDIO
        byte[] mono = new byte[96000];
        for(int i = 0, j = 0; i % 2 == 0 && i < packet.length; i++, j++) {
            mono[j] = packet[i];
        }

        // STEP 2 - DROP EVERY 3RD PACKET TO CONVERT TO 16K HZ Audio
        byte[] resampled = new byte[32000];
        for(int i = 0, j = 0; i % 3 == 0 && i < mono.length; i++, j++) {
            resampled[j] = mono[i];
        }

        // STEP 3 - CONVERT TO LITTLE ENDIAN
        ByteBuffer buffer = ByteBuffer.allocate(resampled.length);
        buffer.order(ByteOrder.BIG_ENDIAN);
        for(byte b : resampled) {
            buffer.put(b);
        }
        buffer.order(ByteOrder.LITTLE_ENDIAN);
        buffer.rewind();
        for(int i = 0; i < resampled.length; i++) {
            resampled[i] = buffer.get(i);
        }

        return resampled;
    }
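
Three things in the method as posted prevent it from producing valid audio. Each loop condition includes `i % 2 == 0` or `i % 3 == 0`, so the loop exits as soon as `i` reaches 1 and only one byte is copied. Dropping individual bytes would also split 16-bit samples, since each stereo frame is 4 bytes. Finally, `ByteBuffer.order` has no effect on single-byte `put` and `get` calls, so step 3 does not swap anything. The following is a rough sketch of a sample-aware conversion, assuming the input format stated in the question. PcmConverter is a hypothetical name, and averaging three frames is only a crude stand-in for a proper low-pass filter before downsampling.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of any library.
final class PcmConverter {
    private PcmConverter() {}

    /**
     * Converts 48 kHz, 16-bit, stereo, big-endian PCM into
     * 16 kHz, 16-bit, mono, little-endian PCM (raw samples, no WAV header).
     */
    static byte[] toMono16kLittleEndian(byte[] stereo48kBigEndian) {
        final int bytesPerInputFrame = 4; // 2 channels * 2 bytes per sample
        final int decimation = 3;         // 48000 / 16000
        int inputFrames = stereo48kBigEndian.length / bytesPerInputFrame;
        int outputFrames = inputFrames / decimation;

        ByteBuffer in = ByteBuffer.wrap(stereo48kBigEndian).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer out = ByteBuffer.allocate(outputFrames * 2).order(ByteOrder.LITTLE_ENDIAN);

        for (int f = 0; f < outputFrames; f++) {
            int sum = 0;
            // Read 3 stereo frames (6 samples) and average them:
            // this downmixes to mono and smooths before decimating.
            for (int k = 0; k < decimation * 2; k++) {
                sum += in.getShort();
            }
            out.putShort((short) (sum / (decimation * 2)));
        }
        return out.array();
    }
}
```

For the 576,000-byte buffer in the question this yields 96,000 bytes, i.e. 3 seconds of 16 kHz mono audio.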

Finally, I attempt to recognize the speech:

private void getResult(byte[] toProcess) {
    InputStream stream = new ByteArrayInputStream(toProcess);
    recognizer.startRecognition(stream);
    SpeechResult result;
    while ((result = recognizer.getResult()) != null) {
        System.out.format("Hypothesis: %s\n", result.getHypothesis());
    }
    recognizer.stopRecognition();
}

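The problem is that CMU Sphinx neither crashes nor gives any error message; it just produces an empty hypothesis every 3 seconds. I'm not sure why, but my guess is that I'm not converting the audio correctly. Any ideas? Any help would be greatly appreciated.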

Answer 1

So, it turns out there is a better built-in solution for converting audio from a byte[].

Here is what I found works well:

        // Specify the output format you want
        AudioFormat target = new AudioFormat(16000f, 16, 1, true, false);
        // Get the audio stream ready, and pass in the raw byte[]
        AudioInputStream is = AudioSystem.getAudioInputStream(target, new AudioInputStream(new ByteArrayInputStream(raw), AudioReceiveHandler.OUTPUT_FORMAT, raw.length));
        // Write a temporary WAV file somewhere on the computer; a stream over that file can then be used for recognition
        try {
            AudioSystem.write(is, AudioFileFormat.Type.WAVE, new File("C:\\filename.wav"));
        } catch (Exception e) {
            e.printStackTrace(); // surface conversion or I/O failures instead of silently ignoring them
        }
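
If a temporary file is undesirable, and the PCM data is already 16 kHz, 16-bit, mono, little-endian, a RIFF WAVE container can also be built in memory by prepending the standard 44-byte PCM header. The sketch below is an assumption-laden illustration rather than part of the answer's approach; WavWriter and wrapPcm are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of javax.sound.sampled.
final class WavWriter {
    private WavWriter() {}

    /** Prepends a 44-byte canonical PCM WAV header to little-endian PCM data. */
    static byte[] wrapPcm(byte[] pcm, int sampleRateHz, int channels, int bitsPerSample) {
        int blockAlign = channels * bitsPerSample / 8; // bytes per frame
        int byteRate = sampleRateHz * blockAlign;      // bytes per second
        ByteBuffer b = ByteBuffer.allocate(44 + pcm.length).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        b.putInt(36 + pcm.length);                     // size of everything after this field
        b.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        b.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        b.putInt(16);                                  // fmt chunk size for plain PCM
        b.putShort((short) 1);                         // audio format 1 = PCM
        b.putShort((short) channels);
        b.putInt(sampleRateHz);
        b.putInt(byteRate);
        b.putShort((short) blockAlign);
        b.putShort((short) bitsPerSample);
        b.put("data".getBytes(StandardCharsets.US_ASCII));
        b.putInt(pcm.length);                          // size of the sample data
        b.put(pcm);
        return b.array();
    }
}
```

The result of `WavWriter.wrapPcm(pcm, 16000, 1, 16)` can be wrapped in a ByteArrayInputStream and passed wherever a WAV stream is expected.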

Answer 1: accepted, 2017-12-24 05:34:08
