
Converting Raw PCM Data to RIFF WAV

I'm attempting to convert raw audio data from one format to another for the purposes of voice recognition.

  • The audio is received from a Discord server in 20ms chunks in the format: 48Khz, 16-bit stereo signed BigEndian PCM.
  • I'm using CMU's Sphinx for voice recognition, which takes audio as an InputStream in RIFF (little-endian) WAVE audio, 16-bit, mono 16,000Hz.

Audio data is received in a byte[] with length 3840. This byte[] array contains 20ms of audio in format 1 described above. That means 1 second of this audio is 3840 * 50 bytes, which is 192,000. So that's 192,000 bytes per second. This makes sense: a 48KHz sample rate, times 2 because the audio is 16-bit and each sample takes two 8-bit bytes, times another 2 for stereo. So 48,000 * 2 * 2 = 192,000.
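To make that arithmetic concrete, here is a tiny sketch of the same numbers as constants (the variable names are illustrative only, not from the original code):

int sampleRate     = 48_000;                                  // samples per second, per channel
int bytesPerSample = 16 / 8;                                  // 16-bit audio = 2 bytes per sample
int channels       = 2;                                       // stereo
int bytesPerSecond = sampleRate * bytesPerSample * channels;  // 192,000
int bytesPerPacket = bytesPerSecond / 50;                     // one 20 ms chunk = 3,840 bytes
int threeSeconds   = bytesPerSecond * 3;                      // 576,000 bytes, the threshold used below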

So I first call this method every time an audio packet is received:

private void addToPacket(byte[] toAdd) {
    if(packet.length >= 576000 && !done) {
        System.out.println("Processing needs to occur...");
        getResult(convertAudio());
        packet = null; // reset the packet
        return;
    }

    byte[] newPacket = new byte[packet.length + 3840];
    // copy old packet onto new temp array
    System.arraycopy(packet, 0, newPacket, 0, packet.length);
    // copy toAdd packet onto new temp array
    System.arraycopy(toAdd, 0, newPacket, 3840, toAdd.length);
    // overwrite the old packet with the newly resized packet
    packet = newPacket;
}
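(As a side note, the grow-and-copy pattern above can also be written with a java.io.ByteArrayOutputStream; a rough sketch under the same 576,000-byte threshold, where the packetBuffer field and the method name are hypothetical:)

private final ByteArrayOutputStream packetBuffer = new ByteArrayOutputStream();

private void addToPacketAlt(byte[] toAdd) {
    packetBuffer.write(toAdd, 0, toAdd.length);       // append the 20 ms chunk
    if (packetBuffer.size() >= 576000 && !done) {     // roughly 3 seconds buffered
        System.out.println("Processing needs to occur...");
        packet = packetBuffer.toByteArray();          // hand the buffered audio to convertAudio()
        getResult(convertAudio());
        packetBuffer.reset();                         // start a fresh buffer
    }
}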

The addToPacket method just appends each new packet onto one big byte[] until the byte[] contains 3 seconds of audio data (576,000 bytes, or 192,000 * 3). 3 seconds of audio data is enough time (just a guess) to detect whether the user said the bot's activation hot word, like "hey computer". Here's how I convert the sound data:

private byte[] convertAudio() {
    // STEP 1 - DROP EVERY OTHER PACKET TO REMOVE STEREO FROM THE AUDIO
    byte[] mono = new byte[96000];
    for(int i = 0, j = 0; i % 2 == 0 && i < packet.length; i++, j++) {
        mono[j] = packet[i];
    }

    // STEP 2 - DROP EVERY 3RD PACKET TO CONVERT TO 16K HZ Audio
    byte[] resampled = new byte[32000];
    for(int i = 0, j = 0; i % 3 == 0 && i < mono.length; i++, j++) {
        resampled[j] = mono[i];
    }

    // STEP 3 - CONVERT TO LITTLE ENDIAN
    ByteBuffer buffer = ByteBuffer.allocate(resampled.length);
    buffer.order(ByteOrder.BIG_ENDIAN);
    for(byte b : resampled) {
        buffer.put(b);
    }
    buffer.order(ByteOrder.LITTLE_ENDIAN);
    buffer.rewind();
    for(int i = 0; i < resampled.length; i++) {
        resampled[i] = buffer.get(i);
    }

    return resampled;
}
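For comparison, here is a minimal sketch of what those three steps look like when they operate on whole 16-bit samples rather than on individual bytes (the method name is illustrative only; it uses java.nio.ShortBuffer alongside ByteBuffer/ByteOrder, and the 48 kHz to 16 kHz step is plain decimation with no low-pass filter, so it is only an approximation, not a drop-in fix):

private byte[] convertAudioSampleWise(byte[] packet) {
    // Interpret the raw 48 kHz, 16-bit, stereo, big-endian bytes as 16-bit samples
    ShortBuffer in = ByteBuffer.wrap(packet).order(ByteOrder.BIG_ENDIAN).asShortBuffer();
    int stereoFrames = in.remaining() / 2;               // one frame = left sample + right sample

    // Keep every 3rd frame (48 kHz -> 16 kHz), mix left/right down to mono,
    // and write the surviving samples out as little-endian bytes
    ByteBuffer out = ByteBuffer.allocate((stereoFrames / 3 + 1) * 2).order(ByteOrder.LITTLE_ENDIAN);
    for (int frame = 0; frame < stereoFrames; frame += 3) {
        short left  = in.get(frame * 2);
        short right = in.get(frame * 2 + 1);
        out.putShort((short) ((left + right) / 2));      // average the two channels
    }

    byte[] result = new byte[out.position()];
    out.rewind();
    out.get(result);
    return result;
}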

And finally, attempt to recognize the speech:

private void getResult(byte[] toProcess) {
    InputStream stream = new ByteArrayInputStream(toProcess);
    recognizer.startRecognition(stream);
    SpeechResult result;
    while ((result = recognizer.getResult()) != null) {
        System.out.format("Hypothesis: %s\n", result.getHypothesis());
    }
    recognizer.stopRecognition();
}
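For reference, recognizer here is assumed to be a Sphinx4 StreamSpeechRecognizer set up roughly as in the Sphinx4 tutorial; the model paths below are the stock US-English defaults and are an assumption, not taken from the original code:

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

// Assumed recognizer setup (stock US-English models shipped with Sphinx4)
Configuration configuration = new Configuration();
configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");
StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);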

The problem I'm having is that CMUSphinx doesn't crash or provide any error messages; it just comes up with an empty hypothesis every 3 seconds. I'm not exactly sure why, but my guess is that I didn't convert the sound correctly. Any ideas? Any help would be greatly appreciated.

So, there's actually a much better, built-in solution for converting audio from a byte[].

Here's what I found works pretty well:

// Specify the output format you want
AudioFormat target = new AudioFormat(16000f, 16, 1, true, false);
// Get the audio stream ready, passing in the raw byte[] together with its source
// format (JDA's AudioReceiveHandler.OUTPUT_FORMAT: 48kHz, 16-bit, stereo, signed, big-endian PCM)
AudioInputStream is = AudioSystem.getAudioInputStream(target,
        new AudioInputStream(new ByteArrayInputStream(raw),
                AudioReceiveHandler.OUTPUT_FORMAT, raw.length));
// Write a temporary WAV file somewhere on disk; a stream over that file (or the
// AudioInputStream above) can then be used for recognition
try {
    AudioSystem.write(is, AudioFileFormat.Type.WAVE, new File("C:\\filename.wav"));
} catch (Exception e) {
    e.printStackTrace();
}
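If writing a temporary file is undesirable, AudioSystem.write also accepts an OutputStream, so the converted WAV data can stay in memory and be fed straight to the recognizer. A small sketch along those lines, assuming the same is (AudioInputStream) and recognizer as above and the usual java.io stream classes:

try {
    // Render the converted audio as WAV data in memory instead of on disk
    ByteArrayOutputStream wavBytes = new ByteArrayOutputStream();
    AudioSystem.write(is, AudioFileFormat.Type.WAVE, wavBytes);
    // Hand the in-memory WAV straight to CMU Sphinx
    recognizer.startRecognition(new ByteArrayInputStream(wavBytes.toByteArray()));
} catch (IOException e) {
    e.printStackTrace();
}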
