
How to convert the Float32Array format of native HTML5 recorded audio to proper bytes for the Google Speech-to-Text service?

If you follow this tutorial: https://medium.com/ideas-at-igenius/delivering-a-smooth-cross-browser-speech-to-text-experience-b1e1f1f194a2 you will manage to create a script processor to which you add a listener:

scriptProcessor = inputPoint.context.createScriptProcessor(bufferSize, in_channels, out_channels)
//...
scriptProcessor.addEventListener('audioprocess', streamAudioData)

Inside the callback, calling callback_param.inputBuffer.getChannelData(0) returns a JavaScript Float32Array which, judging by the data, contains float values from -1.0 to +1.0.

Therefore, streaming this to the backend, which in turn streams it to the Google Speech-to-Text service, yields nothing (as expected).

The Google Speech-to-Text service, at least in Python, expects streaming input as a byte string of WAV-style audio containing the sound at the specified sample rate (i.e. 16000 Hz). Note that if the backend streams it a file, this works OK.

This conversion has failed: Float32Array -> Int16Array -> byte string.

Has anyone found the appropriate conversions for the above to work?

Alternatively, are you aware of a simpler, more robust path for: microphone in browser -> stream data via WebSocket to backend server -> stream data to the Google Speech-to-Text service -> get responses as expected?


Edit: adding the Python code for the RecognitionConfig of the Google Speech API:

config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code=self.language_code)

OK, did some digging, and found the actual documentation which has the proper information:

LINEAR16 - Uncompressed 16-bit signed little-endian samples (Linear PCM).

The key parts being:

  • 16 bits per sample
  • Signed
  • Little-endian

So, what you need to do is scale your floating point values (-1.0 ... 1.0) to integers between -32768 and 32767.
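As a minimal sketch of that scaling step (the clamping and the function name `floatTo16BitSample` are my own additions, not from the original answer):

```javascript
// Convert one Float32 sample in the range -1.0..1.0
// to a signed 16-bit integer in the range -32768..32767.
function floatTo16BitSample(sample) {
  // Clamp first, in case a sample falls slightly outside the nominal range.
  const s = Math.max(-1, Math.min(1, sample));
  // Negative values scale toward -32768, positive values toward 32767.
  return Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
}
```

Note the asymmetry: the signed 16-bit range has one more negative value than positive, which is why the two halves use different scale factors.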

There isn't any built-in JavaScript method to do this for you. Your conversions between Float32Array and Int16Array don't work because you'll just end up with values approximating -1, 0, and 1. The other reason you can't use Int16Array is that its endianness is platform dependent!

What you need to do is get cozy with ArrayBuffers and manipulate them with a DataView. Take each sample, do some math, write the bytes, move to the next sample. When you're done, both XHR and the Fetch API support sending an ArrayBuffer as the HTTP request body. Or, you can instantiate a new Blob with that ArrayBuffer and do other things with it.
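Putting the pieces together, a sketch of that loop might look like this (the function name `floatTo16BitPCM` and the clamping are assumptions of mine; the DataView approach and little-endian flag are what the answer describes):

```javascript
// Convert a Float32Array of samples (-1.0..1.0) into an ArrayBuffer of
// 16-bit signed little-endian PCM, matching Google's LINEAR16 encoding.
function floatTo16BitPCM(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2); // 2 bytes per sample
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // The final `true` argument forces little-endian byte order,
    // regardless of the host platform's native endianness.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

The resulting ArrayBuffer can then be sent over a WebSocket or as an XHR/Fetch request body, and streamed to the LINEAR16-configured recognizer on the backend.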
