How to simultaneously read audio samples while recording in python for real-time speech to text conversion?

Basically, I have trained a few models using Keras to do isolated word recognition. Currently I can record audio with sounddevice's rec function for a pre-fixed duration and save the audio as a wav file. I have implemented silence detection to trim out unwanted samples, but all of that runs only after the whole recording is complete. I would like to get the trimmed audio segments immediately, while the recording is still in progress, so that I can do speech recognition in real time. I'm using Python 2 and TensorFlow 1.14.0. Below is a snippet of what I currently have:

import time
import multiprocessing
from contextlib import closing

import cv2
import matplotlib.pyplot as plt
import numpy as np
import sounddevice as sd
from pydub import AudioSegment
from scipy.io.wavfile import read, write

import tensorflow as tf
tf.compat.v1.enable_eager_execution()
tf.compat.v1.enable_v2_behavior()

models = ['model1.h5', 'model2.h5', 'model3.h5', 'model4.h5', 'model5.h5']
loaded_models = []

for model in models:
    loaded_models.append(tf.keras.models.load_model(model))

def prediction(model_ip):
    # Unpack a (model, input) tuple and return the class scores as a list.
    model, t = model_ip
    ret_val = model.predict(t).tolist()[0]
    return ret_val

print("recording in 5sec")
time.sleep(5)
fs = 44100  # Sample rate
seconds = 10  # Duration of recording
print('recording')
time.sleep(0.5)
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
sd.wait()
thresh=0.025
gaplimit=9000
wav_file='/home/nick/Desktop/Endpoint/aud.wav'
write(wav_file, fs, myrecording)
fs,myrecording = read(wav_file)[0], read(wav_file)[1]
#Now the silence removal function is called which trims and saves only the useful audio samples in the form of a wav file. This trimmed audio contains the full word which can be recognized. 
end_points(wav_file,thresh,50)

# The loop below pairs each loaded model (I'm using multiple models) with
# the input in a tuple so the predictions can run in parallel.
final_ans = ''
for trimmed_aud in trimmed_audio:
    ...
    ... # The trimmed audio is processed further; the input which the
        # models can predict on is t.
    ...
    modelon = []
    for md in loaded_models:
        modelon.append((md, t))
    start_time = time.time()
    with closing(multiprocessing.Pool()) as p:
        predops = p.map(prediction, modelon)
    print('Total time taken: {}'.format(time.time() - start_time))
    actops = []
    for predop in predops:
        actops.append(predop.index(max(predop)))
    print(actops)
    # Majority vote across the models gives the final label.
    max_freqq = max(set(actops), key=actops.count)
    final_ans += str(max_freqq)
print("Output: {}".format(final_ans))

Note that the above code only includes what is relevant to the question and will not run as-is; I wanted to give an overview of what I have so far. I would really appreciate your input on how I can record and trim audio based on a threshold at the same time, so that if multiple words are spoken within the 10-second recording duration (the seconds variable in the code), then, as I speak, whenever the energy of the samples over a 50 ms window drops below a certain threshold, I cut the audio at those two points, trim it, and use it for prediction. Recording and prediction of the trimmed segments must happen simultaneously, so that each output word can be displayed immediately after it is uttered during the 10 seconds of recording. Any suggestions on how I can go about this would be much appreciated.
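For concreteness, here is a minimal sketch of the producer/consumer pattern I have in mind, using sounddevice's InputStream callback to feed raw blocks into a queue while a worker thread does the windowed energy check. The names endpointer, block_queue and segment_queue are my own placeholders, min_silence_wins (how much silence ends a word) is an assumption, and RMS is just one possible energy measure; the prediction code above would consume segment_queue.

import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2, as used in the question

import numpy as np
import sounddevice as sd

fs = 44100
win = int(0.05 * fs)        # 50 ms analysis window, as described above
thresh = 0.025              # same energy threshold as in the question
min_silence_wins = 6        # ~300 ms of silence ends a word (assumption)

block_queue = queue.Queue()    # raw audio blocks from the callback
segment_queue = queue.Queue()  # trimmed word segments, ready for prediction

def callback(indata, frames, time_info, status):
    # Runs on the audio thread: do no heavy work here, just hand the
    # samples off. Copy, because sounddevice reuses its buffers.
    block_queue.put(indata[:, 0].copy())

def endpointer(total_seconds=10):
    buf = np.empty(0, dtype='float32')
    segment, silent_wins, in_speech = [], 0, False
    for _ in range(int(total_seconds * fs) // win):
        while len(buf) < win:          # gather one full 50 ms window
            buf = np.concatenate((buf, block_queue.get()))
        window, buf = buf[:win], buf[win:]
        energy = np.sqrt(np.mean(window ** 2))   # RMS energy of the window
        if energy >= thresh:
            in_speech, silent_wins = True, 0
            segment.append(window)
        elif in_speech:
            segment.append(window)
            silent_wins += 1
            if silent_wins >= min_silence_wins:  # word ended: trim the tail
                segment_queue.put(np.concatenate(segment[:-silent_wins]))
                segment, silent_wins, in_speech = [], 0, False
    if in_speech and segment:                    # flush a trailing word
        segment_queue.put(np.concatenate(segment))

# A second worker would pull from segment_queue and run the multiprocessing
# prediction code above while recording continues.
worker = threading.Thread(target=endpointer)
worker.start()
with sd.InputStream(samplerate=fs, channels=1, callback=callback):
    worker.join()   # keep recording until 10 s of audio has been consumed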

Hard to say what your model architecture is, but there are models specifically designed for streaming recognition, like Facebook's streaming convnets. You won't be able to implement them easily in Keras, though.
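That said, as a very rough illustration of the pattern such streaming models follow (not the streaming convnets themselves), you could run whatever chunk-level model you do have over a sliding window of the incoming audio instead of waiting for a full utterance. Here featurize is a hypothetical stand-in for your feature extraction, model stands for one of your loaded Keras models, and fs and block_queue are as in the sketch above.

import numpy as np

chunk = int(0.5 * fs)   # 500 ms inference window (an assumption)
hop = int(0.25 * fs)    # slide by 250 ms, i.e. 50% overlap
audio = np.empty(0, dtype='float32')

while True:             # simplified: run until interrupted
    audio = np.concatenate((audio, block_queue.get()))
    while len(audio) >= chunk:
        scores = model.predict(featurize(audio[:chunk]))  # hypothetical featurize
        print(scores.argmax())   # emit a label as soon as the chunk is complete
        audio = audio[hop:]      # slide the window forward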
