
How to simultaneously read audio samples while recording in Python for real-time speech-to-text conversion?

Basically I have trained a few models using Keras to do isolated word recognition. Currently I can record audio with the sounddevice rec function for a fixed duration and save the recording as a WAV file. I have implemented silence detection to trim out unwanted samples, but all of this only runs after the whole recording is complete. I would like to get the trimmed audio segments immediately, while the recording is still in progress, so that I can do speech recognition in real time. I'm using Python 2 and TensorFlow 1.14.0. Below is a snippet of what I currently have:

import sounddevice as sd
import matplotlib.pyplot as plt
import time
#import tensorflow.keras.backend as K
import numpy as np 
from scipy.io.wavfile import read, write
from pydub import AudioSegment
import cv2
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
tf.compat.v1.enable_v2_behavior()
from contextlib import closing
import multiprocessing 

# Load the trained word models once, up front
models=['model1.h5','model2.h5','model3.h5','model4.h5','model5.h5']
loaded_models=[]

for model in models:
    loaded_models.append(tf.keras.models.load_model(model))

def prediction(model_ip):
    # Helper for Pool.map: unpack a (model, input) tuple and return that
    # model's class probabilities as a flat list
    model, t = model_ip
    ret_val = model.predict(t).tolist()[0]
    return ret_val

print("recording in 5sec")
time.sleep(5)
fs = 44100  # Sample rate
seconds = 10  # Duration of recording
print('recording')
time.sleep(0.5)
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
sd.wait()
thresh = 0.025  # energy threshold for the silence detection
gaplimit = 9000
wav_file = '/home/nick/Desktop/Endpoint/aud.wav'
write(wav_file, fs, myrecording)
fs, myrecording = read(wav_file)  # read returns the (sample_rate, data) pair directly
# Now the silence-removal function is called; it trims the recording so that
# only the useful audio samples remain, and each trimmed segment contains a
# full word that can be recognized.
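# end_points below is a hypothetical sketch of such a function, reconstructed
# from the description in this post (the actual implementation is not shown,
# and this version returns the voiced segments directly instead of saving wav
# files): slide a window over the signal, compute RMS energy per window, and
# split into word segments wherever the energy drops below the threshold.
def end_points(wav_path, thresh, win_ms):
    fs, sig = read(wav_path)
    if sig.dtype.kind == 'i':                      # int16 PCM -> [-1.0, 1.0]
        sig = sig.astype(np.float32) / 32768.0
    win = int(fs * win_ms / 1000.0)                # samples per analysis window
    segments, current = [], []
    for i in range(0, len(sig), win):
        frame = sig[i:i + win]
        if np.sqrt(np.mean(frame ** 2)) > thresh:  # voiced window: keep it
            current.append(frame)
        elif current:                              # silence after speech: close the word
            segments.append(np.concatenate(current))
            current = []
    if current:                                    # flush a word that ran to the end
        segments.append(np.concatenate(current))
    return segments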
trimmed_audio = end_points(wav_file, thresh, 50)  # 50 ms analysis window

# The loop below pairs each of the loaded models (I'm using multiple models) with the input in a tuple
final_ans = ''  # accumulates one recognized label per trimmed segment
for trimmed_aud in trimmed_audio:
    ...
    ... # The trimmed audio is processed further; the model-ready input is t
    ...
    modelon=[]
    for md in loaded_models:
        modelon.append((md,t))  # pair every model with the same input
    start_time=time.time()
    with closing(multiprocessing.Pool()) as p:
        predops=p.map(prediction,modelon)  # run all models in parallel
    print('Total time taken: {}'.format(time.time() - start_time))
    actops=[]
    for predop in predops:
        actops.append(predop.index(max(predop)))  # argmax: predicted class per model
    print(actops)
    max_freqq = max(set(actops), key=actops.count)  # majority vote across models
    final_ans += str(max_freqq)
print("Output: {}".format(final_ans))

Note that the above code only includes what is relevant to the question and will not run as-is; I wanted to give an overview of what I have so far. What I'm after is this: if multiple words are spoken within the 10-second recording duration (the seconds variable in the code), then as I speak, whenever the energy of the samples in a 50 ms window drops below a certain threshold, I cut the audio at those two points, trim the segment, and use it for prediction. Recording and prediction of the trimmed segments must happen simultaneously, so that each word's output can be displayed immediately after it is uttered during the 10 seconds of recording. In outline, the pipeline I'm imagining looks like the sketch below. I would really appreciate any suggestions on how to go about this.
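A rough sketch of that shape, as an illustration rather than working code: sounddevice's callback-based InputStream delivers audio block by block, the blocks go through a queue to a consumer thread that applies the same 50 ms RMS test as above, and predict_segment is just a placeholder for the model-ensemble prediction from the snippet.

import queue        # named Queue on Python 2
import threading

import numpy as np
import sounddevice as sd

fs = 44100
win = int(fs * 0.05)    # 50 ms analysis window
thresh = 0.025          # same energy threshold as above
blocks = queue.Queue()  # audio callback -> consumer thread

def callback(indata, frames, time_info, status):
    # Runs on the audio thread: just copy the block and hand it off
    blocks.put(indata.copy())

def predict_segment(segment):
    # Placeholder: feed the trimmed segment to the model ensemble here
    print('segment of {} samples ready for prediction'.format(len(segment)))

def consumer(stop):
    buf = np.empty((0, 1), dtype='float32')
    current = []                               # windows of the word being spoken
    while not stop.is_set() or not blocks.empty():
        try:
            buf = np.concatenate([buf, blocks.get(timeout=0.1)])
        except queue.Empty:
            continue
        while len(buf) >= win:                 # consume whole 50 ms windows
            frame, buf = buf[:win], buf[win:]
            if np.sqrt(np.mean(frame ** 2)) > thresh:
                current.append(frame)          # voiced: extend the current word
            elif current:                      # silence after speech: word is done
                predict_segment(np.concatenate(current))
                current = []
    if current:
        predict_segment(np.concatenate(current))

stop = threading.Event()
worker = threading.Thread(target=consumer, args=(stop,))
worker.start()
with sd.InputStream(samplerate=fs, channels=1, callback=callback):
    sd.sleep(10 * 1000)   # keep recording for 10 seconds while segments stream out
stop.set()
worker.join()

The consumer could just as well hand segments to a multiprocessing pool like the one in the snippet, so that model inference never blocks the audio callback.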

Hard to say what your model architecture is, but there are models specifically designed for streaming recognition, such as Facebook's streaming convnets. You won't be able to implement them in Keras easily, though.
