How to simultaneously read audio samples while recording in python for real-time speech to text conversion?

Basically, I have trained a few models using Keras to do isolated word recognition. Currently I can record audio with sounddevice's rec function for a pre-fixed duration and save the audio as a wav file. I have implemented silence detection to trim out unwanted samples, but all of that runs only after the whole recording is complete. I would like to get the trimmed audio segments immediately, while the recording is still in progress, so that I can do speech recognition in real time. I'm using Python 2 and TensorFlow 1.14.0. Below is a snippet of what I currently have:

import time
import multiprocessing
from contextlib import closing

import cv2
import matplotlib.pyplot as plt
import numpy as np
import sounddevice as sd
from pydub import AudioSegment
from scipy.io.wavfile import read, write

import tensorflow as tf
tf.compat.v1.enable_eager_execution()
tf.compat.v1.enable_v2_behavior()

models = ['model1.h5', 'model2.h5', 'model3.h5', 'model4.h5', 'model5.h5']
loaded_models = []

for model in models:
    loaded_models.append(tf.keras.models.load_model(model))

def prediction(model_ip):
    # Unpack a (model, input) tuple and return the class scores as a list.
    model, t = model_ip
    ret_val = model.predict(t).tolist()[0]
    return ret_val

print("recording in 5sec")
time.sleep(5)
fs = 44100  # Sample rate
seconds = 10  # Duration of recording
print('recording')
time.sleep(0.5)
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
sd.wait()
thresh=0.025
gaplimit=9000
wav_file='/home/nick/Desktop/Endpoint/aud.wav'
write(wav_file, fs, myrecording)
fs,myrecording = read(wav_file)[0], read(wav_file)[1]
#Now the silence removal function is called which trims and saves only the useful audio samples in the form of a wav file. This trimmed audio contains the full word which can be recognized. 
end_points(wav_file,thresh,50)

# The loop below pairs each loaded model (I'm using multiple models) with
# the input in a tuple so the predictions can run in parallel.
final_ans = ''
for trimmed_aud in trimmed_audio:
    ...
    ... # The trimmed audio is processed further; the input which the
        # models can predict on is t.
    ...
    modelon = []
    for md in loaded_models:
        modelon.append((md, t))
    start_time = time.time()
    with closing(multiprocessing.Pool()) as p:
        predops = p.map(prediction, modelon)
    print('Total time taken: {}'.format(time.time() - start_time))
    actops = []
    for predop in predops:
        actops.append(predop.index(max(predop)))
    print(actops)
    # Majority vote across the models gives the final label.
    max_freqq = max(set(actops), key=actops.count)
    final_ans += str(max_freqq)
print("Output: {}".format(final_ans))

Note that the above code only includes what is relevant to the question and will not run as-is; I wanted to give an overview of what I have so far. I would really appreciate your input on how I can record and trim audio based on a threshold at the same time, so that if multiple words are spoken within the 10-second recording duration (the seconds variable in the code), then, as I speak, whenever the energy of the samples over a 50 ms window drops below a certain threshold, I cut the audio at those two points, trim it, and use it for prediction. Recording and prediction of the trimmed segments must happen simultaneously, so that each output word can be displayed immediately after it is uttered during the 10 seconds of recording. Any suggestions on how I can go about this would be much appreciated.
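For concreteness, here is a minimal sketch of the producer/consumer pattern I have in mind, using sounddevice's InputStream callback to feed raw blocks into a queue while a worker thread does the windowed energy check. The names endpointer, block_queue and segment_queue are my own placeholders, min_silence_wins (how much silence ends a word) is an assumption, and RMS is just one possible energy measure; the prediction code above would consume segment_queue.

import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2, as used in the question

import numpy as np
import sounddevice as sd

fs = 44100
win = int(0.05 * fs)        # 50 ms analysis window, as described above
thresh = 0.025              # same energy threshold as in the question
min_silence_wins = 6        # ~300 ms of silence ends a word (assumption)

block_queue = queue.Queue()    # raw audio blocks from the callback
segment_queue = queue.Queue()  # trimmed word segments, ready for prediction

def callback(indata, frames, time_info, status):
    # Runs on the audio thread: do no heavy work here, just hand the
    # samples off. Copy, because sounddevice reuses its buffers.
    block_queue.put(indata[:, 0].copy())

def endpointer(total_seconds=10):
    buf = np.empty(0, dtype='float32')
    segment, silent_wins, in_speech = [], 0, False
    for _ in range(int(total_seconds * fs) // win):
        while len(buf) < win:          # gather one full 50 ms window
            buf = np.concatenate((buf, block_queue.get()))
        window, buf = buf[:win], buf[win:]
        energy = np.sqrt(np.mean(window ** 2))   # RMS energy of the window
        if energy >= thresh:
            in_speech, silent_wins = True, 0
            segment.append(window)
        elif in_speech:
            segment.append(window)
            silent_wins += 1
            if silent_wins >= min_silence_wins:  # word ended: trim the tail
                segment_queue.put(np.concatenate(segment[:-silent_wins]))
                segment, silent_wins, in_speech = [], 0, False
    if in_speech and segment:                    # flush a trailing word
        segment_queue.put(np.concatenate(segment))

# A second worker would pull from segment_queue and run the multiprocessing
# prediction code above while recording continues.
worker = threading.Thread(target=endpointer)
worker.start()
with sd.InputStream(samplerate=fs, channels=1, callback=callback):
    worker.join()   # keep recording until 10 s of audio has been consumed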

Hard to say what your model architecture is, but there are models specifically designed for streaming recognition, like Facebook's streaming convnets. You won't be able to implement them easily in Keras, though.
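That said, as a very rough illustration of the pattern such streaming models follow (not the streaming convnets themselves), you could run whatever chunk-level model you do have over a sliding window of the incoming audio instead of waiting for a full utterance. Here featurize is a hypothetical stand-in for your feature extraction, model stands for one of your loaded Keras models, and fs and block_queue are as in the sketch above.

import numpy as np

chunk = int(0.5 * fs)   # 500 ms inference window (an assumption)
hop = int(0.25 * fs)    # slide by 250 ms, i.e. 50% overlap
audio = np.empty(0, dtype='float32')

while True:             # simplified: run until interrupted
    audio = np.concatenate((audio, block_queue.get()))
    while len(audio) >= chunk:
        scores = model.predict(featurize(audio[:chunk]))  # hypothetical featurize
        print(scores.argmax())   # emit a label as soon as the chunk is complete
        audio = audio[hop:]      # slide the window forward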
