Saving microphone audio input when using Azure speech-to-text

Question

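I am currently using Azure speech-to-text in my project. It recognizes speech input directly from the microphone (which is what I want) and saves the text output, but I would also like to save that audio input so I can listen to it later. Before moving to Azure, I used the Python speech recognition library with recognize_google, which let me save the input as a .wav file using get_wav_data(). Is there something similar I can use with Azure? I read the documentation but could only find ways to save audio files for text-to-speech. My temporary solution is to save the audio input myself first and then run Azure STT on that audio file instead of using the microphone input directly, but I am worried this will slow down the process. Any ideas? Thanks in advance!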

Answer 1

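I'm Darren from the Microsoft Speech SDK team. Unfortunately, there is currently no built-in support for doing live recognition from a microphone while also writing the audio to a WAV file. We have heard this customer request before, and we will consider adding the feature in a future version of the Speech SDK.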

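What I think you can do for now (it requires some programming on your part) is use the Speech SDK with a push stream. You can write code that reads audio buffers from the microphone and writes them to a WAV file. At the same time, you can push the same audio buffers into the Speech SDK for recognition. We have Python samples that show how to use the Speech SDK with a push stream. See the function "speech_recognition_with_push_stream" in /console/speech_sample.py. However, I'm not familiar with the Python options for reading real-time audio buffers from a microphone and writing them to a WAV file. Darren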
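
A minimal sketch of the push-stream approach described above, not the SDK sample itself. It assumes PyAudio as the microphone library (the answer does not name one) and the push stream's default input format of 16 kHz, 16-bit, mono PCM. The names tee_audio_to_wav and recognize_and_record are hypothetical helpers, and the fixed recording length is an arbitrary choice for illustration.

```python
import wave
from typing import Callable, Iterable


def tee_audio_to_wav(chunks: Iterable[bytes], wav_path: str,
                     push: Callable[[bytes], None],
                     rate: int = 16000, channels: int = 1,
                     sample_width: int = 2) -> int:
    """Write every PCM chunk to a WAV file and pass the same bytes to `push`.

    Returns the total number of audio bytes handled.
    """
    total = 0
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(rate)
        for chunk in chunks:
            wav_file.writeframes(chunk)  # copy kept for later listening
            push(chunk)                  # same buffer sent on for recognition
            total += len(chunk)
    return total


def recognize_and_record(wav_path: str, key: str, region: str,
                         seconds: int = 5) -> str:
    """Hypothetical wiring: microphone -> WAV file + Azure push stream."""
    # Imported lazily so the helper above works without these packages.
    import pyaudio
    import azure.cognitiveservices.speech as speechsdk

    rate, chunk_frames = 16000, 1024
    # With no explicit format the push stream expects 16 kHz, 16-bit, mono PCM.
    push_stream = speechsdk.audio.PushAudioInputStream()
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speechsdk.SpeechConfig(subscription=key, region=region),
        audio_config=speechsdk.audio.AudioConfig(stream=push_stream))
    future = recognizer.recognize_once_async()  # runs while audio is pushed

    pa = pyaudio.PyAudio()
    mic = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                  input=True, frames_per_buffer=chunk_frames)
    try:
        # Lazily read a fixed number of buffers from the microphone.
        chunks = (mic.read(chunk_frames, exception_on_overflow=False)
                  for _ in range(seconds * rate // chunk_frames))
        tee_audio_to_wav(chunks, wav_path, push_stream.write, rate=rate)
    finally:
        mic.stop_stream()
        mic.close()
        pa.terminate()
        push_stream.close()  # signals end of audio to the recognizer
    return future.get().text
```

Because the recognizer and the WAV file receive identical buffers, the saved recording matches what was sent for recognition, and there is no separate record-then-transcribe step to slow things down.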

Answer 2

If you use Azure's speech_recognizer.recognize_once_async(), you can capture the microphone with pyaudio at the same time. Below is the code I use:

#!/usr/bin/env python3

# enter your output path here:
output_file='/Users/username/micaudio.wav'

import os
import signal
import sys
import wave

import pyaudio
import requests
import azure.cognitiveservices.speech as speechsdk

pa = pyaudio.PyAudio()

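def vocrec_callback(in_data, frame_count, time_info, status):
    # Called by PyAudio for every captured buffer; keep each buffer
    # so it can be written to the WAV file afterwards.
    global voc_data
    voc_data['frames'].append(in_data)
    return (in_data, pyaudio.paContinue)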

def vocrec_start():
    global voc_stream
    global voc_data
    voc_data = {
        'channels':1 if sys.platform == 'darwin' else 2,
        'rate':44100,
        'width':pa.get_sample_size(pyaudio.paInt16),
        'format':pyaudio.paInt16,
        'frames':[]
    }
    voc_stream = pa.open(format=voc_data['format'],
                    channels=voc_data['channels'],
                    rate=voc_data['rate'],
                    input=True,
                    output=False,
                    stream_callback=vocrec_callback)
    
def vocrec_stop():
    # Stop capturing before releasing the stream.
    voc_stream.stop_stream()
    voc_stream.close()

def vocrec_write():
    with wave.open(output_file, 'wb') as wave_file:
        wave_file.setnchannels(voc_data['channels'])
        wave_file.setsampwidth(voc_data['width'])
        wave_file.setframerate(voc_data['rate'])
        wave_file.writeframes(b''.join(voc_data['frames']))

class SIGINT_handler():
    def __init__(self):
        self.SIGINT = False
    def signal_handler(self, signal, frame):
        # On Ctrl+C, stop the recording stream and exit.
        self.SIGINT = True
        print('You pressed Ctrl+C!')
        vocrec_stop()
        sys.exit(0)

def init_azure():
    global speech_recognizer
    #  ——— check azure keys
    my_speech_key = os.getenv('SPEECH_KEY')
    if my_speech_key is None:
        error_and_quit("Error: No Azure Key.")
    my_speech_region = os.getenv('SPEECH_REGION')
    if my_speech_region is None:
        error_and_quit("Error: No Azure Region.")
    _headers = {
        'Ocp-Apim-Subscription-Key': my_speech_key,
        'Content-type': 'application/x-www-form-urlencoded',
        # 'Content-Length': '0',
    }
    _URL = f"https://{my_speech_region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    _response = requests.post(_URL, headers=_headers)
    # A token is issued only when the key and region are valid.
    if _response.status_code != 200:
        error_and_quit("Error: Wrong Azure Key Or Region.")
    #  ——— keys correct. continue
    speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'),
                                           region=os.environ.get('SPEECH_REGION'))
    audio_config_stt = speechsdk.audio.AudioConfig(use_default_microphone=True)
    speech_config.set_property(speechsdk.PropertyId.SpeechServiceResponse_RequestSentenceBoundary, 'true')
    #  ——— disable profanity filter:
    speech_config.set_property(speechsdk.PropertyId.SpeechServiceResponse_ProfanityOption, "2")
    speech_config.speech_recognition_language="en-US"
    speech_recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        audio_config=audio_config_stt)

def error_and_quit(_error):
    print(_error)
    sys.exit(1)

def recognize_speech():
    vocrec_start()
    print("Say something: ")
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    print("Recording done.")
    vocrec_stop()
    vocrec_write()
    print("Recognized:", speech_recognition_result.text)
    sys.exit(0)

handler = SIGINT_handler()
signal.signal(signal.SIGINT, handler.signal_handler)

init_azure()
recognize_speech()
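
To check that the recording was saved with the expected parameters, the WAV header can be read back with the standard wave module. This is a small sketch, and wav_summary is a hypothetical helper name that is not part of the code above.

```python
import wave


def wav_summary(path):
    """Return the channel count, sample width, rate and duration of a WAV file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_width": wav_file.getsampwidth(),  # bytes per sample
            "rate": rate,                             # frames per second
            "seconds": frames / rate,                 # duration of the recording
        }
```

Calling wav_summary(output_file) after the script finishes should report the channel count and the 44100 Hz rate set in vocrec_start.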

Answer 3

Is there any update on this feature? It would be great to have.
