
Azure Speech-To-Text multiple voice recognition

I'm trying to transcribe a conversation audio file into text with Azure's Speech-To-Text. I got it working with the SDK and did another try with the API (following these instructions: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/batch/python/python-client/main.py ), but I also want to split the resulting text by the different voices. Is that possible?

I know this is available in the conversation transcription service (in beta), but since my audio is in Spanish, I can't use it. Is there a configuration to split the result by speaker?

This is the call with the SDK:

import time

import azure.cognitiveservices.speech as speechsdk

all_results = []

def speech_recognize_continuous_from_file(file_to_transcribe):
    """performs continuous speech recognition with input from an audio file"""
    # <SpeechContinuousRecognitionWithFile>
    speech_config = speechsdk.SpeechConfig(subscription=speech_key,
                                           region=service_region,
                                           speech_recognition_language='es-ES')
    audio_config = speechsdk.audio.AudioConfig(filename=file_to_transcribe)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    done = False

    def stop_cb(evt):
        """callback that stops continuous recognition upon receiving an event `evt`"""
        print('CLOSING on {}'.format(evt))
        speech_recognizer.stop_continuous_recognition()
        nonlocal done
        done = True

    # Connect callbacks to the events fired by the speech recognizer  

    speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    # stop continuous recognition on either session stopped or canceled events
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    def handle_final_result(evt):
        all_results.append(evt.result.text)

    speech_recognizer.recognized.connect(handle_final_result)
    # Start continuous speech recognition
    speech_recognizer.start_continuous_recognition()



    while not done:
        time.sleep(.5)
    # </SpeechContinuousRecognitionWithFile>
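
A minimal way to run the function above and collect the full transcript could look like the following sketch (speech_key and service_region are assumed to be defined elsewhere, and the file name is only an example):

if __name__ == "__main__":
    # run continuous recognition on an example file and print the joined transcript
    speech_recognize_continuous_from_file("conversation_es.wav")
    print(" ".join(all_results))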

And this is the call with the API:

from __future__ import print_function
from typing import List

import logging
import sys
import requests
import time
import swagger_client as cris_client


logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format="%(message)s")

SUBSCRIPTION_KEY = subscription_key

HOST_NAME = "westeurope.cris.ai"
PORT = 443

NAME = "Simple transcription"
DESCRIPTION = "Simple transcription description"

LOCALE = "es-ES"
RECORDINGS_BLOB_URI = bobl_url
# ADAPTED_ACOUSTIC_ID = None  # guid of a custom acoustic model
# ADAPTED_LANGUAGE_ID = None  # guid of a custom language model


def transcribe():
    logging.info("Starting transcription client...")

    # configure API key authorization: subscription_key
    configuration = cris_client.Configuration()
    configuration.api_key['Ocp-Apim-Subscription-Key'] = SUBSCRIPTION_KEY

    # create the client object and authenticate
    client = cris_client.ApiClient(configuration)

    # create an instance of the transcription api class
    transcription_api = cris_client.CustomSpeechTranscriptionsApi(api_client=client)

    # get all transcriptions for the subscription
    transcriptions: List[cris_client.Transcription] = transcription_api.get_transcriptions()

    logging.info("Deleting all existing completed transcriptions.")

    # delete all pre-existing completed transcriptions
    # if transcriptions are still running or not started, they will not be deleted
    for transcription in transcriptions:
        transcription_api.delete_transcription(transcription.id)

    logging.info("Creating transcriptions.")

    # transcription definition using custom models
#     transcription_definition = cris_client.TranscriptionDefinition(
#         name=NAME, description=DESCRIPTION, locale=LOCALE, recordings_url=RECORDINGS_BLOB_URI,
#         models=[cris_client.ModelIdentity(ADAPTED_ACOUSTIC_ID), cris_client.ModelIdentity(ADAPTED_LANGUAGE_ID)]
#     )

    # the block above (commented out) uses custom models; the following uses base models for transcription
    transcription_definition = cris_client.TranscriptionDefinition(
        name=NAME, description=DESCRIPTION, locale=LOCALE, recordings_url=RECORDINGS_BLOB_URI
    )

    data, status, headers = transcription_api.create_transcription_with_http_info(transcription_definition)

    # extract transcription location from the headers
    transcription_location: str = headers["location"]

    # get the transcription Id from the location URI
    created_transcriptions = list()
    created_transcriptions.append(transcription_location.split('/')[-1])

    logging.info("Checking status.")

    completed, running, not_started = 0, 0, 0

    while completed < 1:
        # get all transcriptions for the user
        transcriptions: List[cris_client.Transcription] = transcription_api.get_transcriptions()

        # for each transcription in the list we check the status
        for transcription in transcriptions:
            if transcription.status == "Failed" or transcription.status == "Succeeded":
                # we check to see if it was one of the transcriptions we created from this client
                if transcription.id not in created_transcriptions:
                    continue

                completed += 1

                if transcription.status == "Succeeded":
                    results_uri = transcription.results_urls["channel_0"]
                    results = requests.get(results_uri)
                    logging.info("Transcription succeeded. Results: ")
                    logging.info(results.content.decode("utf-8"))
            elif transcription.status == "Running":
                running += 1
            elif transcription.status == "NotStarted":
                not_started += 1

        logging.info(f"Transcriptions status: {completed} completed, {running} running, {not_started} not started yet")
        # wait for 5 seconds
        time.sleep(5)

    input("Press any key...")


def main():
    transcribe()


if __name__ == "__main__":
    main()


I also want to split the result text by the different voices.

The transcript you receive does not contain any notion of speaker. Here you are just calling an endpoint that does transcription; there is no speaker recognition feature inside.

Two things:

  • If your audio has a separate channel for each speaker, then you will get your result per speaker (see the transcript's results_urls channels, and the sketch after this list)
  • If not, you may use the Speaker Recognition API (doc here) to do this identification, but:
  • it needs some training first
  • you don't have the offsets in the reply, so it will be complicated to map them to your transcript result
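
For the first option, a minimal sketch of reading the per-channel results from the batch transcription above (assuming a stereo recording, so that results_urls contains one entry per channel, e.g. channel_0 and channel_1):

# hedged sketch: with one speaker per audio channel, the batch transcription
# exposes one result URL per channel in transcription.results_urls
for channel_name, channel_uri in transcription.results_urls.items():
    channel_results = requests.get(channel_uri)
    print(f"--- {channel_name} ---")
    print(channel_results.content.decode("utf-8"))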

As you mentioned, the Speech SDK's ConversationTranscriber API (doc here) is currently limited to the en-US and zh-CN languages.

Contrary to the previous answer, I did get a result where speakers are recognized, without any further training or other difficulties. I followed this GitHub issue:

https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/286

Which led me to the following change:

transcription_definition = cris_client.TranscriptionDefinition(
    name=NAME, description=DESCRIPTION, locale=LOCALE, recordings_url=RECORDINGS_BLOB_URI,
    properties={"AddDiarization": "True"}
)

Which gives the desired result.
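
To actually split the text by speaker, a minimal sketch of reading the diarized result could look like the following (this assumes the result JSON groups phrases under AudioFileResults / SegmentResults and tags each segment with a SpeakerId when diarization is enabled; the exact field names may differ between API versions):

import requests

# hedged sketch: group the diarized transcript by speaker
# results_uri comes from transcription.results_urls, as in the question's code
diarized = requests.get(results_uri).json()
for audio_file in diarized.get("AudioFileResults", []):
    for segment in audio_file.get("SegmentResults", []):
        speaker = segment.get("SpeakerId", "unknown")
        text = segment["NBest"][0]["Display"] if segment.get("NBest") else ""
        print(f"Speaker {speaker}: {text}")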
