Azure Speech-To-Text multiple voice recognition
I'm trying to transcribe a conversation audio file into text with Azure's Speech-To-Text. I got it working with the SDK and made another attempt with the API (following these instructions: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/batch/python/python-client/main.py ), but I also want to split the resulting text by the different voices. Is that possible?

I know the conversation service is available in beta, but since my audio files are in Spanish, I can't use it. Is there a configuration to split the result by speaker?
This is the call with the SDK:
import time
import azure.cognitiveservices.speech as speechsdk

all_results = []

def speech_recognize_continuous_from_file(file_to_transcribe):
    """Performs continuous speech recognition with input from an audio file."""
    # <SpeechContinuousRecognitionWithFile>
    speech_config = speechsdk.SpeechConfig(subscription=speech_key,
                                           region=service_region,
                                           speech_recognition_language='es-ES')
    audio_config = speechsdk.audio.AudioConfig(filename=file_to_transcribe)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    done = False

    def stop_cb(evt):
        """Callback that stops continuous recognition upon receiving an event `evt`."""
        print('CLOSING on {}'.format(evt))
        speech_recognizer.stop_continuous_recognition()
        nonlocal done
        done = True

    # Connect callbacks to the events fired by the speech recognizer
    speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    # Stop continuous recognition on either session stopped or canceled events
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    def handle_final_result(evt):
        all_results.append(evt.result.text)

    speech_recognizer.recognized.connect(handle_final_result)

    # Start continuous speech recognition
    speech_recognizer.start_continuous_recognition()
    while not done:
        time.sleep(.5)
    # </SpeechContinuousRecognitionWithFile>
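For completeness, a minimal way to invoke this might be (assuming speech_key and service_region are defined elsewhere; the file name here is a hypothetical placeholder):

speech_recognize_continuous_from_file('conversation_es.wav')  # hypothetical file name
print('\n'.join(all_results))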
And this with the API:
from __future__ import print_function
from typing import List
import logging
import sys
import requests
import time
import swagger_client as cris_client

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format="%(message)s")

SUBSCRIPTION_KEY = subscription_key
HOST_NAME = "westeurope.cris.ai"
PORT = 443
NAME = "Simple transcription"
DESCRIPTION = "Simple transcription description"
LOCALE = "es-ES"
RECORDINGS_BLOB_URI = blob_url
# ADAPTED_ACOUSTIC_ID = None  # guid of a custom acoustic model
# ADAPTED_LANGUAGE_ID = None  # guid of a custom language model

def transcribe():
    logging.info("Starting transcription client...")

    # Configure API key authorization: subscription_key
    configuration = cris_client.Configuration()
    configuration.api_key['Ocp-Apim-Subscription-Key'] = SUBSCRIPTION_KEY

    # Create the client object and authenticate
    client = cris_client.ApiClient(configuration)

    # Create an instance of the transcription api class
    transcription_api = cris_client.CustomSpeechTranscriptionsApi(api_client=client)

    # Get all transcriptions for the subscription
    transcriptions: List[cris_client.Transcription] = transcription_api.get_transcriptions()

    logging.info("Deleting all existing completed transcriptions.")

    # Delete all pre-existing completed transcriptions
    # (transcriptions that are still running or not started will not be deleted)
    for transcription in transcriptions:
        transcription_api.delete_transcription(transcription.id)

    logging.info("Creating transcriptions.")

    # Transcription definition using custom models
    # transcription_definition = cris_client.TranscriptionDefinition(
    #     name=NAME, description=DESCRIPTION, locale=LOCALE, recordings_url=RECORDINGS_BLOB_URI,
    #     models=[cris_client.ModelIdentity(ADAPTED_ACOUSTIC_ID), cris_client.ModelIdentity(ADAPTED_LANGUAGE_ID)]
    # )

    # Comment out the previous statement and uncomment the following to use base models for transcription
    transcription_definition = cris_client.TranscriptionDefinition(
        name=NAME, description=DESCRIPTION, locale=LOCALE, recordings_url=RECORDINGS_BLOB_URI
    )

    data, status, headers = transcription_api.create_transcription_with_http_info(transcription_definition)

    # Extract the transcription location from the headers
    transcription_location: str = headers["location"]

    # Get the transcription id from the location URI
    created_transcriptions = list()
    created_transcriptions.append(transcription_location.split('/')[-1])

    logging.info("Checking status.")

    completed, running, not_started = 0, 0, 0
    while completed < 1:
        # Get all transcriptions for the user
        transcriptions: List[cris_client.Transcription] = transcription_api.get_transcriptions()

        # For each transcription in the list, check the status
        for transcription in transcriptions:
            if transcription.status == "Failed" or transcription.status == "Succeeded":
                # Check whether it is one of the transcriptions we created from this client
                if transcription.id not in created_transcriptions:
                    continue

                completed += 1

                if transcription.status == "Succeeded":
                    results_uri = transcription.results_urls["channel_0"]
                    results = requests.get(results_uri)
                    logging.info("Transcription succeeded. Results: ")
                    logging.info(results.content.decode("utf-8"))
            elif transcription.status == "Running":
                running += 1
            elif transcription.status == "NotStarted":
                not_started += 1

        logging.info(f"Transcriptions status: {completed} completed, {running} running, {not_started} not started yet")

        # Wait for 5 seconds before polling again
        time.sleep(5)

    input("Press any key...")

def main():
    transcribe()

if __name__ == "__main__":
    main()
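One aside on the polling loop above: results_urls appears to be a plain mapping keyed per audio channel (the code reads "channel_0"), so if each speaker were recorded on a separate channel, each channel's transcript could hypothetically be fetched in turn:

# Hypothetical sketch: iterate over all channels instead of only "channel_0"
for channel_name, results_uri in transcription.results_urls.items():
    channel_results = requests.get(results_uri)
    logging.info("Results for %s:", channel_name)
    logging.info(channel_results.content.decode("utf-8"))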
"I also want to split the result text by the different voices."

The transcript received does not contain any notion of speaker. Here you are just calling an endpoint that does transcription; there is no speaker recognition feature inside.
Two things:

- If the speakers are recorded on separate audio channels, you can use the fact that the transcription is returned per channel (see the channels in results_urls).
- You might use the Speaker Recognition API (doc here) to do this identification.

As you mentioned, the Speech SDK's ConversationTranscriber API (doc here) is currently limited to en-US and zh-CN languages; a sketch of that API follows below.
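For reference, if the audio were in one of the supported languages, conversation transcription with the Python SDK might look roughly like this (a minimal sketch, assuming a recent SDK version where speechsdk.transcription.ConversationTranscriber exists; the exact API surface has changed across releases, and the file name is a placeholder):

import time
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region,
                                       speech_recognition_language='en-US')  # es-ES is not supported here
audio_config = speechsdk.audio.AudioConfig(filename='conversation.wav')  # hypothetical file name
transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config,
                                                              audio_config=audio_config)
done = False

def on_transcribed(evt):
    # Each result carries a speaker id, so the text can be grouped per voice
    print('{}: {}'.format(evt.result.speaker_id, evt.result.text))

def on_stopped(evt):
    global done
    done = True

transcriber.transcribed.connect(on_transcribed)
transcriber.session_stopped.connect(on_stopped)
transcriber.canceled.connect(on_stopped)

transcriber.start_transcribing_async().get()
while not done:
    time.sleep(.5)
transcriber.stop_transcribing_async().get()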
Contrary to the previous answer, I did get a result where speakers are recognized, without any further training or other difficulties. I followed this GitHub issue:

https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/286

which led me to the following change:
transcription_definition = cris_client.TranscriptionDefinition(
    name=NAME, description=DESCRIPTION, locale=LOCALE, recordings_url=RECORDINGS_BLOB_URI,
    properties={"AddDiarization": "True"}
)
This gives the desired result.
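With diarization enabled, each recognized segment in the result JSON is tagged with a speaker id, so the transcript can be split by voice. A minimal parsing sketch, assuming the payload exposes SpeakerId and NBest per entry under SegmentResults (field names vary by batch API version, so check your actual payload):

import json
from collections import defaultdict

import requests

# results_uri obtained from results_urls as in the polling loop above
results = requests.get(results_uri)
payload = json.loads(results.content.decode("utf-8"))

by_speaker = defaultdict(list)
# Hypothetical structure: one entry per recognized segment, tagged with a speaker id
for segment in payload["AudioFileResults"][0]["SegmentResults"]:
    speaker = segment.get("SpeakerId", "unknown")
    text = segment["NBest"][0]["Display"]  # best hypothesis for this segment
    by_speaker[speaker].append(text)

for speaker, lines in by_speaker.items():
    print("Speaker {}: {}".format(speaker, " ".join(lines)))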