
How to identify the speaker using the Python SDK with the Azure Cognitive Services speech translation API?

I am trying to use a modified version of the event-based synthesis code sample from the Azure documentation for speech-to-speech translation. During the process, I also want to identify the speakers (speaker1, speaker2), but I don't see a function in the Python SDK that would help me identify speakers as part of speech-to-text translation. Can someone suggest a way to identify speakers during speech-to-text translation? Below is the code snippet:

import time
import azure.cognitiveservices.speech as speechsdk

# speech_key, service_region, from_language, to_language and filename
# are assumed to be defined earlier in the script.

def translate_speech_to_text():

    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)
    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)
    translation_config.voice_name = "en-GB-Susan"

    # Word-level timestamps only appear in the detailed output format.
    translation_config.request_word_level_timestamps()
    translation_config.output_format = speechsdk.OutputFormat.Detailed

    audio_input = speechsdk.AudioConfig(filename=filename)
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config, audio_config=audio_input)

    done = False

    def stop_cb(evt):
        """Callback that stops continuous recognition upon receiving an event `evt`."""
        recognizer.stop_continuous_recognition()
        nonlocal done
        done = True

    all_results = []

    def handle_final_result(evt):
        # Collect the raw JSON of each final result.
        all_results.append(evt.result.json)

    recognizer.recognized.connect(handle_final_result)
    # Connect callbacks to the events fired by the recognizer
    recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
    recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    # Stop on session end or cancellation (e.g. end of file, or an error)
    recognizer.session_stopped.connect(stop_cb)
    recognizer.canceled.connect(stop_cb)

    def synthesis_callback(evt):
        print('Audio: {}'.format(len(evt.result.audio)))
        print('Reason: {}'.format(evt.result.reason))
        # Append, so audio from successive synthesizing events is not overwritten.
        with open('out.wav', 'ab') as wavfile:
            wavfile.write(evt.result.audio)

    recognizer.synthesizing.connect(synthesis_callback)
    recognizer.start_continuous_recognition()

    while not done:
        time.sleep(.5)

    print("Printing all results:")
    print(all_results)

translate_speech_to_text()

If you want to identify speakers, you should use the Speaker Recognition feature of the Speech service.

I recommend using the REST API.

Text Independent - Identify Single Speaker

The Speech service has SDKs for C#, C++ and JavaScript, plus a REST API, that can perform Speaker Recognition. (I searched the Python SDK, but I haven't found a method that can be used directly for speaker identification.)

Suggestion

1. It is recommended to read the Speaker Recognition documentation carefully to understand how to use this service.

2. It is recommended to use the `requests` library to send HTTP POST requests.
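To make suggestion 2 concrete, here is a minimal sketch of identifying a single speaker with `requests`. The endpoint shape, header names and response fields are based on the v2.0 text-independent Speaker Recognition REST API and may have changed, so verify them against the current Azure documentation; the region, key and profile IDs below are placeholders, and speaker profiles must already have been created and enrolled.

```python
def build_identify_request(region, key, profile_ids):
    """Build the URL and headers for an identifySingleSpeaker call.

    The URL layout follows the v2.0 text-independent identification
    API; double-check it against the current Azure docs.
    """
    url = ("https://{0}.api.cognitive.microsoft.com"
           "/speaker/identification/v2.0/text-independent/profiles"
           "/identifySingleSpeaker?profileIds={1}"
           .format(region, ",".join(profile_ids)))
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "audio/wav",
    }
    return url, headers

def identify_speaker(wav_path, region, key, profile_ids):
    """POST a WAV file and return the service's JSON response,
    which should name the best-matching profile and a score."""
    import requests  # third-party: pip install requests

    url, headers = build_identify_request(region, key, profile_ids)
    with open(wav_path, "rb") as audio:
        resp = requests.post(url, headers=headers, data=audio)
    resp.raise_for_status()
    return resp.json()

# Example (placeholders -- substitute your own values):
# result = identify_speaker("out.wav", "westus", "<speech-key>",
#                           ["<profile-id-1>", "<profile-id-2>"])
```

To distinguish speaker1 and speaker2 in your translation scenario, you would enroll one profile per speaker first, then send each audio segment to this endpoint and map the returned profile ID back to a speaker label.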
