
Getting timestamps for each transcribed word through Google Cloud Speech API?

I'm hoping to transcribe an audio file via the Google Cloud Speech API. This simple script takes a WAV file as input and transcribes it with pretty high accuracy.

import os
import sys
import speech_recognition as sr

# open() does not expand "~", so resolve the home directory explicitly
base_dir = os.path.expanduser("~/Documents/speech-to-text")
with open(os.path.join(base_dir, "speech2textgoogleapi.json")) as f:
  GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()
name = sys.argv[1]  # path to the wav file
r = sr.Recognizer()
all_text = []
with sr.AudioFile(name) as source:
  audio = r.record(source)  # read the whole file into memory
  # Transcribe audio file
  text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
all_text.append(text)
with open(os.path.join(base_dir, "transcript.txt"), "w") as f:
  f.write(str(all_text))

How can I use the API to extract other meaningful information from the speech audio? Specifically, I'm looking to get a timestamp for each word, but other info (e.g. pitch, amplitude, speaker recognition) would be extremely welcome. Thanks in advance!

There is actually an example of how to do this in the Speech API, in Using Time offsets (TimeStamps):

Time offset (timestamp) values can be included in the response text for your recognize request. Time offset values show the beginning and end of each spoken word that is recognized in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.
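To make the 100ms granularity concrete, here is a minimal sketch that turns an offset given as whole seconds plus nanoseconds (the two fields the sample further down reads) into a readable timestamp. The helper name format_offset is hypothetical and not part of the Speech API.

```python
def format_offset(seconds: int, nanos: int = 0) -> str:
    """Format a (seconds, nanos) time offset as HH:MM:SS.t at 100 ms resolution.

    Hypothetical helper for illustration only; not part of the Speech API.
    """
    total_ns = seconds * 1_000_000_000 + nanos
    tenths = total_ns // 100_000_000  # whole 100 ms steps since the audio began
    hours, rem = divmod(tenths, 36_000)
    minutes, rem = divmod(rem, 600)
    secs, tenth = divmod(rem, 10)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{tenth}"


# A word that starts 75.3 seconds into the audio
print(format_offset(75, 300_000_000))  # prints 00:01:15.3
```

Working in integer nanoseconds avoids floating-point rounding when the offset is split into hours, minutes and seconds.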

Time offsets are especially useful for analyzing longer audio files, where you may need to search for a particular word in the recognized text and locate it (seek) in the original audio. Time offsets are supported for all our recognition methods: recognize, streamingrecognize, and longrunningrecognize. See below for an example of longrunningrecognize.....
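As a rough illustration of the search-and-seek use case described above: once the word offsets are available as plain (word, start, end) tuples in seconds, locating a word is a simple scan. The function name find_word and the tuple layout are assumptions made for this sketch, not something the API provides.

```python
import string


def find_word(word_timings, target):
    """Return (start, end) pairs, in seconds, for every occurrence of target.

    word_timings is assumed to be a list of (word, start, end) tuples.
    Matching ignores case and surrounding punctuation.
    """
    wanted = target.lower()
    hits = []
    for word, start, end in word_timings:
        normalized = word.strip(string.punctuation).lower()
        if normalized == wanted:
            hits.append((start, end))
    return hits


# Made-up timings, purely for demonstration
timings = [
    ("Hello", 0.0, 0.4),
    ("world,", 0.4, 0.9),
    ("hello", 1.2, 1.6),
    ("again", 1.6, 2.1),
]
print(find_word(timings, "hello"))  # prints [(0.0, 0.4), (1.2, 1.6)]
```

Each returned start value can then be passed to whatever audio player or editor you use as the seek position.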

This is the code sample for Python:

def transcribe_gcs_with_word_time_offsets(gcs_uri):
    """Transcribe the given audio file asynchronously and output the word time
    offsets."""
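    # Note: the enums and types modules imported below exist only in
    # google-cloud-speech releases before 2.0.
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types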
    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US',
        enable_word_time_offsets=True)

    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    response = operation.result(timeout=90)

    # Use a separate name for the loop variable so the response is not shadowed
    for result in response.results:
        alternative = result.alternatives[0]
        print('Transcript: {}'.format(alternative.transcript))
        print('Confidence: {}'.format(alternative.confidence))

        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            print('Word: {}, start_time: {}, end_time: {}'.format(
                word,
                start_time.seconds + start_time.nanos * 1e-9,
                end_time.seconds + end_time.nanos * 1e-9))
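If you would rather keep the timings than print them, the sketch below gathers them into a list of dictionaries. It only relies on the attribute names used in the sample above (results, alternatives, words, word, start_time, end_time). The helper names to_seconds and collect_word_timings are hypothetical. The timedelta branch rests on the assumption that google-cloud-speech 2.x returns offsets as datetime.timedelta objects rather than seconds/nanos messages, so check that against the version you have installed.

```python
from datetime import timedelta
from types import SimpleNamespace


def to_seconds(offset):
    """Convert a time offset to float seconds.

    Accepts a datetime.timedelta (assumed for newer client releases) or an
    object with .seconds and .nanos, as in the sample above.
    """
    if isinstance(offset, timedelta):
        return offset.total_seconds()
    return offset.seconds + offset.nanos * 1e-9


def collect_word_timings(response):
    """Return one dict per word from the top alternative of each result."""
    rows = []
    for result in response.results:
        if not result.alternatives:
            continue  # nothing was recognized for this portion of the audio
        for info in result.alternatives[0].words:
            rows.append({
                "word": info.word,
                "start": to_seconds(info.start_time),
                "end": to_seconds(info.end_time),
            })
    return rows


# Stand-in object shaped like a response; a real one comes from operation.result()
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(words=[
            SimpleNamespace(word="hello",
                            start_time=SimpleNamespace(seconds=0, nanos=0),
                            end_time=timedelta(seconds=0.5)),
        ]),
    ]),
])
for row in collect_word_timings(fake_response):
    print(row)
```

The resulting list can be written out with json.dump or loaded into a DataFrame for further analysis.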

Hope this helps.
