Google cloud speech to text in Python: Save translation and time to JSON

I am using the standard solution to do speech-to-text processing with timestamps (see code below). I know from this post that it is possible to add arguments to the gcloud command-line tool, like --format=json.

General question: How do I specify those in google.cloud.speech? I can't seem to find any documentation on Google's site on how to do this with Python.

Specific question: My aim right now is to write out a dictionary-style JSON file that contains entries for all words, plus the start and end time of each word. I realise that I could write a hacky solution, but if an option already exists, that would be preferable.

Code:

import argparse
import io


def transcribe_file_with_word_time_offsets(speech_file, language):
    """Transcribe the given audio file synchronously and output the word time
    offsets."""
    print("Start")

    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types

    print("checking credentials")

    # `credentials` is not defined in this snippet; it must be created beforehand.
    client = speech.SpeechClient(credentials=credentials)

    print("Checked")
    with io.open(speech_file, 'rb') as audio_file:
        content = audio_file.read()


    print("audio file read")

    audio = types.RecognitionAudio(content=content)

    print("config start")
    config = types.RecognitionConfig(
            encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
            language_code=language,
            enable_word_time_offsets=True)

    print("Recognizing:")
    response = client.recognize(config, audio)
    print("Recognized")

    for result in response.results:
        alternative = result.alternatives[0]
        print('Transcript: {}'.format(alternative.transcript))

        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            print('Word: {}, start_time: {}, end_time: {}'.format(
                word,
                start_time.seconds + start_time.nanos * 1e-9,
                end_time.seconds + end_time.nanos * 1e-9))

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument(dest='path', help='Audio file to be recognized')
    args = parser.parse_args()
    transcribe_file_with_word_time_offsets(args.path, 'en-US')

And here is the hacky solution:

...
    transcript_dict = {'Word':[], 'start_time': [], 'end_time':[]}

    for result in response.results:
        alternative = result.alternatives[0]
        print('Transcript: {}'.format(alternative.transcript))

        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            transcript_dict['Word'].append(word)
            transcript_dict['start_time'].append(
                start_time.seconds + start_time.nanos * 1e-9)
            transcript_dict['end_time'].append(
                end_time.seconds + end_time.nanos * 1e-9)

    print(transcript_dict)
...
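
A list with one record per word is often easier to consume than three parallel lists. The sketch below is one way to arrange that and is not an option built into the library: `to_seconds`, `words_to_records`, and `save_word_timings` are hypothetical helper names, and the `SimpleNamespace` stand-ins only imitate the `word`, `start_time`, and `end_time` attributes used in the code above. It assumes offsets arrive either as objects with `seconds` and `nanos` (as in the question) or as `datetime.timedelta` (which newer, proto-plus based client versions appear to return).

```python
import datetime
import json
from types import SimpleNamespace


def to_seconds(offset):
    """Convert a time offset to float seconds.

    Assumption: the offset is either a datetime.timedelta or an object
    exposing `seconds` and `nanos`, as in the question's code.
    """
    if isinstance(offset, datetime.timedelta):
        return offset.total_seconds()
    return offset.seconds + offset.nanos * 1e-9


def words_to_records(words):
    """Build one dict per word with its start and end time in seconds."""
    return [
        {
            "word": w.word,
            "start_time": to_seconds(w.start_time),
            "end_time": to_seconds(w.end_time),
        }
        for w in words
    ]


def save_word_timings(words, path):
    """Write the per-word records to `path` as JSON and return them."""
    records = words_to_records(words)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records


# Stand-in data (hypothetical); real input would be `alternative.words`.
sample_words = [
    SimpleNamespace(
        word="how",
        start_time=SimpleNamespace(seconds=0, nanos=0),
        end_time=SimpleNamespace(seconds=0, nanos=300000000),
    ),
    SimpleNamespace(
        word="old",
        start_time=datetime.timedelta(seconds=0.3),
        end_time=datetime.timedelta(seconds=0.6),
    ),
]
records = words_to_records(sample_words)
```

Inside the loop over `response.results`, a call such as `save_word_timings(alternative.words, "words.json")` would then write the file; the file name is only an example.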

The solutions using protobuf in the linked question didn't work for me (November 2020), but they led me to this comment, which worked for me with the Speech API:

speech.types.RecognizeResponse.to_json(response)

# alternatively
type(response).to_json(response)

Example

from google.cloud import speech_v1 as speech


def transcribe_gcs(gcs_uri):
    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        language_code="en-US",
    )

    return client.recognize(config=config, audio=audio)


sample_audio_uri = "gs://cloud-samples-tests/speech/brooklyn.flac"

response = transcribe_gcs(sample_audio_uri)
response_json = type(response).to_json(response)


print(response_json)

Output:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.98314303,
          "words": []
        }
      ],
      "channelTag": 0
    }
  ]
}
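
The `words` list is empty in this output because the example config does not set `enable_word_time_offsets=True`, unlike the code in the question. With that flag set, each alternative should carry per-word timing. The sketch below parses such a JSON string with the standard `json` module. The sample string is made up for illustration, `duration_to_seconds` and `extract_word_timings` are hypothetical helper names, and the code assumes that `to_json` emits camelCase keys (`startTime`, `endTime`) and encodes durations as strings such as `"0.300s"`, following the protobuf JSON mapping; check this against your actual output.

```python
import json

# Made-up sample shaped like the output above, with word timings filled in.
# Assumption: durations are serialized as strings ending in "s".
sample_json = """
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old",
          "words": [
            {"word": "how", "startTime": "0s", "endTime": "0.300s"},
            {"word": "old", "startTime": "0.300s", "endTime": "0.600s"}
          ]
        }
      ]
    }
  ]
}
"""


def duration_to_seconds(value):
    """Turn a duration string such as '0.300s' into float seconds."""
    return float(value.rstrip("s"))


def extract_word_timings(response_json):
    """Collect word, start and end time from the top alternative of each result."""
    data = json.loads(response_json)
    timings = []
    for result in data.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        for w in alternatives[0].get("words", []):
            timings.append(
                {
                    "word": w["word"],
                    "start_time": duration_to_seconds(w.get("startTime", "0s")),
                    "end_time": duration_to_seconds(w.get("endTime", "0s")),
                }
            )
    return timings


word_timings = extract_word_timings(sample_json)
```

The resulting list can be written out with `json.dump`, which gives the dictionary-style file the question asks for without touching protobuf objects directly.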

You could try something like:

from google.cloud import speech_v1p1beta1 as speech
import proto

client = speech.SpeechClient()

audio = speech.RecognitionAudio(...)
config = speech.RecognitionConfig(...)
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result()

response_dict = proto.Message.to_dict(response)
