
Save AWS Transcribe JSON Output

I'm trying to use an AWS Lambda function to send an audio file to AWS Transcribe for transcription. I then want to bring the resulting transcript back into the original AWS Lambda function. I know it would likely be easier to send it to an S3 bucket instead, but that's not part of the brief.

Everything up to this point works, but I need to be able to access the contents of the transcription for the next step (passing the text into AWS Comprehend). The AWS Transcribe job is created successfully, and when I click to download the transcript, a JSON file downloads with the following contents:

{"jobName":"MY_FILE_NAME","accountId":"MY_ID","results":{"transcripts":[{"transcript":"MY TRANSCRIPT"}],"items":[{"start_time":"0.04","end_time":"0.35","alternatives":[{"confidence":"1.0","content":"better"}],"type":"pronunciation"},{"start_time":"0.35","end_time":"0.71","alternatives":[{"confidence":"1.0","content":"three"}],"type":"pronunciation"},{"start_time":"0.71","end_time":"1.09","alternatives":[{"confidence":"1.0","content":"hours"}],"type":"pronunciation"},{"start_time":"1.09","end_time":"1.29","alternatives":[{"confidence":"1.0","content":"too"}],"type":"pronunciation"},{"start_time":"1.29","end_time":"1.58","alternatives":[{"confidence":"1.0","content":"soon"}],"type":"pronunciation"},{"start_time":"1.58","end_time":"1.73","alternatives":[{"confidence":"0.9991","content":"than"}],"type":"pronunciation"},{"start_time":"1.73","end_time":"1.79","alternatives":[{"confidence":"0.9421","content":"a"}],"type":"pronunciation"},{"start_time":"1.79","end_time":"2.16","alternatives":[{"confidence":"0.9312","content":"minute"}],"type":"pronunciation"},{"start_time":"2.16","end_time":"2.37","alternatives":[{"confidence":"0.925","content":"too"}],"type":"pronunciation"},{"start_time":"2.37","end_time":"2.76","alternatives":[{"confidence":"0.9973","content":"late"}],"type":"pronunciation"},{"alternatives":[{"confidence":"0.0","content":"."}],"type":"punctuation"}]},"status":"COMPLETED"}

This file would be great, as I'd be able to get the transcript from the JSON file. However, even when I get the results of my transcription (which are the exact same as the URL to download the JSON file), it doesn't appear that I'm able to read them in JSON format. Here is my code. I've included the process of transcribing, but s3bucket and s3object come from earlier parts of the code.

#CREATE TRANSCRIBE JOB
jobName = s3object + '-' + str(uuid.uuid4())

client = boto3.client('transcribe')

response = client.start_transcription_job(
    TranscriptionJobName=jobName,
    LanguageCode='en-US',
    MediaFormat='mp3',
    Media={
        'MediaFileUri': "s3://" + s3bucket + "/" + str(s3object)
    },
)

#TESTING
print(response['TranscriptionJob']['TranscriptionJobName'])
time.sleep(50)
print(response)

# GET TRANSCRIBE FILE
while True:
    result = client.get_transcription_job(TranscriptionJobName=jobName)
    if result['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    time.sleep(15)
if result['TranscriptionJob']['TranscriptionJobStatus'] == "COMPLETED":
    data = result['TranscriptionJob']['Transcript']['TranscriptFileUri']
    data = json.loads(data)
    print(data)

When I print data, I get the following (which is also the URL to download the file)

https://s3.eu-west-2.amazonaws.com/aws-transcribe-eu-west-2-prod/21040557774/MY_FILE_NAME/8127c3b7-dcdf-4f64-8331-e61c7219c942/asrOutput.json?X-Amz-Security-Token=LONG_SECURITY_TOKEN&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210326T193652Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=MY_CREDENTIAL%2Feu-west-2%2Fs3%2Faws4_request&X-Amz-Signature=AMZ_SIGNATURE

As this file downloads as a JSON file, I thought I could simply import it to my code via JSON. That's where data = json.loads(data) comes from.

However, when I run this line, I get: -

[ERROR] JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 51, in lambda_handler
data = json.loads(data)

I know pandas is an option, but I'm using the AWS CLI and spent about two hours following different tutorials on getting pandas working, each of which broke halfway through. If at all avoidable, I'd like to open the file with nothing more than a simple import, but if there's no other way then I understand.

Thanks!

You cannot use data = json.loads(data) here, because data is a URL, not a JSON-formatted string.

Try downloading the file first (for instance with the requests library):

import requests
data = requests.get(data).json()
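
To see the difference concretely, here is a small sketch using only the standard library. The URL is a made-up placeholder rather than a real presigned link, and try_parse is a hypothetical helper added for illustration. json.loads parses whatever text it is handed, so a URL fails at the very first character, while the body of the file behind the URL parses fine.

```python
import json

# Hypothetical placeholder standing in for the TranscriptFileUri value.
url = "https://example.com/asrOutput.json?X-Amz-Expires=900"

# The kind of text the file behind the URL contains (trimmed from the question).
body = '{"results": {"transcripts": [{"transcript": "MY TRANSCRIPT"}]}, "status": "COMPLETED"}'


def try_parse(text):
    """Return the parsed JSON, or the decode error message if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        return str(err)


url_result = try_parse(url)    # the URL itself is not JSON
body_result = try_parse(body)  # the downloaded text is

print(url_result)
print(body_result["results"]["transcripts"][0]["transcript"])
```

The first print shows the same "Expecting value" message as the traceback in the question; the second prints the transcript text.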

Thanks to @Antonin Riche for helping me understand the context of what I was trying to do.

As he explained, I was trying to parse a URL, not the contents of a JSON file. As he stated, requests can download the file, but it isn't available by default on AWS Lambda, so I used urllib3 instead.

import json
import urllib3

#TESTING
print(response['TranscriptionJob']['TranscriptionJobName'])
time.sleep(50)
print(response)

# GET TRANSCRIBE FILE
while True:
    result = client.get_transcription_job(TranscriptionJobName=jobName)
    if result['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    time.sleep(15)
if result['TranscriptionJob']['TranscriptionJobStatus'] == "COMPLETED":
    # TranscriptFileUri is only a URL, so fetch its contents before parsing them
    transcript_uri = result['TranscriptionJob']['Transcript']['TranscriptFileUri']
    http = urllib3.PoolManager()
    http_response = http.request('GET', transcript_uri)
    data = json.loads(http_response.data.decode('utf-8'))
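
As a side note, the polling loop above has no upper bound and treats a FAILED job the same as a finished one until the final check. One possible way to tighten that up is sketched below. The function name wait_for_transcription and its parameters are assumptions made for illustration, not part of boto3; it only relies on the get_transcription_job call already used above.

```python
import time


def wait_for_transcription(client, job_name, poll_seconds=15,
                           timeout_seconds=600, sleep=time.sleep):
    """Poll get_transcription_job until the job completes, fails, or times out.

    `client` is assumed to be a boto3 Transcribe client (or any object with a
    compatible get_transcription_job method). Returns the final job dict.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = client.get_transcription_job(
            TranscriptionJobName=job_name)["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job
        if status == "FAILED":
            reason = job.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Transcription job {job_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {job_name} still {status} after {timeout_seconds}s")
        sleep(poll_seconds)
```

In the Lambda this could be called as job = wait_for_transcription(client, jobName), after which job['Transcript']['TranscriptFileUri'] holds the URL to fetch. The timeout should stay below the timeout configured for the Lambda function itself.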

Using urllib3 this way lets me fetch the file over HTTP and parse its contents as JSON, as Antonin suggested. I can then work with the parsed JSON like normal and access my transcript. In this case, that was:

print(data['results']['transcripts'][0]['transcript'])
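
The same parsed structure also carries per-word details under results.items. The sketch below runs against a trimmed copy of the JSON from the question; the helper names full_transcript and low_confidence_words and the 0.95 threshold are illustrative assumptions. Note that the confidence values are stored as strings and that punctuation items have no start_time or end_time.

```python
import json

# A trimmed copy of the transcript file shown in the question.
sample = json.loads("""
{"jobName": "MY_FILE_NAME", "results": {
  "transcripts": [{"transcript": "MY TRANSCRIPT"}],
  "items": [
    {"start_time": "1.58", "end_time": "1.73",
     "alternatives": [{"confidence": "0.9991", "content": "than"}],
     "type": "pronunciation"},
    {"start_time": "2.16", "end_time": "2.37",
     "alternatives": [{"confidence": "0.925", "content": "too"}],
     "type": "pronunciation"},
    {"alternatives": [{"confidence": "0.0", "content": "."}],
     "type": "punctuation"}
  ]}, "status": "COMPLETED"}
""")


def full_transcript(data):
    """Return the complete transcript text."""
    return data["results"]["transcripts"][0]["transcript"]


def low_confidence_words(data, threshold=0.95):
    """Return (word, confidence) pairs for spoken words below `threshold`."""
    words = []
    for item in data["results"]["items"]:
        if item["type"] != "pronunciation":
            continue  # skip punctuation, which has no timing information
        best = item["alternatives"][0]
        confidence = float(best["confidence"])  # stored as a string
        if confidence < threshold:
            words.append((best["content"], confidence))
    return words


print(full_transcript(sample))
print(low_confidence_words(sample))
```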
