
Python IBM Watson Speech to Text API Convert Transcript to CSV

I am using the IBM Watson Speech to Text API in Python and storing the JSON response as a nested dictionary. I can access a single record using pprint(data_response['results'][0]['alternatives'][0]['transcript']), but I cannot print all of the transcripts. I need to dump the entire transcript into a .csv file. I tried a generator expression in the format suggested to me in another post, print(a["confidence"] for r in data_response["results"] for a in r["alternatives"]), but I must not be understanding how generator expressions work.

Here is what the nested dictionary looks like using pretty print:

{'result_index': 0,
 'results': [{'alternatives': [{'confidence': 0.99, 'transcript': 'hello '}],
              'final': True},
             {'alternatives': [{'confidence': 0.9,
                                'transcript': 'good morning any this is '}],
              'final': True},
             {'alternatives': [{'confidence': 0.59,
                                'transcript': "I'm on a recorded morning "
                                              '%HESITATION today start running '
                                              "yeah it's really good how are "
                                              "you %HESITATION it's one three "
                                              'six thank you so much for '
                                              'asking '}],
              'final': True},
             {'alternatives': [{'confidence': 0.87,
                                'transcript': 'I appreciate this opportunity '
                                              'to get together with you and '
                                              '%HESITATION you know learn more '
                                              'about you your interest in '}],
              'final': True},
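
For readers who want to reproduce the behavior described in the question, here is a minimal sketch on a trimmed copy of the structure above (only the first two results are included, which is an assumption made to keep the example short). It illustrates that print() on a bare generator expression receives the generator object rather than its values, and shows two common ways to get at the values.

```python
# Trimmed copy of the response structure shown above (first two results only).
data_response = {
    "result_index": 0,
    "results": [
        {"alternatives": [{"confidence": 0.99, "transcript": "hello "}],
         "final": True},
        {"alternatives": [{"confidence": 0.9,
                           "transcript": "good morning any this is "}],
         "final": True},
    ],
}

# A generator expression is lazy: print() is handed the generator object itself,
# so this prints something like <generator object <genexpr> at 0x...>.
gen = (a["confidence"] for r in data_response["results"] for a in r["alternatives"])
print(gen)

# Consuming the generator, for example with list(), produces the actual values.
confidences = list(gen)
print(confidences)

# Alternatively, unpack a generator so each value becomes its own print() argument.
print(*(a["transcript"] for r in data_response["results"] for a in r["alternatives"]),
      sep="\n")
```

Using a list comprehension (square brackets) instead of a generator expression avoids the extra list() call when the values are needed more than once.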

Edit: here is my final solution for converting a list of .pkl files to .csv files. It uses the response from @SeaChange, which helped with exporting only the transcript portion of the nested dictionary. I'm sure there are more efficient ways to convert the files, but it worked great for my application.

import glob
import os
import pickle

import pandas as pd

# set the input path (os.path.join avoids backslash escapes in the string)
input_path = os.path.join("00_data", "Watson Responses")

# set the output path
output_path = os.path.join("00_data", "Watson Scripts")

# list every .pkl file under the input path, including subdirectories
files = glob.glob(os.path.join(input_path, "**", "*.pkl"), recursive=True)

# open each pkl file, convert the list of transcripts to a dataframe, and export to a csv
for file in files:
    # file name without its directory or extension, reused for the csv name
    f_name, _ = os.path.splitext(os.path.basename(file))
    # the with block closes the file even if unpickling fails
    with open(file, "rb") as pkl_file:
        data_response = pickle.load(pkl_file)
    transcripts = [a["transcript"] for r in data_response["results"] for a in r["alternatives"]]
    dataframe = pd.DataFrame(transcripts)
    dataframe.to_csv(os.path.join(output_path, f"{f_name}.csv"), index=False, header=False)
Use a list comprehension to collect every transcript:

transcripts = [a["transcript"] for r in data_response["results"] for a in r["alternatives"]]

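That gives you a list of all the transcripts. At that point it just depends on how you want the output file formatted. If you want each transcript on a new line, you can use writelines for that; note that writelines does not append newline characters, so you have to add them to each string yourself.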
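
As a rough sketch of the two output options, the example below writes the transcripts once with writelines and once with the standard csv module. The sample data, the temporary directory, the file names, the column headers, and the choice to strip trailing spaces are all assumptions made for illustration, not part of the original answer.

```python
import csv
import os
import tempfile

# Hypothetical sample shaped like the Watson response in the question.
data_response = {
    "results": [
        {"alternatives": [{"confidence": 0.99, "transcript": "hello "}],
         "final": True},
        {"alternatives": [{"confidence": 0.9,
                           "transcript": "good morning any this is "}],
         "final": True},
    ],
}

transcripts = [a["transcript"] for r in data_response["results"] for a in r["alternatives"]]

# Write into a temporary directory so the sketch does not touch real data.
out_dir = tempfile.mkdtemp()
txt_path = os.path.join(out_dir, "transcript.txt")
csv_path = os.path.join(out_dir, "transcript.csv")

# Option 1: one transcript per line. writelines() writes each string as-is,
# so the newline is appended explicitly (trailing spaces are stripped first).
with open(txt_path, "w", encoding="utf-8") as f:
    f.writelines(t.strip() + "\n" for t in transcripts)

# Option 2: a real CSV with a confidence column. csv.writer handles quoting
# if a transcript ever contains a comma or a quote character.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["confidence", "transcript"])
    for r in data_response["results"]:
        for a in r["alternatives"]:
            writer.writerow([a["confidence"], a["transcript"].strip()])
```

The csv module is worth considering over plain writelines whenever the output will be opened in a spreadsheet, since it quotes fields that contain commas.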

