
How to extract data from a field in JSON Lines format and store it in a new text file in Python

I have a JSON file that looks like this:

{"reviewerID": "A11N155CW1UV02", "asin": "B000H00VBQ", "reviewerName": "AdrianaM", "helpful": [0, 0], "reviewText": "I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all.", "overall": 2.0, "summary": "A little bit boring for me", "unixReviewTime": 1399075200, "reviewTime": "05 3, 2014"}
{"reviewerID": "A3BC8O2KCL29V2", "asin": "B000H00VBQ", "reviewerName": "Carol T", "helpful": [0, 0], "reviewText": "I highly recommend this series. It is a must for anyone who is yearning to watch \"grown up\" television. Complex characters and plots to keep one totally involved. Thank you Amazin Prime.", "overall": 5.0, "summary": "Excellent Grown Up TV", "unixReviewTime": 1346630400, "reviewTime": "09 3, 2012"}
{"reviewerID": "A60D5HQFOTSOM", "asin": "B000H00VBQ", "reviewerName": "Daniel Cooper \"dancoopermedia\"", "helpful": [0, 1], "reviewText": "This one is a real snoozer. Don't believe anything you read or hear, it's awful. I had no idea what the title means. Neither will you.", "overall": 1.0, "summary": "Way too boring for me", "unixReviewTime": 1381881600, "reviewTime": "10 16, 2013"}

I need to extract the data from the "summary" and "reviewText" fields and store it in two new files for further analysis, such as tokenization.

I am trying this:

import json
rt = open("review.txt", "a") #creates new file for storage
su = open("summary.txt", "a")


with open("/Users/anano/Desktop/MAXWELL/SPRING/NLP/Amazon_Instant_Video_5.json") as json_file:
    for line in json_file: #runs the loop to extract info
        data = json.loads(line)
        rt.write(data['reviewText'])
        su.write(data['summary'])
    rt.close() #close both files once, after the loop has finished
    su.close()

Because the summaries do not end with a period, all of the strings are saved as one run-on sentence, like this:

A little bit boring for meExcellent Grown Up TVWay too boring for meRobson Green is mesmerizing

This makes tokenization impossible. How can I solve this problem?

All you need to do is add \n to the end of each string before writing it. (\n is an escape sequence that represents a newline character, so every string ends up on its own line.)

So your code becomes:

import json
rt = open("review.txt", "a") #creates new file for storage
su = open("summary.txt", "a")

with open("/Users/anano/Desktop/MAXWELL/SPRING/NLP/Amazon_Instant_Video_5.json") as json_file:
    for line in json_file: #runs the loop to extract info
        data = json.loads(line)
        rt.write(data['reviewText'] + '\n')
        su.write(data['summary'] + '\n')

    rt.close()
    su.close()
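
As a possible variation on the code above, here is a minimal sketch that opens all three files in a single with statement, so they are closed automatically even if a line fails to parse. The function name extract_fields and its parameters are hypothetical, not part of the original code. It also makes two assumptions that differ from the original: it opens the output files in "w" mode instead of "a" (so re-running it does not duplicate lines), and it collapses whitespace inside each string so that a review containing its own line breaks still takes exactly one line.

```python
import json


def extract_fields(src_path, review_path, summary_path):
    """Write each record's reviewText and summary on its own line.

    Returns the number of records written.
    """
    count = 0
    # One `with` statement manages all three files, so each is closed
    # automatically, even if json.loads raises on a malformed line.
    with open(src_path, encoding="utf-8") as src, \
         open(review_path, "w", encoding="utf-8") as rt, \
         open(summary_path, "w", encoding="utf-8") as su:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            data = json.loads(line)
            # Collapse internal whitespace so that a review containing its
            # own line breaks still occupies exactly one output line.
            rt.write(" ".join(data["reviewText"].split()) + "\n")
            su.write(" ".join(data["summary"].split()) + "\n")
            count += 1
    return count
```

It could be called as extract_fields("Amazon_Instant_Video_5.json", "review.txt", "summary.txt"). A tokenizer can then read either output file line by line and treat each line as one review or one summary.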


 