
Less memory intensive way to parse large JSON file in Python

Here is my code:

import json
data = []
with open("review.json") as f:
    for line in f:
        data.append(json.loads(line))

lst_string = []
lst_num = []
for i in range(len(data)):
    if (data[i]["stars"] == 5.0):
        x = data[i]["text"]
        for word in x.split():
            if word in lst_string:
                lst_num[lst_string.index(word)] += 1
            else:
                lst_string.append(word)
                lst_num.append(1)

result = set(zip(lst_string, lst_num))
print(result)
with open("set.txt", "w") as g:
    g.write(str(result))

I'm trying to write out the set of all words (with their counts) that appear in reviews rated 5 stars. The reviews come from a JSON file with one object per line, formatted like this:

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}
{"review_id":"GJXCdrto3ASJOqKeVWPi6Q","user_id":"yXQM5uF2jS6es16SJzNHfg","business_id":"NZnhc2sEQy3RmzKTZnqtwQ","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon!  I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level!  \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit.  Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room.  Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure.  That was superb!  Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen.  The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement.  It was so much fun to be there! \n\nNext Travis started with the flat iron.  The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable.  It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists.  At the end of the blowout & style my hair was perfectly bouncey and looked terrific.  The only thing better?  That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas.  You make me feel beauuuutiful!","date":"2017-01-14 21:30:33"}
{"review_id":"2TzJjDVDEuAW6MR5Vuc1ug","user_id":"n6-Gk65cPZL6Uz8qRm3NYw","business_id":"WTqjgwHlXbSFevF32_DJVw","stars":1.0,"useful":3,"funny":0,"cool":0,"text":"I have to say that this office really has it together, they are so organized and friendly!  Dr. J. Phillipp is a great dentist, very friendly and professional.  The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable!  I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit!  I highly recommend this office for the nice synergy the whole office has!","date":"2016-11-09 20:09:03"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.\n\nDrink prices were pretty good.\n\nThe Server, Dawn, was friendly and accommodating. Very happy with her.\n\nIn summation, a great pub experience. Would go again!","date":"2018-01-09 20:56:38"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":5.0,"useful":0,"funny":0,"cool":0,"text":"a b aa bb a b","date":"2018-01-09 20:56:38"}

However, the script uses all the memory on my computer before it can write anything to the text file. Is there a less memory-intensive way to do this?

Only get the text where stars == 5:

Data:

  • Based on the question, the data is a file with one JSON object per line, each of which parses to a dict.

Get the text into a list:

  • With the data from the Yelp Challenge, getting the 5-star review text into a list doesn't take much memory.
    • The Windows resource manager showed an increase of about 1.3GB, but the object size of text_list was about 25MB.
import json

text_list = []
with open("review.json", encoding="utf8") as f:
    for line in f:  # read one review per line instead of the whole file
        review = json.loads(line)
        if review['stars'] == 5:  # keep only the text of 5-star reviews
            text_list.append(review['text'])

print(text_list)

>>> ['Test text, example 1!', 'Test text, example 2!']
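A caveat on the size figures above: the answer does not say how the 25MB was measured. If it came from sys.getsizeof (an assumption), note that this call reports only the list object itself, that is, its array of references, and not the strings the list points to. A rough sketch of the difference, using a made-up list in place of text_list:

```python
import sys

# Hypothetical stand-in for text_list: 1,000 distinct review-sized strings.
sample = [f"review number {i} " + "x" * 500 for i in range(1000)]

# Shallow size: the list object and its array of references only.
shallow = sys.getsizeof(sample)

# Rough deep size: add the size of every string the list refers to.
deep = shallow + sum(sys.getsizeof(s) for s in sample)

print(f"shallow: {shallow:,} bytes")
print(f"deep:    {deep:,} bytes")
```

The deep figure is still an estimate, but it is usually much closer to what the data really occupies than the shallow one.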

Extra:

  • Every step after loading the data seems to require a lot of memory that isn't released.
  • While cleaning the text, the usage shown in the Windows resource manager went up by 16GB, though the final size of clean_text was also only about 25MB.
    • Interestingly, deleting clean_text does not release the 16GB of memory.
    • In Jupyter Lab, restarting the kernel releases the memory.
    • In PyCharm, stopping the process also releases the memory.
    • I tried running the garbage collector manually, but that didn't release the memory.
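To separate what Python objects currently occupy from what the operating system reports for the process, the standard-library tracemalloc module may help. A minimal sketch, with a small made-up workload standing in for the cleaning step; the numbers it prints cover only allocations traced by Python, so they are not expected to match the resource manager:

```python
import tracemalloc

tracemalloc.start()

# Hypothetical workload: build a large list of word lists, then drop it.
words = [("word %d" % i).split() for i in range(200_000)]
current_before, peak_before = tracemalloc.get_traced_memory()

del words
current_after, peak_after = tracemalloc.get_traced_memory()

tracemalloc.stop()

print(f"traced while the list is alive: {current_before / 1e6:.1f} MB")
print(f"traced after del:               {current_after / 1e6:.1f} MB")
print(f"peak:                           {peak_after / 1e6:.1f} MB")
```

If the traced figure drops after del while the process figure stays high, the objects themselves have been freed and the memory has most likely been kept by the allocator for reuse rather than returned to the operating system.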

Clean text_list:

import string

def clean_string(value: str) -> list:
    value = value.lower()
    value = value.translate(str.maketrans('', '', string.punctuation))
    value = value.split()
    return value

clean_text = [clean_string(item) for item in text_list]
print(clean_text)

>>> [['test', 'text', 'example', '1'], ['test', 'text', 'example', '2']]

Count words in clean_text:

from collections import Counter

words = Counter()

for item in clean_text:
    words.update(item)

print(words)

>>> Counter({'test': 2, 'text': 2, 'example': 2, '1': 1, '2': 1})
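The three steps above (filter, clean, count) can also be folded into a single pass, so that neither text_list nor clean_text is ever held in memory and only the Counter grows. A minimal sketch of that idea; the function name count_five_star_words is made up for illustration, and this is a variation on the answer's code rather than the answer's own method:

```python
import json
import string
from collections import Counter

# Translation table that deletes ASCII punctuation, built once and reused.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def count_five_star_words(lines) -> Counter:
    """Count words in 5-star reviews from an iterable of JSON lines.

    `lines` can be an open file object, which yields one line at a time,
    so only the current review and the running counts stay in memory.
    """
    words = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        review = json.loads(line)
        if review["stars"] == 5:
            cleaned = review["text"].lower().translate(_PUNCT_TABLE)
            words.update(cleaned.split())
    return words
```

To use it on the file from the question, open review.json with encoding="utf8" and pass the file object to count_five_star_words; the returned Counter (or its most_common() list) can then be written to set.txt.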
