Less memory-intensive way to parse a large JSON file in Python

Question

Here is my code:

import json
data = []
with open("review.json") as f:
    for line in f:
        data.append(json.loads(line))

lst_string = []
lst_num = []
for i in range(len(data)):
    if (data[i]["stars"] == 5.0):
        x = data[i]["text"]
        for word in x.split():
            if word in lst_string:
                lst_num[lst_string.index(word)] += 1
            else:
                lst_string.append(word)
                lst_num.append(1)

result = set(zip(lst_string, lst_num))
print(result)
with open("set.txt", "w") as g:
    g.write(str(result))

I'm trying to write out a set of all the words in reviews that were given 5 stars, read from a JSON file formatted like this:

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}
{"review_id":"GJXCdrto3ASJOqKeVWPi6Q","user_id":"yXQM5uF2jS6es16SJzNHfg","business_id":"NZnhc2sEQy3RmzKTZnqtwQ","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon!  I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level!  \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit.  Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room.  Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure.  That was superb!  Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen.  The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement.  It was so much fun to be there! \n\nNext Travis started with the flat iron.  The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable.  It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists.  At the end of the blowout & style my hair was perfectly bouncey and looked terrific.  The only thing better?  That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas.  You make me feel beauuuutiful!","date":"2017-01-14 21:30:33"}
{"review_id":"2TzJjDVDEuAW6MR5Vuc1ug","user_id":"n6-Gk65cPZL6Uz8qRm3NYw","business_id":"WTqjgwHlXbSFevF32_DJVw","stars":1.0,"useful":3,"funny":0,"cool":0,"text":"I have to say that this office really has it together, they are so organized and friendly!  Dr. J. Phillipp is a great dentist, very friendly and professional.  The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable!  I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit!  I highly recommend this office for the nice synergy the whole office has!","date":"2016-11-09 20:09:03"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.\n\nDrink prices were pretty good.\n\nThe Server, Dawn, was friendly and accommodating. Very happy with her.\n\nIn summation, a great pub experience. Would go again!","date":"2018-01-09 20:56:38"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":5.0,"useful":0,"funny":0,"cool":0,"text":"a b aa bb a b","date":"2018-01-09 20:56:38"}

But it uses all the memory on my computer before it can write the output to a text file. How can I do this in a less memory-intensive way?

Answer 1

Only get text where `stars == 5`:

Data:

Based on the question, the data is a file containing one dict (JSON object) per line.

Get the text into a list:

Using the data from the Yelp Challenge, getting the 5-star text into a list doesn't take that much memory.
- The Windows resource manager showed an increase of about 1.3GB, but the object size of text_list was only about 25MB.

import json

text_list = list()
with open("review.json", encoding="utf8") as f:
    for line in f:
        line = json.loads(line)
        if line['stars'] == 5:
            text_list.append(line['text'])

print(text_list)

>>> ['Test text, example 1!', 'Test text, example 2!']

Extra:

Everything after loading the data seems to require a lot of memory that isn't being released.
When cleaning the text, the Windows resource manager went up by 16GB, though the final size of clean_text was also only about 25MB.
- Interestingly, deleting clean_text does not release the 16GB of memory.
- In Jupyter Lab, restarting the kernel releases the memory.
- In PyCharm, stopping the process also releases the memory.
- I tried manually running the garbage collector, but that didn't release the memory.
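The answer does not say how the 25MB figures were measured. If they came from `sys.getsizeof`, note that for a list it reports only the list container, not the strings it refers to, which would help explain the gap between the object size and what the resource manager shows. A minimal sketch of the difference, where `approx_size` is a hypothetical helper (not part of the original answer) that also counts the elements:

```python
import sys

def approx_size(obj):
    """Rough size in bytes of a str, or of a (possibly nested) list of str."""
    size = sys.getsizeof(obj)
    if isinstance(obj, list):
        # sys.getsizeof on a list counts only the list's own pointer array,
        # so add the size of each element it refers to.
        size += sum(approx_size(item) for item in obj)
    return size

sample = ["Test text, example 1!", "Test text, example 2!"]
shallow = sys.getsizeof(sample)   # the list container alone
deep = approx_size(sample)        # container plus the strings it holds
print(shallow, deep)
```

This is only an estimate: an object referenced more than once is counted each time, and it says nothing about memory the interpreter has allocated but not returned to the operating system.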

Clean `text_list`:

import string

def clean_string(value: str) -> list:
    value = value.lower()
    value = value.translate(str.maketrans('', '', string.punctuation))
    value = value.split()
    return value

clean_text = [clean_string(item) for item in text_list]
print(clean_text)

>>> [['test', 'text', 'example', '1'], ['test', 'text', 'example', '2']]

Count words in `clean_text`:

from collections import Counter

words = Counter()

for item in clean_text:
    words.update(item)

print(words)

>>> Counter({'test': 2, 'text': 2, 'example': 2, '1': 1, '2': 1})
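The three steps above (filter, clean, count) can also be folded into a single pass, so that neither `text_list` nor `clean_text` is ever held in memory and only the `Counter` grows. This is a sketch under that assumption rather than the answer's own code; `count_five_star_words` and `_PUNCT_TABLE` are illustrative names, and the function takes any iterable of JSON lines, so an open file object works directly.

```python
import json
import string
from collections import Counter

# Translation table that deletes ASCII punctuation; built once and reused.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def count_five_star_words(lines):
    """Count words in 5-star reviews from an iterable of JSON lines."""
    words = Counter()
    for line in lines:
        review = json.loads(line)
        if review['stars'] == 5:
            # Same cleaning as clean_string: lowercase, strip punctuation, split.
            text = review['text'].lower().translate(_PUNCT_TABLE)
            words.update(text.split())
    return words

# Usage, assuming review.json is in the working directory:
# with open("review.json", encoding="utf8") as f:
#     words = count_five_star_words(f)
# with open("set.txt", "w", encoding="utf8") as g:
#     for word, count in words.most_common():
#         g.write(f"{word}\t{count}\n")
```

Each review is discarded as soon as its words have been counted, so peak memory should depend mainly on the number of distinct words rather than on the size of the file.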

Answer 1 posted 2019-10-04 00:37:47
