
Less memory-intensive way to parse a large JSON file in Python

Here is my code:

import json
data = []
with open("review.json") as f:
    for line in f:
        data.append(json.loads(line))

lst_string = []
lst_num = []
for i in range(len(data)):
    if (data[i]["stars"] == 5.0):
        x = data[i]["text"]
        for word in x.split():
            if word in lst_string:
                lst_num[lst_string.index(word)] += 1
            else:
                lst_string.append(word)
                lst_num.append(1)

result = set(zip(lst_string, lst_num))
print(result)
with open("set.txt", "w") as g:
    g.write(str(result))

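I am trying to write out the set of all words that appear in 5-star reviews. The reviews are read from a JSON file with one record per line, in the following format: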

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}
{"review_id":"GJXCdrto3ASJOqKeVWPi6Q","user_id":"yXQM5uF2jS6es16SJzNHfg","business_id":"NZnhc2sEQy3RmzKTZnqtwQ","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon!  I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level!  \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit.  Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room.  Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure.  That was superb!  Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen.  The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement.  It was so much fun to be there! \n\nNext Travis started with the flat iron.  The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable.  It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists.  At the end of the blowout & style my hair was perfectly bouncey and looked terrific.  The only thing better?  That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas.  You make me feel beauuuutiful!","date":"2017-01-14 21:30:33"}
{"review_id":"2TzJjDVDEuAW6MR5Vuc1ug","user_id":"n6-Gk65cPZL6Uz8qRm3NYw","business_id":"WTqjgwHlXbSFevF32_DJVw","stars":1.0,"useful":3,"funny":0,"cool":0,"text":"I have to say that this office really has it together, they are so organized and friendly!  Dr. J. Phillipp is a great dentist, very friendly and professional.  The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable!  I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit!  I highly recommend this office for the nice synergy the whole office has!","date":"2016-11-09 20:09:03"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.\n\nDrink prices were pretty good.\n\nThe Server, Dawn, was friendly and accommodating. Very happy with her.\n\nIn summation, a great pub experience. Would go again!","date":"2018-01-09 20:56:38"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":5.0,"useful":0,"funny":0,"cool":0,"text":"a b aa bb a b","date":"2018-01-09 20:56:38"}

But it uses up all the memory on my computer before the output is ever written to the text file. How can I do this in a less memory-intensive way?

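Get only the text where stars == 5:

Data:

  • Based on the question, the data is a file in which each line is a dict.

Put the text into a list:

  • Given the data from the Yelp Challenge, putting the 5-star text into a list does not take up much memory.
    • The Windows resource manager showed an increase of about 1.3GB, but the object size of text_list was only about 25MB.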
import json

text_list = list()
with open("review.json", encoding="utf8") as f:
    for line in f:
        line = json.loads(line)
        if line['stars'] == 5:
            text_list.append(line['text'])

print(text_list)

>>> ['Test text, example 1!', 'Test text, example 2!']
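The answer does not say how the 25MB figure for text_list was measured. If it came from sys.getsizeof, keep in mind that this call reports only the list object itself (its array of references), not the strings it points to, which may explain part of the gap to the 1.3GB seen for the process. A minimal sketch of the difference, using a small hypothetical stand-in list rather than the real data:

```python
import sys

# Hypothetical stand-in for the real text_list built above.
sample_text_list = ["Test text, example 1!", "Test text, example 2!"]

# Shallow size: the list object and its array of references only.
shallow_size = sys.getsizeof(sample_text_list)

# Deeper estimate: also add the size of every string the list refers to.
deep_size = shallow_size + sum(sys.getsizeof(s) for s in sample_text_list)

print(f"shallow: {shallow_size} bytes, with strings: {deep_size} bytes")
```

Even the deeper estimate is only approximate, since it ignores allocator overhead and the temporary dicts created by json.loads.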

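Additional notes:

  • Everything after loading the data seems to require a lot of memory that is not released afterwards.
  • While cleaning the text, the Windows resource manager went up by 16GB, even though the final size of clean_text was also only about 25MB.
    • Interestingly, deleting clean_text does not release the 16GB of memory.
    • In Jupyter Lab, restarting the kernel releases the memory.
    • In PyCharm, stopping the process also releases the memory.
    • I tried running the garbage collector manually, but that did not release the memory.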
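Figures from an OS-level tool describe the whole process and may include memory that the allocator has not handed back to the operating system, so they can differ from what Python objects currently hold. One way to look at Python-level allocations directly is the standard-library tracemalloc module. The sketch below is an illustration only: build_word_lists and the sample data are hypothetical stand-ins for the cleaning step, not the answer's code.

```python
import tracemalloc


def build_word_lists(texts):
    # Hypothetical workload standing in for the text-cleaning step.
    return [t.lower().split() for t in texts]


tracemalloc.start()

sample = ["Test text, example 1!"] * 1000
result = build_word_lists(sample)

# current: bytes still allocated by Python; peak: highest value since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current={current} bytes, peak={peak} bytes")
```

A peak far above the current value would suggest that the cost comes from temporary objects created along the way rather than from the final result.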

Clean text_list:

import string

def clean_string(value: str) -> list:
    value = value.lower()
    value = value.translate(str.maketrans('', '', string.punctuation))
    value = value.split()
    return value

clean_text = [clean_string(item) for item in text_list]
print(clean_text)

>>> [['test', 'text', 'example', '1'], ['test', 'text', 'example', '2']]

Count the words in clean_text:

from collections import Counter

words = Counter()

for item in clean_text:
    words.update(item)

print(words)

>>> Counter({'test': 2, 'text': 2, 'example': 2, '1': 1, '2': 1})
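The three steps above each keep a full intermediate list in memory (text_list, then clean_text). As a possible variation, not part of the original answer, the filtering, cleaning, and counting can be folded into a single pass so that only one review and the running Counter are held at a time. The function name count_five_star_words and the sample lines are assumptions made for this sketch:

```python
import json
import string
from collections import Counter

# Build the punctuation-stripping table once instead of once per review.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def count_five_star_words(lines):
    """Count words in 5-star reviews from an iterable of JSON lines."""
    words = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        if review["stars"] == 5:
            cleaned = review["text"].lower().translate(PUNCT_TABLE)
            words.update(cleaned.split())
    return words


# Small in-memory sample in the same shape as the question's data.
sample_lines = [
    '{"stars": 5.0, "text": "a b aa bb a b"}',
    '{"stars": 1.0, "text": "Would go again!"}',
]
print(count_five_star_words(sample_lines))
```

Because a file object yields its lines lazily, the real file can be passed in directly, for example by calling count_five_star_words(f) inside a with open("review.json", encoding="utf8") as f: block. Whether this lowers the peak memory reported by the OS on the full dataset has not been measured here.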

