MemoryError in Python: How can I optimise my code?

I have a number of json files to combine and output as a single csv (to load into R), with each json file at about 1.5gb. While doing a trial run on 4-5 json files at 250mb each, I get the error below. I'm running Python version '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]' on 8gb ram and Windows 7 Professional 64-bit.

I'm a Python novice with little experience writing optimized code, and I would appreciate guidance on how I can optimize my script below. Thank you!

======= Python MemoryError =======

Traceback (most recent call last):
  File "C:\Users\...\tweetjson_to_csv.py", line 52, in <module>
    for line in file:
MemoryError
[Finished in 29.5s]

======= json to csv conversion script =======

import json
from csv import writer

# csv file that you want to save to
out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
open_files = map(open, filenames)

# change argument to the file you want to open
for file in open_files:
    for line in file:
        # only keep tweets and not the empty lines
        if line.rstrip():
            try:
                tweets.append(json.loads(line))
            except:
                pass

for tweet in tweets:
    ids.append(tweet["id_str"])
    texts.append(tweet["text"])
    time_created.append(tweet["created_at"])
    retweet_counts.append(tweet["retweet_count"])
...

print >> out, "ids,text,time_created,retweet_counts,in_reply_to,geos,coordinates,places,country,language,screen_name,followers,friends,statuses,locations"
rows = zip(ids,texts,time_created,retweet_counts,in_reply_to_screen_name,geos,coordinates,places,places_country,lang,user_screen_names,user_followers_count,user_friends_count,user_statuses_count,user_locations)

csv = writer(out)

for row in rows:
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)

out.close()

This line right here:

open_files = map(open, filenames)

Opens every file at once.

Then you read everything and munge it all into the same single list, tweets.

And you have two main for loops, so each tweet (of which there are several GBs' worth) is iterated over not twice but a staggering four times, because you then add in the zip call and one more loop to write everything out to the file. Any one of those points could be the cause of the memory error.
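
As an aside on the zip point: in Python 2, zip() materialises the full list of row tuples on top of the per-column lists you already built, whereas itertools.izip yields the rows lazily. A minimal sketch with toy stand-in lists (the real fix is still the streaming approach below):

# In Python 2, zip() builds the complete list of row tuples in memory;
# itertools.izip produces them one at a time. The toy lists below stand
# in for the real per-tweet columns.
from itertools import izip

ids = ["1", "2", "3"]
texts = ["first tweet", "second tweet", "third tweet"]

rows = zip(ids, texts)        # materialises every row up front
lazy_rows = izip(ids, texts)  # yields one row at a time

for row in lazy_rows:
    print row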

Unless absolutely necessary, try to only touch each piece of data once. As you iterate through a file, process the line and write it out immediately.

Try something like this instead:

out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]

def process_tweet_into_line(line):
    # load the line as json, turn it into a csv row and return it
    return line

# stream through each file, writing each processed line out immediately
for name in filenames:
    with open(name) as file:
        for line in file:
            # only keep tweets and not the empty lines
            if line.rstrip():
                try:
                    tweet = process_tweet_into_line(line)
                    out.write(tweet)
                except:
                    pass

out.close()
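
For a slightly fuller sketch of the same streaming approach: the version below assumes the field names from the question's code (id_str, text, created_at, retweet_count), collected in a FIELDS helper list introduced here (extend it with the remaining columns), and uses csv.writer so that commas and quotes inside tweet text are escaped properly.

import json
from csv import writer

# field names taken from the question's code; add the other columns here
FIELDS = ["id_str", "text", "created_at", "retweet_count"]

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]

with open("output.csv", "wb") as out:
    csv_out = writer(out)
    csv_out.writerow(FIELDS)  # header row

    for name in filenames:
        with open(name) as f:
            for line in f:
                # skip the empty lines between tweets
                if not line.rstrip():
                    continue
                try:
                    tweet = json.loads(line)
                except ValueError:
                    # skip lines that are not valid json
                    continue
                row = [tweet.get(field, "") for field in FIELDS]
                # encode unicode values for the Python 2 csv module
                csv_out.writerow([v.encode("utf8") if hasattr(v, "encode") else v
                                  for v in row])

Peak memory use is then bounded by a single line and a single row, no matter how many files you feed in.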
