
Memory error while reading a very long line in a JSON file with Python

I have a 1 GB JSON file with very long lines. When I try to load a line from the file, I get this error in the PyCharm console:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2017.3.3\helpers\pydev\pydev_run_in_console.py", line 53, in run_file
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "......... .py", line 26, in <module>
    for line in f:
MemoryError
PyDev console: starting.
Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)] on win32

I'm on a Windows Server machine with 64 GB of RAM.

My code is:

import numpy as np
import json
import sys
import re

idRegEx = re.compile(r".*ID=")
endElRegEx = re.compile(r"'.*")

ratingsFile = sys.argv[1]
tweetsFile = sys.argv[2]
outputFile = sys.argv[3]

tweetsMap = {}
with open(tweetsFile, "r") as f:

    for line in f:
        tweetData = json.loads(line)
        tweetsMap[tweetData["key"]] = tweetData

output = open(outputFile, "w")

with open(ratingsFile, "r") as f:
    header = f.next()

    for line in f:
        topicData = line.split("\t")

        topicKey = topicData[0]
        topicTerms = topicData[1]
        ratings = topicData[2]
        reasons = topicData[3]

        ratings = map(lambda x: int(x.strip().replace("'", "")), ratings.replace("[", "").replace("]", "").split(","))
        ratings = np.array(ratings)

        tweetsMap[topicKey]["ratings"] = ratings.tolist()
        tweetsMap[topicKey]["mean"] = ratings.mean()

        topicMap = tweetsMap[topicKey]

        print topicMap["key"], topicMap["mean"]

        json.dump(topicMap, output, sort_keys=True)
        output.write("\n")

output.close()

Line 26 in the error message refers to

tweetData = json.loads(line)

while line 53 refers to

json.dump(topicMap, output, sort_keys=True)

The strange thing is that I forked this code from GitHub, so I think it should work.

It looks like you're using a 32-bit version of Python:

Python 2.7.14 (...) [MSC v.1500 32 bit (Intel)] on win32
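
If the startup banner is not at hand, the same thing can be checked from inside the interpreter. This is a minimal sketch using only the standard library; the function names `interpreter_bits` and `is_64bit` are made up for illustration:

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer size of the running interpreter in bits (32 or 64)."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


def is_64bit():
    """Return True when sys.maxsize exceeds the 32-bit range."""
    return sys.maxsize > 2 ** 32


if __name__ == "__main__":
    print("%d-bit Python, 64-bit: %s" % (interpreter_bits(), is_64bit()))
```

Both checks describe the Python build itself, not the operating system, so a 32-bit Python installed on 64-bit Windows still reports 32.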

On Windows, a 32-bit process is limited to 2 GB of address space by default, which is why you get a MemoryError even though the machine has plenty of RAM. Switching to the 64-bit version of Python should fix the issue without any changes to your script.
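
If installing 64-bit Python is not an option, changing the script so that it holds less data in memory may help. The sketch below is an assumption about how that could look, not something tested against the original data: it reads the keys from the ratings file first and then keeps only the tweet records that are actually rated. The helper names `read_rated_keys` and `load_needed_tweets` are hypothetical. It only reduces memory if the ratings file references a subset of the tweets, and it cannot help if a single line is too large to read in a 32-bit process.

```python
import json


def read_rated_keys(ratings_path):
    """Collect the topic keys from the first column of the tab-separated ratings file."""
    keys = set()
    with open(ratings_path, "r") as f:
        next(f, None)  # skip the header row
        for line in f:
            if line.strip():
                keys.add(line.split("\t")[0])
    return keys


def load_needed_tweets(tweets_path, wanted_keys):
    """Parse every line, but keep only records whose "key" is in wanted_keys."""
    tweets = {}
    with open(tweets_path, "r") as f:
        for line in f:
            record = json.loads(line)
            if record["key"] in wanted_keys:
                tweets[record["key"]] = record
    return tweets
```

In the script from the question, `tweetsMap` would then be built with `load_needed_tweets(tweetsFile, read_rated_keys(ratingsFile))` instead of the unconditional loop.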

