
Python MemoryError with Queue and threading

I'm currently writing a script that reads reddit comments from a large file (5 GB compressed, ~30 GB of data being read). My script reads the comments, checks for some text, parses them, and sends them off to a Queue function (running in a separate thread). No matter what I do, I always get a MemoryError on a specific iteration (number 8162735, if it matters in the slightest). And I can't seem to handle the error; Windows just keeps shutting down Python when it hits. Here's my script:

import ujson
from tqdm import tqdm
import bz2
import json
import threading
import spacy
import Queue
import time
nlp = spacy.load('en')
def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in (enumerate(file_)):
            yield ujson.loads(line)['body']
objects = iter_comments('RC_2015-01.bz2')
q = Queue.Queue()
f = open("reddit_dump.bin", 'wb')
def worker():
    while True:
        item = q.get()
        f.write(item)
        q.task_done()
for i in range(0, 2):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
def finish_parse(comment):
    global q
    try:
        comment_parse = nlp(unicode(comment))
        comment_bytes = comment_parse.to_bytes()
        q.put(comment_bytes)
    except MemoryError:
        print "MemoryError with comment {0}, waiting for Queue to empty".format(comment)
        time.sleep(2)
    except AssertionError:
        print "AssertionError with comment {0}, skipping".format(comment)
for comment in tqdm(objects):
    comment = str(comment.encode('ascii', 'ignore'))
    if ">" in comment:
        c_parse_thread = threading.Thread(target=finish_parse, args=(comment,))
        c_parse_thread.start()          
q.join()
f.close()

Does anybody know what I'm doing wrong?

Looks like it's not in your code, but it may be in the data. Have you tried skipping that iteration?

x = 0
for comment in tqdm(objects):
    x += 1
    if x != 8162735:
        comment = str(comment.encode('ascii', 'ignore'))
        if ">" in comment:
            c_parse_thread = threading.Thread(target=finish_parse, args=(comment,))
            c_parse_thread.start()
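
If it helps, here is a minimal sketch of the same skip written with enumerate instead of a manual counter (start=1 keeps the 1-based count used above; this assumes the failing iteration really is the 8162735th comment):

for i, comment in enumerate(tqdm(objects), start=1):
    if i == 8162735:
        continue  # skip the iteration that raises the MemoryError
    comment = str(comment.encode('ascii', 'ignore'))
    if ">" in comment:
        c_parse_thread = threading.Thread(target=finish_parse, args=(comment,))
        c_parse_thread.start()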
