Python MemoryError with Queue and threading

Question

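I am currently writing a script that reads reddit comments from a large file (5 GB compressed, about 30 GB of data being read). My script reads each comment, checks it for some text, parses it, and then sends it to a Queue function (running in a separate thread). No matter what I do, I always get a MemoryError on a specific iteration (8162735, in case it makes any difference). I also can't seem to handle the error; Windows just keeps shutting down Python when it hits. Here is my script: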

import ujson
from tqdm import tqdm
import bz2
import json
import threading
import spacy
import Queue
import time
nlp = spacy.load('en')
def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in (enumerate(file_)):
            yield ujson.loads(line)['body']
objects = iter_comments('RC_2015-01.bz2')
q = Queue.Queue()
f = open("reddit_dump.bin", 'wb')
def worker():
    while True:
        item = q.get()
        f.write(item)
        q.task_done()
for i in range(0, 2):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
def finish_parse(comment):
    global q
    try:
        comment_parse = nlp(unicode(comment))
        comment_bytes = comment_parse.to_bytes()
        q.put(comment_bytes)
    except MemoryError:
        print "MemoryError with comment {0}, waiting for Queue to empty".format(comment)
        time.sleep(2)
    except AssertionError:
        print "AssertionError with comment {0}, skipping".format(comment)
for comment in tqdm(objects):
    comment = str(comment.encode('ascii', 'ignore'))
    if "&gt;" in comment:
        c_parse_thread = threading.Thread(target=finish_parse, args=(comment,))
        c_parse_thread.start()          
q.join()
f.close()

Does anyone know what I am doing wrong?

Answer 1

It looks like the problem is not in your code but may be in the data. Have you tried skipping that iteration?

x = 0
for comment in tqdm(objects):
    x += 1
    # Skip only the iteration that triggers the MemoryError
    if x != 8162735:
        comment = str(comment.encode('ascii', 'ignore'))
        if "&gt;" in comment:
            c_parse_thread = threading.Thread(target=finish_parse, args=(comment,))
            c_parse_thread.start()
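
If skipping that iteration makes the error go away, it may help to pull out the offending record and look at it directly. Here is a minimal sketch, assuming the iteration number is 1-based as in the loop above. `skip_indices` and `item_at` are hypothetical helper names, not part of your script or of any library:

```python
from itertools import islice


def skip_indices(items, skip):
    """Yield (index, item) pairs, omitting any 1-based index found in `skip`."""
    for index, item in enumerate(items, start=1):
        if index in skip:
            continue
        yield index, item


def item_at(items, index):
    """Return the item at 1-based position `index`, or None if `items` is shorter."""
    # islice consumes and discards everything before the requested position.
    return next(islice(items, index - 1, index), None)
```

For example, `item_at(iter_comments('RC_2015-01.bz2'), 8162735)` should return the body of that comment so you can check its length and contents, although it still has to decompress everything that comes before it.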
