MemoryError: Unable to allocate 1.83 MiB for an array with shape (5004, 96) and data type int32

Question

When I try to process a huge CSV file, I get MemoryError: Unable to allocate 1.83 MiB for an array with shape (5004, 96) and data type int32. The error occurs at:

processed_texts = [text for text in nlp.pipe(str(tokenized_texts),
                                             disable=["ner",
                                                      "parser"])]

Would using multiple threads solve this? If so, does anyone have an example in Python? I'm coming from Java.

The whole script:

import pandas as pd
import spacy

df = pd.read_csv('posts_result.csv')
df_sample = df.sample(frac=0.1, replace=False, random_state=1)

""" DATA EXPLORATION """

text_test = df_sample.post.tolist()


# Start the tokenization
def tokenize_hashtag(text):
    punctuations = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~'
    for punctuation in punctuations:
        text = str(text).replace(punctuation, '')
    text = text.lower()
    text = text.split()
    return text


tokenized_texts = [tokenize_hashtag(text) for text in text_test]

nlp = spacy.load("en_core_web_sm")

processed_texts = [text for text in nlp.pipe(str(tokenized_texts),
                                             disable=["ner",
                                                      "parser"])]

df_sample['processed'] = tokenized_texts

tokenized_texts = [[word.text for word in text if (word.pos_ == 'NOUN' or word.pos_ == 'VERB' or word.pos_ == 'PROPN') and not len(word.text) >12 and not word.is_punct and not word.is_stop and not word.text=='X'
                   and not word.text == '@Name']
                    for text in processed_texts]

Answer 1

You haven't really provided enough information here, but it looks like you can't hold all of the spaCy Docs in memory.
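As a quick sanity check, the size in the error message can be reproduced by hand from the shape and data type it reports. The sketch below only does that arithmetic; the variable names are illustrative. A request this small failing suggests that memory was probably already close to exhausted by what had been accumulated before it, rather than this one array being the problem.

```python
# Shape and dtype taken from the error message.
rows, cols = 5004, 96
bytes_per_int32 = 4

size_bytes = rows * cols * bytes_per_int32
size_mib = size_bytes / 2**20  # 1 MiB = 2**20 bytes

print(f"{size_mib:.2f} MiB")
```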

A very simple workaround is to split your CSV file and process it one chunk at a time.
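A minimal sketch of the chunking idea, using only the standard library. The helper names iter_chunks and process_chunk are hypothetical, and process_chunk is a stand-in for whatever spaCy work the real script does. If you are reading with pandas, read_csv also accepts a chunksize argument that yields the file piece by piece, which may be the more natural fit here.

```python
from itertools import islice


def iter_chunks(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_chunk(texts):
    # Hypothetical placeholder: in the real script this is where the
    # spaCy pipeline would run and only the kept words be returned.
    return [text.lower().split() for text in texts]


results = []
for chunk in iter_chunks(["I like cheese", "Dogs bark", "Cats sleep"], size=2):
    # Only one chunk of texts is being processed at any time.
    results.extend(process_chunk(chunk))
```

The point of the design is that memory use is bounded by the chunk size plus the results you choose to keep, not by the size of the whole file.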

Another thing you can do: since it looks like you are only keeping some of the words, you can avoid storing the Docs by changing your for loop slightly.

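import spacy

nlp = spacy.load("en_core_web_sm")

def keep_word(word):
    # Keep only nouns, verbs and proper nouns, and drop the "@Name" placeholder.
    if word.pos_ not in ("NOUN", "VERB", "PROPN"):
        return False
    if word.text == "@Name":
        return False
    return True

out = []
# Pass the raw texts, one string per post, rather than str(tokenized_texts).
for doc in nlp.pipe((str(text) for text in text_test), disable=["ner", "parser"]):
    # Store only the kept strings so each Doc can be freed after its iteration.
    out.append([ww.text for ww in doc if keep_word(ww)])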

This way you keep only the strings you want rather than the Docs, so it should reduce memory usage.

Some other comments on your code...

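Whatever you are trying to do with the hashtag function, it isn't working. If you call str(text.split()), the output is really weird: it turns I like cheese into the single string "['I', 'like', 'cheese']", brackets and quotes included, and that will cause spaCy to give you garbage output. I'd recommend not using the function at all; spaCy expects to handle punctuation itself.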
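The effect is easy to reproduce in plain Python, without spaCy. The sketch below shows what str() does to a list of tokens, and what you get when you iterate over the resulting string. Since a string is itself an iterable of characters, handing one string to something that expects an iterable of texts most likely means every character is treated as a separate text.

```python
text = "I like cheese"
tokens = text.split()      # a list of three words

as_string = str(tokens)    # one string, with brackets, quotes and commas
print(as_string)

# Iterating over a string yields single characters, not words.
pieces = list(as_string)[:4]
print(pieces)
```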

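You seem to be using spaCy to remove words based on part of speech (for the most part), but that generally isn't a good idea; modern text processing doesn't need that kind of pre-filtering. About 15 years ago this was still common practice, but you should be able to give complete sentences to any reasonable model, and it will do better with them than with over-filtered text.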
