
Error tokenizing data. C error: out of memory - python

I'm trying to read four .txt files delimited by |.

One of them is over 1 GB, so I found the "chunk" way of reading them, but I still get Error tokenizing data. C error: out of memory, raised on the line df_tradeCash_mhi = pd.concat(chunk_read(mhi_tradeCashFiles, "MHI")).

Does anyone know how I can solve this?

Below is my code:


import os
import re

import pandas as pd
from termcolor import colored

# Note: CoreMurexFilesLoc (the directory holding the .txt files) is defined
# earlier in the script.

def findmefile(directory, containsInFilename):
    entity_filenames = {}
    for file in os.listdir(directory):
        if containsInFilename in file:
            if file[:5] == "Trade":
                entity_filenames["MHI"] = file
            else:
                entity_filenames[re.findall("(.*?)_", file)[0]] = file
    return entity_filenames

# Get the core Murex file names
mhi_tradeFiles = findmefile(CoreMurexFilesLoc, "Trade")
mhi_tradeCashFiles = findmefile(CoreMurexFilesLoc, "TradeCash_")
mheu_tradeFiles = findmefile(CoreMurexFilesLoc, "MHEU")
mheu_tradeCashFiles = findmefile(CoreMurexFilesLoc, "MHEU_TradeCash")

# Read the csv in chunks
size = 10**2
def chunk_read(fileName, entity):
    # Build the chunk list locally: the original module-level `mylist` was
    # shared across calls, so each pd.concat below also received the chunks
    # of every previously read file.
    chunks = []
    for chunk in pd.read_csv(
        CoreMurexFilesLoc + "\\" + fileName[entity],
        delimiter="|",
        low_memory=False,
        chunksize=size,
    ):
        chunks.append(chunk)
    return chunks


df_trade_mhi = pd.concat(chunk_read(mhi_tradeFiles, "MHI"))
df_trade_mheu = pd.concat(chunk_read(mheu_tradeFiles, "MHEU"))
df_tradeCash_mheu = pd.concat(chunk_read(mheu_tradeCashFiles, "MHEU"))
df_tradeCash_mhi = pd.concat(chunk_read(mhi_tradeCashFiles, "MHI"))

df_trades = pd.concat(
    [df_trade_mheu, df_trade_mhi, df_tradeCash_mheu, df_tradeCash_mhi]
)

del df_trade_mhi
del df_tradeCash_mhi
del df_trade_mheu
del df_tradeCash_mheu

# Drop any blank fields and duplicates
nan_value = float("NaN")
df_trades.replace("", nan_value, inplace=True)
df_trades.dropna(subset=["MurexCounterpartyRef"], inplace=True)
df_trades.drop_duplicates(subset=["MurexCounterpartyRef"], inplace=True)

counterpartiesList = df_trades["MurexCounterpartyRef"].tolist()

print(colored('All Core Murex trade and tradeCash data loaded.', "green"))

The error:

Traceback (most recent call last):
  File "h:\DESKTOP\test_check\check_securityPrices.py", line 52, in <module>
    df_tradeCash_mhi = pd.concat(chunk_read(mhi_tradeCashFiles, "MHI"))
  File "h:\DESKTOP\test_check\check_securityPrices.py", line 39, in chunk_read
    for chunk in pd.read_csv(
  File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1024, in __next__
    return self.get_chunk()
  File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1074, in get_chunk
    return self.read(nrows=size)
  File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 228, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 783, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory

I think the problem is obvious: you're running out of memory because you're trying to load that much data into memory at once and then process it.

You need to either:

  • Buy a machine with more memory.
  • Re-architect the solution as a pipeline that processes the data stepwise, using generators or coroutines (see the sketch below).

The problem with the first approach is that it won't scale indefinitely and it's expensive. The second approach is the right way to do it, but it requires more coding.

For a good reference on the generator/coroutine pipeline style, check out any of David Beazley's PyCon talks.
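
As a rough sketch of that second approach, assuming the end goal really is just the deduplicated counterpartiesList (which the dropna/drop_duplicates steps in the question suggest), the chunked reads can feed a generator so that no full DataFrame is ever concatenated. The directory and file names here are hypothetical placeholders for the values produced by findmefile:

import os
import pandas as pd

# Hypothetical stand-ins for CoreMurexFilesLoc and the findmefile() results.
CoreMurexFilesLoc = r"h:\DESKTOP\test_check\data"
fileNames = ["Trade_MHI.txt", "TradeCash_MHI.txt",
             "MHEU_Trade.txt", "MHEU_TradeCash.txt"]

def iter_counterparty_refs(path, chunksize=10**5):
    # Parse only the one column that is needed (usecols) and yield its
    # values chunk by chunk; nothing is retained after a chunk has been
    # consumed, so peak memory stays near one chunk regardless of file size.
    for chunk in pd.read_csv(path, delimiter="|",
                             usecols=["MurexCounterpartyRef"],
                             chunksize=chunksize):
        refs = chunk["MurexCounterpartyRef"].dropna()
        yield from refs[refs != ""]

counterparties = set()  # dedupe incrementally instead of drop_duplicates
for name in fileNames:
    counterparties.update(
        iter_counterparty_refs(os.path.join(CoreMurexFilesLoc, name))
    )

counterpartiesList = sorted(counterparties)

Because each chunk is discarded as soon as its column has been consumed, memory use is bounded by the chunk size rather than the total file size. If more columns are needed downstream, widen usecols, or write each processed chunk out incrementally (for example with to_csv in append mode) instead of concatenating them.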
