Error tokenizing data. C error: out of memory (pandas, Python, large CSV file)
I am trying to read four .txt files delimited by |.
Because one of them is over 1 GB, I found the "chunk" way of reading them, but I am getting "Error tokenizing data. C error: out of memory".
Does anyone know how I can fix this?
Here is my code:
import os
import re

import pandas as pd
from termcolor import colored

def findmefile(directory, containsInFilename):
    entity_filenames = {}
    for file in os.listdir(directory):
        if containsInFilename in file:
            if file[:5] == "Trade":
                entity_filenames["MHI"] = file
            else:
                entity_filenames[re.findall("(.*?)_", file)[0]] = file
    return entity_filenames
# Get the core Murex file names
mhi_tradeFiles = findmefile(CoreMurexFilesLoc, "Trade")
mhi_tradeCashFiles = findmefile(CoreMurexFilesLoc, "TradeCash_")
mheu_tradeFiles = findmefile(CoreMurexFilesLoc, "MHEU")
mheu_tradeCashFiles = findmefile(CoreMurexFilesLoc, "MHEU_TradeCash")
# Read the csv in chunks
mylist = []
size = 10**2

def chunk_read(fileName, entity):
    for chunk in pd.read_csv(
        CoreMurexFilesLoc + "\\" + fileName[entity],
        delimiter="|",
        low_memory=False,
        chunksize=size,
    ):
        mylist.append(chunk)
    return mylist
df_trade_mhi = pd.concat(chunk_read(mhi_tradeFiles, "MHI"))
df_trade_mheu = pd.concat(chunk_read(mheu_tradeFiles, "MHEU"))
df_tradeCash_mheu = pd.concat(chunk_read(mheu_tradeCashFiles, "MHEU"))
df_tradeCash_mhi = pd.concat(chunk_read(mhi_tradeCashFiles, "MHI"))
df_trades = pd.concat(
    [df_trade_mheu, df_trade_mhi, df_tradeCash_mheu, df_tradeCash_mhi]
)
del df_trade_mhi
del df_tradeCash_mhi
del df_trade_mheu
del df_tradeCash_mheu
# Drop any blank fields and duplicates
nan_value = float("NaN")
df_trades.replace("", nan_value, inplace=True)
df_trades.dropna(subset=["MurexCounterpartyRef"], inplace=True)
df_trades.drop_duplicates(subset=["MurexCounterpartyRef"], inplace=True)
counterpartiesList = df_trades["MurexCounterpartyRef"].tolist()
print(colored('All Core Murex trade and tradeCash data loaded.', "green"))
Error:
Traceback (most recent call last):
  File "h:\DESKTOP\test_check\check_securityPrices.py", line 52, in <module>
    df_tradeCash_mhi = pd.concat(chunk_read(mhi_tradeCashFiles, "MHI"))
  File "h:\DESKTOP\test_check\check_securityPrices.py", line 39, in chunk_read
    for chunk in pd.read_csv(
  File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1024, in __next__
    return self.get_chunk()
  File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1074, in get_chunk
    return self.read(nrows=size)
  File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 228, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 783, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
I think the problem is clear: you are running out of memory because you are trying to load all of this data into memory at once and then process it.
You need to either:
1. throw more memory at the problem, or
2. rewrite the process so the data is handled chunk by chunk, in a streaming/pipeline fashion, without ever holding all of it in memory at once.
The problem with the first approach is that it does not scale indefinitely and is expensive. The second approach is the right way, but needs more coding.
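A minimal sketch of the second approach: since the script above ultimately only needs the distinct `MurexCounterpartyRef` values, you can parse just that column and deduplicate chunk by chunk, so peak memory tracks the chunk size rather than the file size. The column name is taken from the question; the sample file and its contents are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Build a small "|"-delimited sample file standing in for the 1 GB inputs.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.txt")
with open(path, "w") as f:
    f.write("MurexCounterpartyRef|Amount\n")
    for i in range(1000):
        f.write(f"CP{i % 50}|{i}\n")  # 50 distinct counterparty refs

# Stream the file chunk by chunk, parse only the one needed column,
# and deduplicate as you go; each chunk is discarded after processing.
refs = set()
for chunk in pd.read_csv(
    path,
    delimiter="|",
    usecols=["MurexCounterpartyRef"],  # skip parsing every other column
    chunksize=100,
):
    chunk = chunk.dropna(subset=["MurexCounterpartyRef"])
    refs.update(chunk["MurexCounterpartyRef"])

counterpartiesList = sorted(refs)
print(len(counterpartiesList))  # 50
```

Note that, unlike `chunk_read` above, nothing here appends chunks to a list, so the DataFrames are never all alive at once.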
For a good reference on the generator/coroutine pipeline approach, check out any of David Beazley's PyCon talks.
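The pipeline style from those talks can be sketched like this: each stage is a generator that consumes and yields chunks, so no stage ever holds more than one chunk. The stage names and the tiny in-memory sample below are made up for illustration.

```python
import io

import pandas as pd

def read_chunks(source, chunksize=100):
    """Yield DataFrame chunks from a '|'-delimited source."""
    yield from pd.read_csv(source, delimiter="|", chunksize=chunksize)

def drop_blank(chunks, col):
    """Drop rows where `col` is missing, chunk by chunk."""
    for chunk in chunks:
        yield chunk.dropna(subset=[col])

def unique_values(chunks, col):
    """Fold the stream into a set of distinct values."""
    seen = set()
    for chunk in chunks:
        seen.update(chunk[col])
    return seen

# A tiny stand-in for one of the large files; the last row has a blank ref.
data = io.StringIO("MurexCounterpartyRef|Amount\nA|1\nB|2\nA|3\n|4\n")
pipeline = drop_blank(read_chunks(data, chunksize=2), "MurexCounterpartyRef")
print(sorted(unique_values(pipeline, "MurexCounterpartyRef")))  # ['A', 'B']
```

Because the stages compose lazily, adding another transformation (say, filtering by entity) is just one more generator in the chain.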