pandas 0.18: out of memory error when reading CSV file with categoricals
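I am trying to read a 3GB file (2.5 million rows, mostly categorical (string) data) into a Pandas dataframe with the read_csv function, and I get the error: out of memory.
I know that specifying dtype could improve reading the CSV, but there are too many columns in my data set, and I would like to load it first and then decide on the data types. The traceback is: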
Traceback (most recent call last):
File "/home/a/Dropbox/Programming/Python/C and d/main.com.py", line 9, in <module>
preprocessing()
File "/home/a/Dropbox/Programming/Python/C and d/main.com.py", line 5, in preprocessing
df = pd.read_csv(filepath_or_buffer = file_path, sep ='\t', low_memory = False)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 285, in _read
return parser.read()
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 747, in read
ret = self._engine.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1197, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:8011)
File "pandas/parser.pyx", line 857, in pandas.parser.TextReader._read_rows (pandas/parser.c:9140)
File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: out of memory
My code:
import pandas as pd

def preprocessing():
    file_path = r'/home/a/Downloads/main_query.txt'
    df = pd.read_csv(filepath_or_buffer = file_path, sep ='\t', low_memory = False)
The code above produced the error message I posted above.
I then tried removing low_memory = False, and everything worked; it only issued a warning:
sys:1: DtypeWarning: Columns (17,20,23,24,33,44,58,118,134,
135,137,142,145,146,147) have mixed types.
Specify dtype option on import or set low_memory=False.
UPDATE: In Pandas 0.19.0 it should be possible to specify the categorical dtype when using the read_csv() method:
pd.read_csv(filename, dtype={'col1': 'category'})
So you could try using pandas 0.19.0 RC1.
Old answer:
You can read your CSV in chunks and concatenate each chunk to the resulting DF at every step:
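To illustrate why parsing straight to category can help with mostly-string data, here is a minimal sketch that compares the memory footprint of the same column read with and without dtype='category'. The sample data, the column names col1 and col2, and the variable names are hypothetical stand-ins for the real file, and the exact byte counts will vary by pandas version.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for the real 3GB file.
rows = ["col1\tcol2"] + [f"{'red' if i % 2 else 'blue'}\t{i}" for i in range(1000)]
tsv = "\n".join(rows)

# Default parsing keeps one string value per row.
df_plain = pd.read_csv(io.StringIO(tsv), sep="\t")

# Parsing to category keeps each distinct string once, plus small integer codes.
df_cat = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"col1": "category"})

mem_plain = df_plain["col1"].memory_usage(deep=True)
mem_cat = df_cat["col1"].memory_usage(deep=True)
print("plain:", mem_plain, "bytes; category:", mem_cat, "bytes")
```

The saving depends on how many distinct values a column has: a column where nearly every value is unique gains little or nothing from category.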
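import numpy as np
import pandas as pd

chunksize = 10**5  # rows per chunk
df = pd.DataFrame()
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(filename,
                         dtype={'col1': np.int8, 'col2': np.int32, ...},
                         chunksize=chunksize):
    df = pd.concat([df, chunk], ignore_index=True)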
NOTE: the dtype parameter is not supported with engine='python'.
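A variant of the chunked approach, sketched here as an assumption rather than as part of the original answer, collects the chunks in a list and concatenates once at the end. Calling pd.concat inside the loop copies the growing frame on every iteration, while a single concat copies each chunk only once. The in-memory CSV and the chunk size of 250 are hypothetical and chosen only to keep the example small.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical small file standing in for the large CSV.
csv_text = "col1,col2\n" + "\n".join(f"{i % 100},{i}" for i in range(1000))

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text),
                         dtype={"col1": np.int8, "col2": np.int32},
                         chunksize=250):
    # Any per-chunk filtering or downcasting could happen here.
    chunks.append(chunk)

# Single concatenation after all chunks have been read.
df = pd.concat(chunks, ignore_index=True)
print(df.dtypes)
```

Note that the complete frame still has to fit in memory at the end; chunking mainly lowers the peak used while parsing and gives a place to shrink each piece before it is kept.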
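This question is a duplicate. A few pointers:
Check how much memory a dataframe actually uses with df.info(memory_usage='deep') or df.memory_usage(deep=True).
Specify dtypes when reading: pd.read_csv(..., dtype={'foo': 'category', 'bar': 'category', ...})
Read only the columns you need: usecols = ['foo', 'bar', 'baz']
Read only a subset of the rows first: nrows=1e5 (or see also skiprows=...)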
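The options listed above (memory_usage, dtype, usecols, nrows) can be combined into a two-pass workflow, which fits the goal of looking at the data before deciding on dtypes. The following is a minimal sketch under stated assumptions: the in-memory data, the extra column qux, and the 50% uniqueness threshold for picking category columns are all hypothetical and not taken from the answer.

```python
import io
import pandas as pd

# Hypothetical data; foo, bar, baz follow the answer, qux is an extra column we skip.
lines = ["foo\tbar\tbaz\tqux"] + [f"a{i % 3}\tb{i % 2}\t{i}\tx{i}" for i in range(500)]
tsv = "\n".join(lines)
wanted = ["foo", "bar", "baz"]

# Pass 1: read only the first rows to inspect dtypes and memory cheaply.
sample = pd.read_csv(io.StringIO(tsv), sep="\t", nrows=100)
print(sample.memory_usage(deep=True))

# Treat non-numeric columns with few distinct values as category candidates.
# The 0.5 threshold is an arbitrary assumption for this sketch.
dtypes = {
    col: "category"
    for col in wanted
    if not pd.api.types.is_numeric_dtype(sample[col])
    and sample[col].nunique() < 0.5 * len(sample)
}

# Pass 2: read the full data, keeping only the wanted columns with the chosen dtypes.
df = pd.read_csv(io.StringIO(tsv), sep="\t", usecols=wanted, dtype=dtypes)
print(df.dtypes)
```

A sample taken from the top of the file may not be representative, so the inferred dtypes are worth checking against a second sample taken with skiprows.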