Why is int conversion so much slower than float in pandas?
I have a 4 GB CSV file with strictly integer data that I want to read into a pandas DataFrame. Native read_csv consumes all RAM (64 GB) and fails with a MemoryError. With an explicit dtype, it just takes forever (I tried both int and float types).

So I wrote my own reader:
def read_csv(fname):
    import csv
    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # first row holds the column names
    dftype = np.float32
    df = pd.DataFrame(0, dtype=dftype, columns=names, index=names)
    for row in reader:
        tag = row[0]
        df.loc[tag] = np.array(row[1:], dtype=dftype)
    return df
Problem: the line df.loc[tag] = np.array(row[1:], dtype=dftype) is ~1000 times slower if dftype is np.int32 (~20 sec per line), so I ended up using np.float64 and return df.astype(np.int32) (~4 minutes). I also tried converting in pure Python ([int(v) for v in row[1:]] / [float(v) for v in row[1:]]) with the same result.

Why could it be so?
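The slowdown can be checked in isolation with a minimal sketch (sizes shrunk for illustration; absolute timings will vary with the pandas version):

```python
import time

import numpy as np
import pandas as pd

def time_loc_assignment(dtype, n=2000):
    """Time a single df.loc[tag] = <ndarray> row assignment for the given dtype."""
    names = [str(i) for i in range(n)]
    df = pd.DataFrame(0, dtype=dtype, columns=names, index=names)
    row = np.arange(n, dtype=dtype)
    start = time.perf_counter()
    df.loc[names[0]] = row
    return time.perf_counter() - start

print("int32:   %.4fs" % time_loc_assignment(np.int32))
print("float32: %.4fs" % time_loc_assignment(np.float32))
```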
UPD: I see the same behavior on Python 2.7 and 3.5.
UPDATE: my notebook has 16 GB of RAM, so I'll test it with a 4× smaller DF (64 GB / 16 GB = 4):

Setup:
In [1]: df = pd.DataFrame(np.random.randint(0, 10*6, (12000, 47395)), dtype=np.int32)
In [2]: df.shape
Out[2]: (12000, 47395)
In [3]: %timeit -n 1 -r 1 df.to_csv('c:/tmp/big.csv', chunksize=1000)
1 loop, best of 1: 5min 34s per loop
Let's also save this DF in Feather format:
In [4]: import feather
In [6]: df = df.copy()
In [7]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'c:/tmp/big.feather')
1 loop, best of 1: 8.41 s per loop # yay, it's a bit faster...
In [8]: df.shape
Out[8]: (12000, 47395)
In [9]: del df
and read it back:
In [10]: %timeit -n 1 -r 1 df = feather.read_dataframe('c:/tmp/big.feather')
1 loop, best of 1: 17.4 s per loop # reading is reasonably fast as well
Reading from the CSV file in chunks is much slower, but it still doesn't give me a MemoryError:
In [2]: %%timeit -n 1 -r 1
...: df = pd.DataFrame()
...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000):
...: df = pd.concat([df, chunk])
...: print(df.shape)
...: print(df.dtypes.unique())
...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int64')]
1 loop, best of 1: 9min 25s per loop
Now let's specify dtype=np.int32 explicitly:
In [1]: %%timeit -n 1 -r 1
...: df = pd.DataFrame()
...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
...: df = pd.concat([df, chunk])
...: print(df.shape)
...: print(df.dtypes.unique())
...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int32')]
1 loop, best of 1: 10min 38s per loop
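As an aside, growing the DataFrame with pd.concat inside the loop re-copies everything read so far on every iteration, so the loop is quadratic in the number of chunks. Concatenating all the chunks once at the end should be noticeably faster (a sketch, assuming the same file layout):

```python
import numpy as np
import pandas as pd

def read_csv_chunked(path_or_buf, chunksize=1000):
    """Read a CSV in chunks, concatenating once at the end (each row copied once)."""
    reader = pd.read_csv(path_or_buf, index_col=0, chunksize=chunksize,
                         dtype=np.int32)
    return pd.concat(reader)
```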
Testing HDF5 storage:
In [10]: %timeit -n 1 -r 1 df.to_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 22.5 s per loop
In [11]: del df
In [12]: %timeit -n 1 -r 1 df = pd.read_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 1.04 s per loop
If you have a chance to change your storage file format, then by all means don't use CSV files; use HDF5 (.h5) or Feather format...
OLD answer:
I would simply use the native pandas read_csv() method:
chunksize = 10**6
reader = pd.read_csv(filename, index_col=0, chunksize=chunksize)
df = pd.concat([chunk for chunk in reader], ignore_index=True)
From your code:

tag = row[0]
df.loc[tag] = np.array(row[1:], dtype=dftype)

it looks like you want to use the first column of your CSV file as an index, hence: index_col=0
I suggest you use a NumPy array for this, for example:
def read_csv(fname):
    import csv
    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # first row holds the tags
    n = len(names)
    data = np.empty((n, n), np.int32)
    tag_map = {name: i for i, name in enumerate(names)}
    for row in reader:
        tag = row[0]
        data[tag_map[tag], :] = row[1:]
    return names, data
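If a labeled DataFrame is still wanted afterwards, the filled array can be wrapped in one step, which avoids the per-row dtype handling entirely (a sketch with stand-in tags in place of the real CSV header):

```python
import numpy as np
import pandas as pd

names = ["a", "b", "c"]                       # stand-ins for the CSV tags
data = np.arange(9, dtype=np.int32).reshape(3, 3)

# One construction from a homogeneous ndarray; no per-row assignments.
df = pd.DataFrame(data, index=names, columns=names)
print(df.loc["b", "c"])                       # prints 5
```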
I don't know why int32 is slower than float32, but a DataFrame stores its data column-wise, so setting an element of every column with df.loc[tag] = ... is slow.
If you want labeled access, you can use xarray:
import xarray
d = xarray.DataArray(data, [("r", names), ("c", names)])
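Label-based lookups then work much like DataFrame.loc (a small usage sketch; requires the xarray package, and the tags here are stand-ins):

```python
import numpy as np
import xarray

names = ["a", "b", "c"]                       # stand-in tags
data = np.arange(9, dtype=np.int32).reshape(3, 3)

d = xarray.DataArray(data, [("r", names), ("c", names)])
value = d.loc["b", "c"]                       # label-based on both axes
print(int(value))                             # prints 5
```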