
Why is int conversion so much slower than float in pandas?

I have a 4 GB CSV file with strictly integer data that I want to read into a pandas DataFrame. Native read_csv consumes all the RAM (64 GB) and fails with a MemoryError. With an explicit dtype it just takes forever (I tried both int and float types).
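For reference, a minimal sketch of the kind of call that was failing (the file name and the assumption that the first column holds the row labels are mine, not from the original post):

import numpy as np
import pandas as pd

# hypothetical file name; index_col=0 assumes the first CSV column holds row labels
df = pd.read_csv('big.csv', index_col=0, dtype=np.int32)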

So, I wrote my own reader:

import numpy as np
import pandas as pd

def read_csv(fname):
    import csv
    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # header row; the first cell is the index name
    dftype = np.float32
    # square DataFrame: the same labels are used for rows and columns
    df = pd.DataFrame(0, dtype=dftype, columns=names, index=names)
    for row in reader:
        tag = row[0]  # row label from the first column
        df.loc[tag] = np.array(row[1:], dtype=dftype)
    return df

Problem: the line df.loc[tag] = np.array(row[1:], dtype=dftype) is ~1000 times slower when dftype is np.int32 (~20 s per line), so I ended up using np.float64 and returning df.astype(np.int32) (~4 minutes in total). I also tried doing the conversion in Python ([int(v) for v in row[1:]] / [float(v) for v in row[1:]]) with the same result.
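For anyone who wants to compare the two cases on their own machine, here is a small standalone timing sketch; the shape and labels are arbitrary, and the numbers will depend on the pandas version:

import time

import numpy as np
import pandas as pd

n = 2000
names = [str(i) for i in range(n)]
row = np.arange(n)

for dftype in (np.float32, np.int32):
    df = pd.DataFrame(0, dtype=dftype, columns=names, index=names)
    t0 = time.time()
    df.loc[names[0]] = row.astype(dftype)  # single-row assignment, as in the reader above
    print(dftype.__name__, time.time() - t0)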

Why could that be?

UPD: I see the same behavior on Python 2.7 and 3.5.

UPDATE: my notebook has 16 GB of RAM, so I'll test it with a DataFrame that is 4 times smaller (64 GB / 16 GB = 4):

Setup:

In [1]: df = pd.DataFrame(np.random.randint(0, 10*6, (12000, 47395)), dtype=np.int32)

In [2]: df.shape
Out[2]: (12000, 47395)

In [3]: %timeit -n 1 -r 1 df.to_csv('c:/tmp/big.csv', chunksize=1000)
1 loop, best of 1: 5min 34s per loop

Let's also save this DF in Feather format:

In [4]: import feather

In [6]: df = df.copy()

In [7]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'c:/tmp/big.feather')
1 loop, best of 1: 8.41 s per loop  # yay, it's a bit faster...

In [8]: df.shape
Out[8]: (12000, 47395)

In [9]: del df

and read it back:

In [10]: %timeit -n 1 -r 1 df = feather.read_dataframe('c:/tmp/big.feather')
1 loop, best of 1: 17.4 s per loop  # reading is reasonably fast as well
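As an aside, newer pandas versions expose Feather directly (backed by pyarrow), so the standalone feather module is no longer required; a sketch, assuming pyarrow is installed:

df.to_feather('c:/tmp/big.feather')        # may require a default RangeIndex (reset_index() first otherwise)
df = pd.read_feather('c:/tmp/big.feather')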

Reading from the CSV file in chunks is much slower, but at least it doesn't give me a MemoryError:

In [2]: %%timeit -n 1 -r 1
   ...: df = pd.DataFrame()
   ...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000):
   ...:     df = pd.concat([df, chunk])
   ...:     print(df.shape)
   ...: print(df.dtypes.unique())
   ...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int64')]
1 loop, best of 1: 9min 25s per loop

Now let's specify dtype=np.int32 explicitly:

In [1]: %%timeit -n 1 -r 1
   ...: df = pd.DataFrame()
   ...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
   ...:     df = pd.concat([df, chunk])
   ...:     print(df.shape)
   ...: print(df.dtypes.unique())
   ...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int32')]
1 loop, best of 1: 10min 38s per loop
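Part of that time is spent re-concatenating the growing frame on every iteration; a variant that collects the chunks first and concatenates once at the end avoids that repeated copying (not timed here):

chunks = []
for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
    chunks.append(chunk)
df = pd.concat(chunks)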

Testing HDF5 storage:

In [10]: %timeit -n 1 -r 1 df.to_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 22.5 s per loop

In [11]: del df

In [12]: %timeit -n 1 -r 1 df = pd.read_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 1.04 s per loop
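If the frame may eventually outgrow RAM, the HDF5 table format also allows partial reads; complevel/complib are optional compression settings, and the values below are just an example:

df.to_hdf('c:/tmp/big.h5', 'test', format='table', complevel=5, complib='blosc')
part = pd.read_hdf('c:/tmp/big.h5', 'test', start=0, stop=1000)  # read only the first 1000 rows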

Conclusion:

If you have a chance to change your storage file format, by all means don't use CSV files; use HDF5 (.h5) or Feather instead.

OLD answer:

I would simply use the native pandas read_csv() method:

chunksize = 10**6
reader = pd.read_csv(filename, index_col=0, chunksize=chunksize)
df = pd.concat([chunk for chunk in reader], ignore_index=True)

From your code:

tag = row[0]

df.loc[tag] = np.array(row[1:], dtype=dftype)

It looks like you want to use the first column in your CSV file as the index, hence index_col=0.

I suggest you use a plain NumPy array for this, for example:

import numpy as np

def read_csv(fname):
    import csv
    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # header row: column labels
    n = len(names)
    data = np.empty((n, n), np.int32)  # preallocate the full square array
    tag_map = {name: i for i, name in enumerate(names)}
    for row in reader:
        tag = row[0]
        data[tag_map[tag], :] = row[1:]  # the strings are cast to int32 on assignment
    return names, data
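If a labeled DataFrame is still needed at the end, it can be built once from the result; a hypothetical usage example:

names, data = read_csv('big.csv')
df = pd.DataFrame(data, index=names, columns=names)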

I don't know why int32 is slower than float32, but a DataFrame stores its data column-wise, so setting one element in every column via df.loc[tag] = ... is slow.

If you want label-based access, you can use xarray:

import xarray
d = xarray.DataArray(data, [("r", names), ("c", names)])
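Label-based lookups then work much like .loc in pandas (the labels below are placeholders):

value = d.loc["row_label", "col_label"]  # single element by row/column labels
row = d.sel(r="row_label")               # whole row along the "r" dimension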
