Why is int conversion so much slower than float in pandas?
I have a 4 GB CSV file with strictly integer data that I want to read into a pandas DataFrame. Native read_csv consumes all RAM (64 GB) and fails with a MemoryError. With an explicit dtype, it just takes forever (I tried both int and float types).

So I wrote my own reader:
def read_csv(fname):
    import csv
    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # first row holds the column names
    dftype = np.float32
    df = pd.DataFrame(0, dtype=dftype, columns=names, index=names)
    for row in reader:
        tag = row[0]
        df.loc[tag] = np.array(row[1:], dtype=dftype)
    return df
Problem: the line df.loc[tag] = np.array(row[1:], dtype=dftype) is ~1000 times slower if dftype is np.int32 (~20 sec per line), so I ended up using np.float64 and return df.astype(np.int32) (~4 minutes). I also tried converting in pure Python ([int(v) for v in row[1:]] / [float(v) for v in row[1:]]) with the same result.

Why could it be so?
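The slowdown can be checked in isolation with a minimal sketch (sizes shrunk for illustration; absolute timings will vary with the pandas version):

```python
import time

import numpy as np
import pandas as pd

def time_loc_assignment(dtype, n=2000):
    """Time a single df.loc[tag] = <ndarray> row assignment for the given dtype."""
    names = [str(i) for i in range(n)]
    df = pd.DataFrame(0, dtype=dtype, columns=names, index=names)
    row = np.arange(n, dtype=dtype)
    start = time.perf_counter()
    df.loc[names[0]] = row
    return time.perf_counter() - start

print("int32:   %.4fs" % time_loc_assignment(np.int32))
print("float32: %.4fs" % time_loc_assignment(np.float32))
```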
UPD: I see the same behavior on Python 2.7 and 3.5.
UPDATE: my notebook has 16 GB of RAM, so I'll test it with a 4× smaller DF (64 GB / 16 GB = 4):

Setup:
In [1]: df = pd.DataFrame(np.random.randint(0, 10*6, (12000, 47395)), dtype=np.int32)
In [2]: df.shape
Out[2]: (12000, 47395)
In [3]: %timeit -n 1 -r 1 df.to_csv('c:/tmp/big.csv', chunksize=1000)
1 loop, best of 1: 5min 34s per loop
Let's also save this DF in Feather format:
In [4]: import feather
In [6]: df = df.copy()
In [7]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'c:/tmp/big.feather')
1 loop, best of 1: 8.41 s per loop # yay, it's a bit faster...
In [8]: df.shape
Out[8]: (12000, 47395)
In [9]: del df
and read it back:
In [10]: %timeit -n 1 -r 1 df = feather.read_dataframe('c:/tmp/big.feather')
1 loop, best of 1: 17.4 s per loop # reading is reasonably fast as well
Reading from the CSV file in chunks is much slower, but it still doesn't give me a MemoryError:
In [2]: %%timeit -n 1 -r 1
...: df = pd.DataFrame()
...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000):
...: df = pd.concat([df, chunk])
...: print(df.shape)
...: print(df.dtypes.unique())
...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int64')]
1 loop, best of 1: 9min 25s per loop
Now let's specify dtype=np.int32 explicitly:
In [1]: %%timeit -n 1 -r 1
...: df = pd.DataFrame()
...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
...: df = pd.concat([df, chunk])
...: print(df.shape)
...: print(df.dtypes.unique())
...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int32')]
1 loop, best of 1: 10min 38s per loop
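As an aside, growing the DataFrame with pd.concat inside the loop re-copies everything read so far on every iteration, so the loop is quadratic in the number of chunks. Concatenating all the chunks once at the end should be noticeably faster (a sketch, assuming the same file layout):

```python
import numpy as np
import pandas as pd

def read_csv_chunked(path_or_buf, chunksize=1000):
    """Read a CSV in chunks, concatenating once at the end (each row copied once)."""
    reader = pd.read_csv(path_or_buf, index_col=0, chunksize=chunksize,
                         dtype=np.int32)
    return pd.concat(reader)
```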
Testing HDF5 storage:
In [10]: %timeit -n 1 -r 1 df.to_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 22.5 s per loop
In [11]: del df
In [12]: %timeit -n 1 -r 1 df = pd.read_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 1.04 s per loop
If you have a chance to change your storage file format, then by all means don't use CSV files; use HDF5 (.h5) or Feather format...
OLD answer:
I would simply use the native pandas read_csv() method:
chunksize = 10**6
reader = pd.read_csv(filename, index_col=0, chunksize=chunksize)
df = pd.concat([chunk for chunk in reader], ignore_index=True)
From your code:

tag = row[0]
df.loc[tag] = np.array(row[1:], dtype=dftype)

it looks like you want to use the first column of your CSV file as an index, hence: index_col=0
I suggest you use a NumPy array for this, for example:
def read_csv(fname):
    import csv
    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # first row holds the tags
    n = len(names)
    data = np.empty((n, n), np.int32)
    tag_map = {name: i for i, name in enumerate(names)}
    for row in reader:
        tag = row[0]
        data[tag_map[tag], :] = row[1:]
    return names, data
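If a labeled DataFrame is still wanted afterwards, the filled array can be wrapped in one step, which avoids the per-row dtype handling entirely (a sketch with stand-in tags in place of the real CSV header):

```python
import numpy as np
import pandas as pd

names = ["a", "b", "c"]                       # stand-ins for the CSV tags
data = np.arange(9, dtype=np.int32).reshape(3, 3)

# One construction from a homogeneous ndarray; no per-row assignments.
df = pd.DataFrame(data, index=names, columns=names)
print(df.loc["b", "c"])                       # prints 5
```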
I don't know why int32 is slower than float32, but a DataFrame stores its data column-wise, so setting an element of every column with df.loc[tag] = ... is slow.
If you want labeled access, you can use xarray:
import xarray
d = xarray.DataArray(data, [("r", names), ("c", names)])
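Label-based lookups then work much like DataFrame.loc (a small usage sketch; requires the xarray package, and the tags here are stand-ins):

```python
import numpy as np
import xarray

names = ["a", "b", "c"]                       # stand-in tags
data = np.arange(9, dtype=np.int32).reshape(3, 3)

d = xarray.DataArray(data, [("r", names), ("c", names)])
value = d.loc["b", "c"]                       # label-based on both axes
print(int(value))                             # prints 5
```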