Fastest way to cast all dataframe columns to float - pandas astype slow
Is there a faster way to cast all columns of a pandas dataframe to a single type? This seems particularly slow:
df = df.apply(lambda x: x.astype(np.float64), axis=1)
I suspect there's not much I can do about it because of the memory allocation overhead of numpy.ndarray.astype.
I've also tried pd.to_numeric, but it casts a few of my columns to int types instead.
No need for apply; just use DataFrame.astype directly.
df.astype(np.float64)
Using apply also gives you a pretty bad performance hit, especially row-wise (axis=1).
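As a minimal sketch of the direct call, the snippet below casts a small mixed-dtype frame in one step. The frame and its column names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame used only for illustration.
df = pd.DataFrame({
    "a": [1, 2, 3],            # int64
    "b": [0.5, 1.5, 2.5],      # float64
    "c": [True, False, True],  # bool
})

# One call converts every column; no apply needed.
converted = df.astype(np.float64)
print(converted.dtypes)
```

Unlike pd.to_numeric, which infers a dtype per column from the data, astype uses exactly the dtype you pass, so integer-valued columns come out as float64 too.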
Example
df = pd.DataFrame(np.arange(10**7).reshape(10**4, 10**3))
%timeit df.astype(np.float64)
1 loop, best of 3: 288 ms per loop
%timeit df.apply(lambda x: x.astype(np.float64), axis=0)
1 loop, best of 3: 748 ms per loop
%timeit df.apply(lambda x: x.astype(np.float64), axis=1)
1 loop, best of 3: 2.95 s per loop
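%timeit is an IPython magic, so the comparison above will not run in a plain script. A rough equivalent with the standard-library timeit module might look like the sketch below. The frame is deliberately smaller than in the example so it finishes quickly, and the `candidates` and `timings` names are just illustrative; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than in the example above so the script finishes quickly.
df = pd.DataFrame(np.arange(10**5).reshape(10**3, 10**2))

# The three approaches being compared.
candidates = {
    "astype": lambda: df.astype(np.float64),
    "apply axis=0": lambda: df.apply(lambda x: x.astype(np.float64), axis=0),
    "apply axis=1": lambda: df.apply(lambda x: x.astype(np.float64), axis=1),
}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats with one call each, similar in spirit to %timeit's "best of 3".
    timings[name] = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{name}: {timings[name] * 1e3:.1f} ms")
```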
One efficient way would be to work with the underlying array data and wrap the result back into a dataframe, like so:
pd.DataFrame(df.values.astype(np.float64))
Runtime test:
In [144]: df = pd.DataFrame(np.random.randint(11,99,(5000,5000)))
In [145]: %timeit df.astype(np.float64) # @Mitch's soln
10 loops, best of 3: 121 ms per loop
In [146]: %timeit pd.DataFrame(df.values.astype(np.float64))
10 loops, best of 3: 42.5 ms per loop
Wrapping the array back into a dataframe wasn't that costly:
In [147]: %timeit df.values.astype(np.float64)
10 loops, best of 3: 42.3 ms per loop # Wrapping in a dataframe cost about 0.2 ms
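One caveat with the array round-trip: pd.DataFrame(df.values.astype(np.float64)) builds a new frame with default integer index and column labels, so any custom labels are lost. A minimal sketch that passes the originals back in is shown below. The labels are hypothetical, and it assumes all columns already share a numeric dtype, since a mixed-dtype frame would yield an object array from df.values first.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with non-default labels, for illustration only.
df = pd.DataFrame(
    np.random.randint(11, 99, (4, 3)),
    index=["w", "x", "y", "z"],
    columns=["a", "b", "c"],
)

# Cast the underlying array, then reattach the original labels.
as_float = pd.DataFrame(
    df.values.astype(np.float64),
    index=df.index,
    columns=df.columns,
)
print(as_float.dtypes)
```

Whether this still beats df.astype on a given pandas version is worth measuring rather than assuming; the timings above come from the original answer's environment.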