
pandas apply: differences between converting to int with np.int, lambda and astype()

Given a df:

df = pd.DataFrame(['0', '1', '2', '3'], columns=['a'])

What is the difference between using

 df['b'] = df['a'].apply(np.int)

,

df['b'] = df['a'].apply(lambda x : int(x))

and

df['b'] = df['a'].astype(int)

?

I'm aware that all three give the same result, but are there any differences?

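np.int is an alias for the built-in int. (The alias was deprecated in NumPy 1.20 and removed in NumPy 1.24, so on recent versions use int directly.)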

You can test this by running:

import numpy as np
print(int == np.int)

which prints True on NumPy versions that still provide the alias.
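Since the alias may be missing from the installed NumPy, here is a minimal sketch of the same check that should run either way. The variable name `alias` is only illustrative, and the use of `getattr` with a default is my own choice, not part of the original answer.

```python
import numpy as np

# getattr with a default returns None instead of raising AttributeError
# when the installed NumPy no longer ships the np.int alias.
alias = getattr(np, "int", None)

if alias is None:
    print("np.int is not available; use the built-in int instead")
else:
    print(alias is int)
```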

Also, consider checking out this question, which has a very thorough explanation of the topic.

The code below uses the pandas apply function to call NumPy's int cast on each element in turn, which is the same as Python's int cast. So these two are effectively the same.

df['b'] = df['a'].apply(np.int)
df['b'] = df['a'].apply(lambda x : int(x))

The astype function, however, casts a Series to the specified dtype, here int, which pandas usually maps to int64 (the exact width can depend on the platform).

df['b'] = df['a'].astype(int)
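As a rough check of the claim that the results agree, the sketch below runs the three variants on the question's df and compares values and dtypes. It uses the built-in int in place of np.int (an assumption so that it runs on current NumPy), and the names `via_apply`, `via_lambda` and `via_astype` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(['0', '1', '2', '3'], columns=['a'])

via_apply = df['a'].apply(int)                # int() called once per element
via_lambda = df['a'].apply(lambda x: int(x))  # same call, wrapped in a lambda
via_astype = df['a'].astype(int)              # whole Series cast in one call

# Inspect the resulting dtypes and compare the values element by element.
print(via_apply.dtype, via_lambda.dtype, via_astype.dtype)
print((via_apply == via_astype).all(), (via_lambda == via_astype).all())
```

Comparing with `==` rather than `Series.equals` keeps the check about values, since `equals` also requires identical dtypes.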

astype is a vectorized function, and I would prefer it over the apply method, which performs poorly by comparison.

When you use apply, it works by looping over the data and converting each value to an integer one at a time. That makes it slower than astype.
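To make the looping concrete, here is a conceptual sketch, not pandas' actual implementation: the apply variant behaves roughly like a Python-level loop over the elements, while astype is a single call on the whole Series. The names `s`, `looped` and `casted` are illustrative.

```python
import pandas as pd

s = pd.Series(['0', '1', '2', '3'], name='a')

# Roughly what apply(int) amounts to: one Python-level int() call per element.
looped = pd.Series([int(x) for x in s], index=s.index, name=s.name)

# astype hands the whole column to pandas/NumPy in a single call.
casted = s.astype(int)

print((looped == casted).all())
```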

# np.arange is called directly; the pd.np shortcut was removed in pandas 2.0
df = pd.DataFrame(np.arange(10**7).reshape(10**4, 10**3)).astype(str)

# Performance
%timeit df[0].apply(np.int)
7.15 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df[0].apply(lambda x : int(x))
9.57 ms ± 405 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The two perform almost the same.

astype, by contrast, is optimized to run faster than apply.

%timeit df[0].astype(int)
1.94 ms ± 96.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And if you are looking for an even faster approach, opt for the vectorized approach that NumPy arrays provide.

%timeit df[0].values.astype(np.int)
1.26 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As you can see, the time difference is large: in these runs the NumPy cast is more than five times faster than apply.
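The %timeit lines above need IPython. As a sketch of how the same comparison could be rerun as a plain script, the snippet below uses the standard timeit module. It assumes a smaller column of 10**4 string values, the built-in int instead of np.int, and to_numpy() in place of .values; the `candidates` dict is illustrative. Timings will vary by machine and library version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# One column of 10**4 numeric strings (an assumed stand-in for df[0] above).
col = pd.Series(np.arange(10**4)).astype(str)

candidates = {
    "apply(int)": lambda: col.apply(int),
    "apply(lambda)": lambda: col.apply(lambda x: int(x)),
    "astype(int)": lambda: col.astype(int),
    "to_numpy().astype(int)": lambda: col.to_numpy().astype(int),
}

loops = 100
for label, fn in candidates.items():
    total = timeit.timeit(fn, number=loops)  # total seconds for all loops
    print(f"{label:>24}: {total / loops * 1e3:.3f} ms per loop")
```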
