
Pure Python faster than NumPy for data type conversion

Excuse my ignorance.

If NumPy provides vectorized operations that make computation faster, how is it that pure Python is almost 8 times faster for data type conversion?

For example:

import numpy as np

a = np.random.randint(0,500,100).astype(str)
b = np.random.randint(0,500,100).astype(str)
c = np.random.randint(0,500,100).astype(str)

def A(a,b,c):
    for i,j,k in zip(a,b,c):
        d,e,f = int(i), int(j), int(k)
        r = d+e-f
    return 

def B(a,b,c):
    for i,j,k in zip(a,b,c):
        d,e,f  = np.array([i,j,k]).astype(int)
        r = d+e-f
    return 

Then,

%%timeit 
A(a,b,c)

249 µs ± 3.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
B(a,b,c)

1.87 ms ± 4.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Thank you, Ariel

Yes, NumPy does provide vectorized operations that make computations faster than vanilla Python code. However, you aren't using them.

NumPy is intended to perform operations across entire datasets, not many repeated operations across chunks of a dataset. The latter causes iteration to be done at the Python level, which increases runtime.

Your primary issue is that the only "vectorized" operation you are using is astype, but you're applying it to three elements at a time and still looping just as much as the naive Python solution. Combine that with the additional overhead of creating NumPy arrays at each iteration of your loop, and it's no wonder your attempt with NumPy is slower.

On tiny datasets, Python can be faster, since NumPy has overhead from creating arrays, passing objects to and from lower-level libraries, etc. Let's take a look at the casting operation you are using on three elements at a time:

%timeit np.array(['1', '2', '3']).astype(int)
5.25 µs ± 89.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit np.array(['1', '2', '3'])
1.62 µs ± 42.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Over a quarter of the runtime comes just from allocating the array. Compare this to your pure Python version:

%timeit a, b, c = int('1'), int('2'), int('3')
659 ns ± 50.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So if you operate only on chunks of this size, Python will beat NumPy.
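If you want to reproduce this comparison outside IPython, here is a minimal sketch using the standard-library timeit module. The helper names convert_numpy and convert_python are mine, purely for illustration, and the numbers will differ from machine to machine.

```python
import timeit
import numpy as np

def convert_numpy(chunk):
    # Builds a new string array, then casts it: two allocations per call.
    return np.array(chunk).astype(int)

def convert_python(chunk):
    # Plain int() calls, with no array construction involved.
    return [int(s) for s in chunk]

chunk = ['1', '2', '3']
n = 10_000  # number of repetitions; an arbitrary choice for this sketch

# Average time per call, in seconds.
t_numpy = timeit.timeit(lambda: convert_numpy(chunk), number=n) / n
t_python = timeit.timeit(lambda: convert_python(chunk), number=n) / n

print(f"numpy:  {t_numpy * 1e6:.2f} us per call")
print(f"python: {t_python * 1e6:.2f} us per call")
```

Both helpers produce the same values; only the per-call overhead differs.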


But you have many more elements than just three, so NumPy can be used to speed up your code substantially, but you need to change your mindset about how you approach the problem. Instead of focusing on how the operation gets applied to individual scalars, think about how it gets applied to arrays.


To vectorize this problem, the general idea is:

  • Create a single array containing all your values.
  • Convert the entire array to int with a single astype call.
  • Take advantage of elementwise operations to apply your desired arithmetic to the array.

It ends up looking like this:

def vectorized(a, b, c):
    u = np.array([a, b, c]).astype(int)
    return u[0] + u[1] - u[2]
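To see what each step does, here is a small worked example on made-up inputs (the three short arrays are illustrative, not from the question). The function definition is repeated so the snippet runs on its own.

```python
import numpy as np

def vectorized(a, b, c):
    u = np.array([a, b, c]).astype(int)
    return u[0] + u[1] - u[2]

# Three small string arrays standing in for the real data.
a = np.array(['1', '2', '3'])
b = np.array(['10', '20', '30'])
c = np.array(['4', '5', '6'])

# Step 1: stacking gives one 2-D array, one row per input.
u = np.array([a, b, c])
print(u.shape)  # (3, 3)

# Steps 2 and 3: one astype call, then elementwise a + b - c.
result = vectorized(a, b, c)
print(result.tolist())  # [7, 17, 27]
```

Each output element is computed as a[i] + b[i] - c[i], for example 1 + 10 - 4 = 7, with no Python-level loop.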

Once you compare two approaches where NumPy is being used correctly, you will start to see large performance increases.

def python_loop(a, b, c):
    out = []
    for i,j,k in zip(a,b,c):
        d,e,f = int(i), int(j), int(k)
        out.append(d+e-f)
    return out

a, b, c = np.random.randint(0, 500, (3, 100_000)).astype(str)

In [255]: %timeit vectorized(a, b, c)
181 ms ± 6.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [256]: %timeit python_loop(a, b, c)
206 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> np.array_equal(python_loop(a, b, c), vectorized(a, b, c))
True

Converting strings to integers is not something NumPy does much faster than pure Python; as the timings show, the two are fairly close. However, with a vectorized approach, the comparison is at least much fairer.
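If you want to check where the time goes, one option is to time the string-to-int cast separately from the arithmetic. This is only a sketch: the helper names cast_only and arithmetic_only and the repetition count are my own choices. I would expect the cast to dominate, but measure on your own machine rather than taking that on faith.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
ints = rng.integers(0, 500, size=(3, 100_000))  # already-numeric data
strs = ints.astype(str)                         # same values as strings

def cast_only(s):
    # Only the string -> int conversion.
    return s.astype(int)

def arithmetic_only(u):
    # Only the elementwise arithmetic, on data that is already int.
    return u[0] + u[1] - u[2]

reps = 5  # arbitrary repetition count for this sketch
t_cast = timeit.timeit(lambda: cast_only(strs), number=reps) / reps
t_arith = timeit.timeit(lambda: arithmetic_only(ints), number=reps) / reps

print(f"cast:       {t_cast * 1e3:.3f} ms per call")
print(f"arithmetic: {t_arith * 1e3:.3f} ms per call")
```

If your data can be kept numeric from the start, the cast disappears and only the arithmetic cost remains.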
