
Vectorizing a very simple pandas lambda function in apply

pandas apply/map is my nemesis; even on small datasets it can be agonizingly slow. Below is a very simple example where the difference in speed is nearly three orders of magnitude. I create a Series with 1 million values and simply want to map values greater than .5 to 'Yes' and all other values to 'No'. How do I vectorize this or speed it up significantly?

import numpy as np
import pandas as pd

ser = pd.Series(np.random.rand(1000000))

%%timeit
# vectorized and fast
ser > .5

1000 loops, best of 3: 477 µs per loop

%%timeit
ser.map(lambda x: 'Yes' if x > .5 else 'No')

1 loop, best of 3: 255 ms per loop

np.where(cond, A, B) is the vectorized equivalent of A if cond else B:

import numpy as np
import pandas as pd
ser = pd.Series(np.random.rand(1000000))
mask = ser > 0.5
result = pd.Series(np.where(mask, 'Yes', 'No'))
expected = ser.map(lambda x: 'Yes' if x > .5 else 'No')
assert result.equals(expected)

In [77]: %timeit mask = ser > 0.5
1000 loops, best of 3: 1.44 ms per loop

In [76]: %timeit np.where(mask, 'Yes', 'No')
100 loops, best of 3: 14.8 ms per loop

In [73]: %timeit pd.Series(np.where(mask, 'Yes', 'No'))
10 loops, best of 3: 86.5 ms per loop

In [74]: %timeit ser.map(lambda x: 'Yes' if x > .5 else 'No')
1 loop, best of 3: 223 ms per loop
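The timings above rely on IPython's %timeit magic. For readers running a plain Python script, the comparison can be roughly reproduced with the standard-library timeit module. This is only a sketch: the helper names with_map and with_where, the smaller Series size, and the repeat counts are illustrative choices, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1,000,000 rows used above so the script finishes quickly.
ser = pd.Series(np.random.rand(100_000))


def with_map():
    # Element-wise Python lambda: one function call per row.
    return ser.map(lambda x: 'Yes' if x > .5 else 'No')


def with_where():
    # Vectorized: a single np.where call over the whole boolean mask.
    return pd.Series(np.where(ser > .5, 'Yes', 'No'), index=ser.index)


# Both approaches should label every row identically.
same_labels = (with_map().to_numpy() == with_where().to_numpy()).all()

# Best of 3 repeats, 5 calls each, reported per call in seconds.
map_time = min(timeit.repeat(with_map, number=5, repeat=3)) / 5
where_time = min(timeit.repeat(with_where, number=5, repeat=3)) / 5
print(f"map: {map_time * 1e3:.2f} ms, np.where: {where_time * 1e3:.2f} ms")
```

Taking the minimum over repeats mirrors the "best of 3" figure that %timeit printed in the transcripts above.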

Since this Series only has two distinct values, you might consider using a Categorical instead:

In [94]: cat = pd.Categorical.from_codes(codes=mask.astype(int), categories=['No', 'Yes']); cat
Out[94]: 
[Yes, No, Yes, No, No, ..., No, Yes, No, No, Yes]
Length: 1000000
Categories (2, object): [No, Yes]

In [95]: %timeit pd.Categorical.from_codes(codes=mask.astype(int), categories=['No', 'Yes'])
100 loops, best of 3: 6.26 ms per loop

Not only is this faster, it is also more memory efficient, since it avoids creating the array of strings. The category codes are an array of ints that index into the categories, so code 1 (True, i.e. a value greater than .5) maps to 'Yes' and code 0 maps to 'No':

In [96]: cat.codes
Out[96]: array([1, 0, 1, ..., 0, 0, 1], dtype=int8)

In [97]: cat.categories
Out[97]: Index(['No', 'Yes'], dtype='object')
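The memory claim can be checked with Series.memory_usage(deep=True). The sketch below is a rough illustration, not a benchmark: the variable names and the 100,000-row size are arbitrary, and the categories are listed as ['No', 'Yes'] so that code 1 (a True mask value) maps to 'Yes'. Exact byte counts depend on the pandas version and its string dtype.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.random.rand(100_000))
mask = ser > 0.5

# Plain strings: one string value stored per row.
labels = pd.Series(np.where(mask, 'Yes', 'No'))

# Categorical: code 0 -> 'No', code 1 -> 'Yes', so True (1) becomes 'Yes'.
cat = pd.Categorical.from_codes(
    codes=mask.to_numpy().astype(int), categories=['No', 'Yes']
)
cat_ser = pd.Series(cat)

# The two representations should agree row by row.
labels_match = (
    cat_ser.astype(object).to_numpy() == labels.astype(object).to_numpy()
).all()

# deep=True also counts the string payloads, not only references to them.
string_bytes = labels.memory_usage(deep=True)
cat_bytes = cat_ser.memory_usage(deep=True)
print(f"strings: {string_bytes:,} bytes, categorical: {cat_bytes:,} bytes")
```

The categorical stores one small integer code per row plus the two category labels once, which is why it should come out smaller than the per-row strings.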
