Vectorizing a very simple pandas lambda function

Question

pandas apply/map is my nemesis, and even on small datasets it can be agonizingly slow. Below is a very simple example where the difference in speed is nearly three orders of magnitude. I create a Series with 1 million values and simply want to map values greater than .5 to 'Yes' and all other values to 'No'. How do I vectorize this or speed it up significantly?

import numpy as np
import pandas as pd

ser = pd.Series(np.random.rand(1000000))

%%timeit
# vectorized and fast
ser > .5

1000 loops, best of 3: 477 µs per loop

%%timeit
# element-wise Python lambda, slow
ser.map(lambda x: 'Yes' if x > .5 else 'No')

1 loop, best of 3: 255 ms per loop

Answer 1

np.where(cond, A, B) is the vectorized equivalent of A if cond else B:

import numpy as np
import pandas as pd
ser = pd.Series(np.random.rand(1000000))
mask = ser > 0.5
result = pd.Series(np.where(mask, 'Yes', 'No'))
expected = ser.map(lambda x: 'Yes' if x > .5 else 'No')
assert result.equals(expected)
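To make the element-by-element behaviour easier to check by eye, here is a minimal sketch on a handful of hand-picked values. The values and the names `small`, `labels`, and `reference` are illustrative assumptions, not part of the original post. Note that 0.5 itself is not greater than 0.5, so both versions label it 'No'.

```python
import numpy as np
import pandas as pd

# Small hand-picked Series (illustrative values, not from the original post).
small = pd.Series([0.1, 0.5, 0.7, 0.49, 0.9])

# Vectorized: build the boolean mask once, then pick one label per element.
labels = pd.Series(np.where(small > 0.5, 'Yes', 'No'), index=small.index)

# Element-wise reference using the lambda from the question.
reference = small.map(lambda x: 'Yes' if x > 0.5 else 'No')

# Both approaches should agree on every element, including the 0.5 boundary.
print(labels.tolist())
print(reference.tolist())
```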

In [77]: %timeit mask = ser > 0.5
1000 loops, best of 3: 1.44 ms per loop

In [76]: %timeit np.where(mask, 'Yes', 'No')
100 loops, best of 3: 14.8 ms per loop

In [73]: %timeit pd.Series(np.where(mask, 'Yes', 'No'))
10 loops, best of 3: 86.5 ms per loop

In [74]: %timeit ser.map(lambda x: 'Yes' if x > .5 else 'No')
1 loop, best of 3: 223 ms per loop
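The timings above come from IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module. The Series size, the repeat counts, and the names `candidates` and `timings` are assumptions chosen so the script finishes quickly, and the absolute numbers will differ from those shown above depending on hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1,000,000 values used above so the script finishes quickly.
n = 100_000
ser = pd.Series(np.random.rand(n))
mask = ser > 0.5

# The three variants timed in the answer, wrapped as zero-argument callables.
candidates = {
    "np.where": lambda: np.where(mask, 'Yes', 'No'),
    "pd.Series(np.where)": lambda: pd.Series(np.where(mask, 'Yes', 'No')),
    "ser.map(lambda)": lambda: ser.map(lambda x: 'Yes' if x > 0.5 else 'No'),
}

# Best of 3 repeats, 5 calls per repeat, reported per call in milliseconds.
timings = {}
for name, func in candidates.items():
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    timings[name] = best * 1e3
    print(f"{name:>22}: {timings[name]:.3f} ms per call")
```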

Since this Series only takes two distinct values, you might consider using a Categorical instead:

In [94]: cat = pd.Categorical.from_codes(codes=mask.astype(int), categories=['No', 'Yes']); cat
Out[94]: 
[Yes, No, Yes, No, No, ..., No, Yes, No, No, Yes]
Length: 1000000
Categories (2, object): [No, Yes]

In [95]: %timeit pd.Categorical.from_codes(codes=mask.astype(int), categories=['No', 'Yes']); cat
100 loops, best of 3: 6.26 ms per loop

Not only is this faster, it is also more memory efficient, since it avoids creating the array of strings. The category codes are an array of ints that index into the categories, so False (code 0) maps to 'No' and True (code 1) maps to 'Yes':

In [96]: cat.codes
Out[96]: array([1, 0, 1, ..., 0, 0, 1], dtype=int8)

In [97]: cat.categories
Out[97]: Index(['No', 'Yes'], dtype='object')
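The following sketch checks two things about the Categorical approach: that it yields the same labels as np.where, and that it uses less memory. It assumes the categories are listed as ['No', 'Yes'], so that code 0 (False) maps to 'No' and code 1 (True) maps to 'Yes'. The seed, the Series size, and the variable names are illustrative assumptions, and the exact byte counts depend on the pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
ser = pd.Series(rng.random(100_000))
mask = ser > 0.5

# Codes index into categories: 0 (False) -> 'No', 1 (True) -> 'Yes'.
codes = mask.astype(int).to_numpy()
cat = pd.Categorical.from_codes(codes, categories=['No', 'Yes'])

# String labels built the np.where way, for comparison.
labels = np.where(mask, 'Yes', 'No')

# The Categorical should decode to exactly the same labels.
same = cat.tolist() == labels.tolist()

# Compare the memory footprint of the two Series representations, in bytes.
as_category = pd.Series(cat).memory_usage(deep=True)
as_strings = pd.Series(labels).memory_usage(deep=True)
print(same, as_category, as_strings)
```

Wrapping the Categorical in a Series, as done here for the memory comparison, gives a column with the `category` dtype that can be used like any other Series.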

Answer 1
6 · accepted · 2016-07-06 01:16:50
