Why is a list comprehension faster than apply in pandas?

Question

Using a list comprehension is much faster than a normal for loop. The reason usually given is that a list comprehension does not need to call append, which is understandable. But I have read in various places that list comprehensions are also faster than apply, and I have experienced that as well. What I don't understand is the internal mechanism that makes them so much faster than apply.

I know this has something to do with vectorization in NumPy, which is the underlying implementation of pandas DataFrames. But why list comprehensions beat apply is not clear to me: in a list comprehension we write an explicit for loop inside the list, whereas with apply we don't write any for loop at all (and I assume vectorization takes place there too).

Edit: adding code. This works on the Titanic dataset, where the title is extracted from the name: https://www.kaggle.com/c/titanic/data

%timeit train['NameTitle'] = train['Name'].apply(lambda x: 'Mrs.' if 'Mrs' in x else \
                                         ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else\
                                                ('Master' if 'Master' in x else 'None'))))

%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else 'Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else ('Master' if 'Master' in x else 'None')) for x in train['Name']]

Result: 782 µs ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

499 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Edit 2: While putting together a simple example to post on SO, I found that, surprisingly, the results reverse for the code below:

import pandas as pd
import timeit
df_test = pd.DataFrame()
tlist = []
tlist2 = []
for i in range (0,5000000):
  tlist.append(i)
  tlist2.append(i+5)
df_test['A'] = tlist
df_test['B'] = tlist2

display(df_test.head(5))


%timeit df_test['C'] = df_test['B'].apply(lambda x: x*2 if x%5==0 else x)
display(df_test.head(5))
%timeit df_test['C'] = [ x*2 if x%5==0 else x for x in df_test['B']]

display(df_test.head(5))

1 loop, best of 3: 2.14 s per loop

1 loop, best of 3: 2.24 s per loop

Edit 3: Some have suggested that apply is essentially a for loop. That does not seem to be the case: when I run this code with a for loop, it almost never ends. I had to stop it manually after 3-4 minutes, and it had not completed by then:

for row in df_test.itertuples():
  x = row.B
  if x%5==0:
    df_test.at[row.Index,'B'] = x*2

Running the above code takes around 23 seconds, but apply takes only 1.8 seconds. So what is the difference between the explicit loop over itertuples and apply?
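For reference, the three approaches compared in the edits above can be checked against each other on a small frame. This is only a sketch: the 8-row frame and the helper name `double_if_multiple_of_5` are made up for illustration, and the loop writes into a copy so that the original column is left untouched. It confirms that the variants compute the same values; it says nothing about their relative speed.

```python
import pandas as pd

# Small stand-in for df_test; the size is an arbitrary choice for illustration.
df = pd.DataFrame({"A": range(8), "B": range(5, 13)})

def double_if_multiple_of_5(x):
    # Hypothetical helper with the same logic as the lambda in Edit 2.
    return x * 2 if x % 5 == 0 else x

via_apply = df["B"].apply(double_if_multiple_of_5).tolist()
via_listcomp = [double_if_multiple_of_5(x) for x in df["B"]]

# Explicit loop that writes one cell at a time with .at, as in Edit 3.
looped = df.copy()
for row in df.itertuples():
    if row.B % 5 == 0:
        looped.at[row.Index, "B"] = row.B * 2
via_loop = looped["B"].tolist()

print(via_apply == via_listcomp == via_loop)
```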

Answer 1

There are a few reasons for the performance difference between apply and a list comprehension.

First of all, the list comprehension in your code doesn't make a function call on each iteration, while apply does. This makes a huge difference:

map_function = lambda x: 'Mrs.' if 'Mrs' in x else \
                 ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
                 ('Master' if 'Master' in x else 'None')))

%timeit train['NameTitle'] = [map_function(x) for x in train['Name']]
# 581 µs ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else \
                 ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
                 ('Master' if 'Master' in x else 'None'))) for x in train['Name']]
# 482 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
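The same effect can be reproduced without pandas or the Titanic data. The sketch below is an assumption-laden stand-in: the three sample names, the helper `title_of`, and the repeat counts are all made up for illustration, and the absolute timings will differ from machine to machine.

```python
import timeit

# Made-up sample data standing in for train['Name'].
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
] * 1000

def title_of(x):
    # Hypothetical helper; same kind of chained conditional as map_function.
    return "Mrs" if "Mrs" in x else ("Mr" if "Mr" in x else ("Miss" if "Miss" in x else "None"))

def with_call():
    # One Python function call per element.
    return [title_of(x) for x in names]

def inline():
    # Same expression written directly in the comprehension: no call per element.
    return ["Mrs" if "Mrs" in x else ("Mr" if "Mr" in x else ("Miss" if "Miss" in x else "None"))
            for x in names]

t_call = timeit.timeit(with_call, number=100)
t_inline = timeit.timeit(inline, number=100)
print(f"with function call: {t_call:.3f} s, inline: {t_inline:.3f} s")
```

Both variants produce identical lists; any difference in the two timings comes from the per-element call overhead.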

Secondly, apply does much more than a list comprehension. For example, it tries to find an appropriate dtype for the result. By disabling that behaviour you can see what impact it has:

%timeit train['NameTitle'] = train['Name'].apply(map_function)
# 660 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['NameTitle'] = train['Name'].apply(map_function, convert_dtype=False)
# 626 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
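To make the extra work concrete, here is a small sketch of what apply hands back compared with a list comprehension. The sample Series and its index labels are made up for illustration.

```python
import pandas as pd

# Made-up data; the index labels are arbitrary.
s = pd.Series(["Mr", "Mrs", "Master"], index=["a", "b", "c"])

as_list = [len(x) for x in s]   # plain Python list, nothing else is done
as_series = s.apply(len)        # new Series: index carried over, dtype inferred

print(type(as_list).__name__, as_list)
print(type(as_series).__name__, as_series.dtype, list(as_series.index))
```

The list comprehension only collects the values, while apply also builds a Series, keeps the original index, and picks an integer dtype for the results.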

There is also a bunch of other work happening within apply, so in this example you would want to use map:

%timeit train['NameTitle'] = train['Name'].map(map_function)
# 545 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This performs better than a list comprehension with a function call in it.

Then why use apply at all, you might ask? I know at least one case where it outperforms everything else: when the operation you want to apply is a vectorized universal function. That's because apply, unlike map and a list comprehension, allows the function to run on the whole Series instead of on the individual objects in it. Let's see an example:

%timeit train['AgeExp'] = train['Age'].apply(lambda x: np.exp(x))
# 1.44 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = train['Age'].apply(np.exp)
# 256 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = train['Age'].map(np.exp)
# 1.01 ms ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = [np.exp(x) for x in train['Age']]
# 1.21 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
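The difference between running a universal function on the whole Series and running it once per element can be sketched as follows. The four age values are made up, and missing values are left out on purpose to keep the comparison simple.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for train['Age'] (no missing values here).
ages = pd.Series([22.0, 38.0, 26.0, 35.0])

whole = np.exp(ages)              # one ufunc call on the whole Series
per_element = ages.map(np.exp)    # np.exp called once per value
via_apply = ages.apply(np.exp)    # apply given the ufunc directly

print(type(whole).__name__)
print(np.allclose(whole, per_element), np.allclose(whole, via_apply))
```

All three give the same numbers; the timings above differ only in how many times np.exp has to be entered from Python.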

Solution 1
0 2023-01-06 16:40:27
