Why does my for loop with an if/else clause run so slowly?

Question

TL;DR: I'm trying to understand why the for loop below is incredibly slow, taking hours to run on a dataset of 160K entries.

I have a working solution using a function and .apply(), but I want to understand why my homegrown solution is so bad. I'm obviously a complete beginner with Python:

popular_or_not = []
counter = 0
for id in df['id']:
    if df['popularity'][df['id'] == id].values == 0:
        popular_or_not.append(0)
    else:
        popular_or_not.append(1)
    counter += 1

df['popular_or_not'] = popular_or_not
df

In more detail:

I'm currently learning Python for data science, and I'm looking at this dataset on Kaggle: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

I'm interested in predicting/modelling the popularity score. It is not normally distributed:

plt.bar(df['popularity'].value_counts().index, df['popularity'].value_counts().values)

I would like to add a column that says whether a track is popular or not, with popular tracks being those that get a score of 5 and above and unpopular tracks being all the others.

I have tried the following solution, but it runs incredibly slowly, and I'm not sure why. It runs fine on a very small subset, but would take a few hours to run on the full dataset:

popular_or_not = []
counter = 0
for id in df['id']:
    if df['popularity'][df['id'] == id].values == 0:
        popular_or_not.append(0)
    else:
        popular_or_not.append(1)
    counter += 1

df['popular_or_not'] = popular_or_not
df

This alternative solution works fine:

def check_popularity(score):
    if score > 5:
        return 1
    else:
        #pdb.set_trace()
        return 0
df['popularity'].apply(check_popularity).value_counts()
df['popular_or_not'] = df['popularity'].apply(check_popularity)

I think understanding why my first solution doesn't work might be an important part of my Python learning.

Answer 1

Thanks everyone for your comments. I'm going to summarize them below as an answer to my question, but please feel free to jump in if anything is incorrect:

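The reason my initial for loop was so slow is that I was evaluating df['id'] == id 160k times. Each evaluation compares id against every row of the column, so the loop performs roughly 160,000 × 160,000 ≈ 2.6 × 10^10 comparisons in total, plus the cost of building a new boolean mask and filtering the frame on every iteration.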
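To make the quadratic cost concrete, here is a minimal sketch on a toy frame. The four-row DataFrame and the names track_id and comparisons are made up for illustration; they are not part of the Kaggle dataset. It only counts how many element comparisons the loop triggers:

```python
import pandas as pd

# Hypothetical toy data standing in for the Kaggle file.
df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "popularity": [0, 12, 5, 0],
})

n = len(df)
comparisons = 0
for track_id in df["id"]:
    # The comparison is evaluated against every row, producing a
    # boolean mask as long as the whole column.
    mask = df["id"] == track_id
    comparisons += len(mask)

print(comparisons)  # n * n, i.e. 16 for this 4-row frame
```

With n = 160,000 the same count becomes 160,000 ** 2 = 25,600,000,000, which is why the loop is fine on a small subset and takes hours on the full data.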

For this type of operation, instead of scanning a pandas DataFrame thousands of times, it is usually a good idea to think about vectorization: a set of tools and methods that process a whole column in a single operation at C speed. This is the code I ended up with:

def check_popularity(score):
    if score > 5:
        return 1
    else:
        #pdb.set_trace()
        return 0
df['popularity'].apply(check_popularity).value_counts()
df['popular_or_not'] = df['popularity'].apply(check_popularity)

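By using .apply with a predefined function, I get the column I was after in seconds instead of hours. Strictly speaking, .apply still calls the Python function once per value rather than running at C speed, but it makes a single pass over the column, which is what removes the quadratic cost.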
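For comparison, a fully vectorized version might look like the sketch below. It assumes the same `> 5` threshold as check_popularity (use `>= 5` if "5 and above" is what is intended), and the five-row frame is invented for illustration. The comparison produces a boolean Series in one operation, and astype(int) turns it into 0/1; np.where is an equivalent alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the 'popularity' column matters here.
df = pd.DataFrame({"popularity": [0, 3, 5, 6, 40]})

def check_popularity(score):
    return 1 if score > 5 else 0

# Element-by-element Python calls, but only one pass over the column.
via_apply = df["popularity"].apply(check_popularity)

# Vectorized: one comparison over the whole column, then cast to 0/1.
vectorized = (df["popularity"] > 5).astype(int)

# Equivalent formulation with NumPy.
via_where = np.where(df["popularity"] > 5, 1, 0)

df["popular_or_not"] = vectorized
```

All three give the same 0/1 values; on a column of 160K rows the vectorized forms avoid the per-row Python function call entirely.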

Solution 1
0 2021-03-13 10:19:33
