Why does my for loop with if else clause run so slow?

TL;DR: I'm trying to understand why the for loop below is incredibly slow, taking hours to run on a dataset of 160K entries.

I have a working solution using a function and .apply(), but I want to understand why my homegrown solution is so bad. I'm obviously a complete beginner with Python:

popular_or_not = []
counter = 0
for id in df['id']:
    if df['popularity'][df['id'] == id].values == 0:
        popular_or_not.append(0)
    else:
        popular_or_not.append(1)
    counter += 1

df['popular_or_not'] = popular_or_not
df

In more detail:

I'm currently learning Python for data science, and I'm looking at this dataset on Kaggle: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

I'm interested in predicting/modelling the popularity score. It is not normally distributed:

plt.bar(df['popularity'].value_counts().index, df['popularity'].value_counts().values)


I would like to add a column indicating whether a track is popular or not, with popular tracks being those that get a score of 5 and above and unpopular tracks being all the others.

I have tried the following solution, but it runs incredibly slowly, and I'm not sure why. It runs fine on a very small subset, but would take a few hours to run on the full dataset:

popular_or_not = []
counter = 0
for id in df['id']:
    if df['popularity'][df['id'] == id].values == 0:
        popular_or_not.append(0)
    else:
        popular_or_not.append(1)
    counter += 1

df['popular_or_not'] = popular_or_not
df

This alternative solution works fine:

def check_popularity(score):
    if score > 5:
        return 1
    else:
        #pdb.set_trace()
        return 0
df['popularity'].apply(check_popularity).value_counts()
df['popular_or_not'] = df['popularity'].apply(check_popularity)

I think understanding why my first solution performs so poorly might be an important part of my Python learning.

Thanks everyone for your comments. I'm going to summarize them below as an answer to my question, but please feel free to jump in if anything is incorrect:

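The reason my initial for loop was so slow is that it evaluated df['id'] == id 160K times, once per row. Each evaluation compares id against every row of the column, so the loop performs roughly 160K × 160K ≈ 2.6 × 10^10 comparisons in total, plus the overhead of building a new boolean mask on every iteration.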
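To make that cost concrete, here is a minimal sketch on synthetic data. The frame `demo`, its size `n`, and the `track_…` ids are made up for illustration and are not the Kaggle dataset. It shows that a single `==` against the id column already touches every row, so repeating it once per row grows quadratically.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (hypothetical data, illustration only).
n = 1_000
demo = pd.DataFrame({
    "id": [f"track_{i}" for i in range(n)],
    "popularity": np.arange(n) % 10,
})

# One iteration of the original loop: the mask compares one id against every row.
mask = demo["id"] == "track_0"
comparisons_per_iteration = len(mask)                       # n comparisons for one id
total_comparisons = comparisons_per_iteration * len(demo)   # n * n for the whole loop

print(comparisons_per_iteration, total_comparisons)  # 1000 1000000
```

With n = 160,000 the same arithmetic gives about 2.56 × 10^10 comparisons, which is consistent with a run time measured in hours.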

For this type of operation, instead of iterating over a pandas dataframe thousands of times, it's usually a good idea to think of applying vectorization - a set of tools and methods that process a whole column in a single call at C speed. Strictly speaking, .apply() with a Python function is not fully vectorized, because it still calls the function once per value, but it makes a single pass over the column instead of rescanning the whole dataframe on every iteration. This is what I did with the following code:

def check_popularity(score):
    if score > 5:
        return 1
    else:
        #pdb.set_trace()
        return 0
df['popularity'].apply(check_popularity).value_counts()
df['popular_or_not'] = df['popularity'].apply(check_popularity)

By using .apply() with a predefined function, I get the same result in seconds instead of hours.
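For comparison, a fully vectorized version can replace the per-value function call with one comparison on the whole column. This is only a sketch on a tiny made-up frame, not the Spotify data. It uses the same `> 5` cutoff as check_popularity; note that the question's wording ("5 and above") would correspond to `>= 5` instead.

```python
import pandas as pd

# Tiny made-up frame standing in for the Spotify data (illustration only).
df = pd.DataFrame({"id": ["a", "b", "c", "d"], "popularity": [0, 5, 6, 42]})

# Compare the whole column at once, then convert True/False to 1/0.
df["popular_or_not"] = (df["popularity"] > 5).astype(int)

# The .apply() version from the post, kept here only to check both agree.
def check_popularity(score):
    return 1 if score > 5 else 0

via_apply = df["popularity"].apply(check_popularity).tolist()
assert df["popular_or_not"].tolist() == via_apply

print(df["popular_or_not"].tolist())  # [0, 0, 1, 1]
```

On a column of 160K values either approach should finish quickly; the column-wise comparison simply avoids the Python-level function call per value.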
