
How do I iterate through a Pandas dataframe with conditions? (confusion over iterrows/for loops/vectorization)

I have a dataset that I need to iterate over subject to a condition:

import numpy as np
import pandas as pd
data = [[-10, 10, 'Hawaii', 'Honolulu'], [-22, 63], [32, -14]]
df = pd.DataFrame(data, columns = ['lat', 'long', 'state', 'capital'])


for x in range(len(df))
    if df['state'] and df['capital'] = np.nan:
        df['state'] = 'Investigate state'
        df['capital'] = 'Investigate capital'

My expected output is that if the state and capital fields are both empty, each one is filled with its respective placeholder value. My actual data and the function inside the loop are more complex than this example, but what I want to focus on is the iterative/looping portion with the condition.

My Googling turned up iterrows, and I read tutorials that simply say to go ahead and use a for loop. Stack Overflow answers denounced both of those options and advocated using vectorization instead. My actual dataset will have around 20,000 rows. What is the most efficient approach, and how do I implement it?

You can test each column separately and chain the masks with & for bitwise AND:

m = df['state'].isna() & df['capital'].isna()
df.loc[m, ['capital', 'state']] = ['Investigate capital','Investigate state']
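
As a self-contained illustration of the mask approach above, here is a minimal sketch. It rebuilds the question's three rows with the missing cells written out explicitly as np.nan (an assumption made so the example does not depend on how pandas pads short rows), and the variable name both_missing is only illustrative.

```python
import numpy as np
import pandas as pd

# Same three rows as in the question, with the missing cells spelled out.
df = pd.DataFrame(
    {
        "lat": [-10, -22, 32],
        "long": [10, 63, -14],
        "state": ["Hawaii", np.nan, np.nan],
        "capital": ["Honolulu", np.nan, np.nan],
    }
)

# Element-wise AND of two boolean Series. The Python keyword `and` would
# instead raise "ValueError: The truth value of a Series is ambiguous".
both_missing = df["state"].isna() & df["capital"].isna()

# .loc assigns only in the rows where the mask is True;
# the values are matched to the listed columns in order.
df.loc[both_missing, ["capital", "state"]] = ["Investigate capital", "Investigate state"]

print(df)
```

The row for Hawaii is left untouched because its mask value is False; no explicit loop over rows is needed.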

On sample data with 30k rows, about 66% of them matching, it is fastest to also set the columns separately:

m = df['state'].isna() & df['capital'].isna()
df['state']= np.where(m, 'Investigate state', df['state'])
df['capital']= np.where(m, 'Investigate capital', df['capital'])

Similar, using df.loc:

m = df['state'].isna() & df['capital'].isna()
df.loc[m, 'state']='Investigate state'
df.loc[m, 'capital']='Investigate capital'

#30k rows
df = pd.concat([df] * 10000, ignore_index=True)


%%timeit
m = df['state'].isna() & df['capital'].isna()
df['state']= np.where(m, 'Investigate state', df['state'])
df['capital']= np.where(m, 'Investigate capital', df['capital'])

3.45 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit 
m = df['state'].isna() & df['capital'].isna()
df.loc[m,'state']='Investigate state'
df.loc[m,'capital']='Investigate capital'


3.58 ms ± 11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
m = df['state'].isna() & df['capital'].isna()
df.loc[m,['capital', 'state']] = ['Investigate capital','Investigate state']

4.5 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
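
The %%timeit cells are IPython magics, so they do not run in a plain script. A rough equivalent can be sketched with the standard-library timeit module. The function names fill_with_where and fill_with_loc are hypothetical, and each function works on a copy so that every run starts from unfilled data (the copy time is therefore included in the measurement). Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame(
    {
        "lat": [-10, -22, 32],
        "long": [10, 63, -14],
        "state": ["Hawaii", np.nan, np.nan],
        "capital": ["Honolulu", np.nan, np.nan],
    }
)
big = pd.concat([base] * 10000, ignore_index=True)  # 30k rows


def fill_with_where(frame):
    """Fill both columns with np.where; returns a new frame."""
    out = frame.copy()
    m = out["state"].isna() & out["capital"].isna()
    out["state"] = np.where(m, "Investigate state", out["state"])
    out["capital"] = np.where(m, "Investigate capital", out["capital"])
    return out


def fill_with_loc(frame):
    """Fill both columns with boolean .loc assignment; returns a new frame."""
    out = frame.copy()
    m = out["state"].isna() & out["capital"].isna()
    out.loc[m, "state"] = "Investigate state"
    out.loc[m, "capital"] = "Investigate capital"
    return out


for func in (fill_with_where, fill_with_loc):
    runs = 20
    seconds = timeit.timeit(lambda: func(big), number=runs) / runs
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```

Working on a copy matters for a fair comparison: once a frame has been filled in place, the mask is empty on later runs, so repeated in-place timings partly measure a no-op.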

Other solutions:

%%timeit 
m=df[['state','capital']].isna().all(1)
df.loc[m]=df.loc[m].fillna({'state':'Investigate state','capital':'Investigate capital'})

6.68 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit 
m=df[['state','capital']].isna().all(1)
df.loc[m,'state']='Investigate state'
df.loc[m,'capital']='Investigate capital'


4.72 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
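
For comparison with the iterrows approach mentioned in the question, here is a hedged sketch of what a working row-by-row loop could look like, checked against the vectorized mask. The names looped and vectorized are illustrative only. A loop like this produces the same result but calls Python code once per row, which is why the mask-based versions are generally preferred.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "lat": [-10, -22, 32],
        "long": [10, 63, -14],
        "state": ["Hawaii", np.nan, np.nan],
        "capital": ["Honolulu", np.nan, np.nan],
    }
)

# Row-by-row version: iterrows() yields (index label, row as a Series).
looped = df.copy()
for idx, row in looped.iterrows():
    # `and` works here because pd.isna() on a single cell gives one True/False.
    if pd.isna(row["state"]) and pd.isna(row["capital"]):
        # .at writes a single cell back into the frame by label.
        looped.at[idx, "state"] = "Investigate state"
        looped.at[idx, "capital"] = "Investigate capital"

# Vectorized version: one boolean mask for all rows at once.
vectorized = df.copy()
m = vectorized["state"].isna() & vectorized["capital"].isna()
vectorized.loc[m, "state"] = "Investigate state"
vectorized.loc[m, "capital"] = "Investigate capital"

# Both approaches fill the same cells.
print(looped["state"].tolist() == vectorized["state"].tolist())
print(looped["capital"].tolist() == vectorized["capital"].tolist())
```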


 