
Quicker way to iterate pandas dataframe and apply a conditional function

Summary

I am trying to iterate over a large dataframe, identify unique groups based on several columns, and replace another column with the group mean based on how many rows are in the group. My current approach is very slow when iterating over a large dataset and applying the average function across many columns. Is there a way I can do this more efficiently?

Example

Here's an example of the problem. I want to find unique combinations of ['A', 'B', 'C']. For each unique combination, I want the value of column ['D'] divided by the number of rows in the group.

Edit: The resulting dataframe should preserve the duplicated groups, but with an edited column 'D'.

import pandas as pd
import numpy as np
import datetime

def time_mean_rows():
    # Generate some random data
    A = np.random.randint(0, 5, 1000)
    B = np.random.randint(0, 5, 1000)
    C = np.random.randint(0, 5, 1000)
    D = np.random.randint(0, 10, 1000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D]).T
    df.columns = ['A', 'B', 'C', 'D']


    tstart = datetime.datetime.now()

    # Get unique combinations of A, B, C
    unique_groups = df[['A', 'B', 'C']].drop_duplicates().reset_index()

    # Iterate unique groups
    normalised_solutions = []
    for idx, row in unique_groups.iterrows():
        # Subset dataframe to the unique group
        sub_df = df[
            (df['A'] == row['A']) &
            (df['B'] == row['B']) & 
            (df['C'] == row['C'])
            ]

        # If more than one solution, replace column D with the group mean
        num_solutions = len(sub_df)
        if num_solutions > 1:
            # Work on a copy to avoid SettingWithCopyWarning on a slice of df
            sub_df = sub_df.copy()
            sub_df['D'] = sub_df['D'].mean()
            normalised_solutions.append(sub_df)

    # Concatenate results
    res = pd.concat(normalised_solutions)

    tend = datetime.datetime.now()
    time_elapsed = (tend - tstart).total_seconds()
    print(time_elapsed)

I know the section causing the slowdown is when num_solutions > 1. How can I do this more efficiently?

Hmm, why not use groupby?

df_res = df.groupby(['A', 'B', 'C'])['D'].mean().reset_index() 
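
Note that this returns one row per unique (A, B, C) combination rather than preserving the duplicated rows the question asks for. A minimal sketch of its behaviour on a toy frame (the data below is made up for illustration):

import pandas as pd

# Two rows share the group (0, 1, 2); one row forms the group (1, 2, 3)
df = pd.DataFrame({'A': [0, 0, 1], 'B': [1, 1, 2],
                   'C': [2, 2, 3], 'D': [4, 6, 8]})
df_res = df.groupby(['A', 'B', 'C'])['D'].mean().reset_index()
print(df_res)
#    A  B  C    D
# 0  0  1  2  5.0
# 1  1  2  3  8.0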

This is a complement to AT_asks's answer, which only gave the first part of the solution.

Once we have df.groupby(['A', 'B', 'C'])['D'].mean(), we can use it to change the value of column 'D' in a copy of the original dataframe, provided we use a dataframe sharing the same index. The global solution is then:

res = df.set_index(['A', 'B', 'C']).assign(
    D=df.groupby(['A', 'B', 'C'])['D'].mean()).reset_index()

This will contain the same rows (even if in a different order) as the res dataframe from OP's question.
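
Continuing the toy frame from the sketch above: assign aligns the group means on the (A, B, C) index, so every duplicated row is preserved and receives its group's mean (again illustrative, not part of the original answer):

res = df.set_index(['A', 'B', 'C']).assign(
    D=df.groupby(['A', 'B', 'C'])['D'].mean()).reset_index()
print(res)
#    A  B  C    D
# 0  0  1  2  5.0
# 1  0  1  2  5.0
# 2  1  2  3  8.0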

Here's a solution I found

Using groupby as suggested by AT, then merging back to the original df and dropping the original ['D', 'E'] columns. Nice speedup!

from datetime import timedelta
from timeit import default_timer as timer

def time_mean_rows():
    # Generate some random data
    np.random.seed(seed=42)
    A = np.random.randint(0, 10, 10000)
    B = np.random.randint(0, 10, 10000)
    C = np.random.randint(0, 10, 10000)
    D = np.random.randint(0, 10, 10000)
    E = np.random.randint(0, 10, 10000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D, E]).T
    df.columns = ['A', 'B', 'C', 'D', 'E']

    tstart_grpby = timer()
    cols = ['D', 'E']

    group_df = df.groupby(['A', 'B', 'C'])[cols].mean().reset_index()

    # Merge df
    df = pd.merge(df, group_df, how='left', on=['A', 'B', 'C'], suffixes=('_left', ''))

    # Get left columns (have not been normalised) and drop
    drop_cols = [x for x in df.columns if x.endswith('_left')]
    df.drop(drop_cols, inplace=True, axis='columns')

    tend_grpby = timer()
    time_elapsed_grpby = timedelta(seconds=tend_grpby-tstart_grpby).total_seconds()
    print(time_elapsed_grpby)
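
For comparison, pandas can also produce per-row group means directly with transform, which returns a result already aligned to the original index and skips the merge and column clean-up entirely. A sketch of that alternative (assuming the same df as in the function above, applied in place of the merge step):

cols = ['D', 'E']
# Overwrite D and E in place with their group means, one value per original row
df[cols] = df.groupby(['A', 'B', 'C'])[cols].transform('mean')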
