Quicker way to iterate pandas dataframe and apply a conditional function
I am trying to iterate over a large dataframe: identify unique groups based on several columns, then apply the mean to another column based on how many rows are in the group. My current approach is very slow when iterating over a large dataset and applying the average function across many columns. Is there a way I can do this more efficiently?
Here's an example of the problem. I want to find unique combinations of ['A', 'B', 'C']. For each unique combination, I want the value of column ['D'] divided by the number of rows in the group.
Edit: The resulting dataframe should preserve the duplicated groups, but with edited column 'D'.
import pandas as pd
import numpy as np
import datetime

def time_mean_rows():
    # Generate some random data
    A = np.random.randint(0, 5, 1000)
    B = np.random.randint(0, 5, 1000)
    C = np.random.randint(0, 5, 1000)
    D = np.random.randint(0, 10, 1000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D]).T
    df.columns = ['A', 'B', 'C', 'D']

    tstart = datetime.datetime.now()

    # Get unique combinations of A, B, C
    unique_groups = df[['A', 'B', 'C']].drop_duplicates().reset_index()

    # Iterate unique groups
    normalised_solutions = []
    for idx, row in unique_groups.iterrows():
        # Subset dataframe to the unique group (copy so the assignment
        # below does not raise SettingWithCopyWarning)
        sub_df = df[
            (df['A'] == row['A']) &
            (df['B'] == row['B']) &
            (df['C'] == row['C'])
        ].copy()

        # If more than one solution, get mean of column D
        num_solutions = len(sub_df)
        if num_solutions > 1:
            sub_df.loc[:, 'D'] = sub_df['D'].values.sum() / num_solutions
        normalised_solutions.append(sub_df)

    # Concatenate results
    res = pd.concat(normalised_solutions)

    tend = datetime.datetime.now()
    # note: (tend - tstart), not (tstart - tend), which would be negative
    time_elapsed = (tend - tstart).total_seconds()
    print(time_elapsed)
I know the section causing the slowdown is the num_solutions > 1 branch. How can I do this more efficiently?
Hmm, why not use groupby?
df_res = df.groupby(['A', 'B', 'C'])['D'].mean().reset_index()
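Note that this collapses each group to a single row. If the duplicated groups need to be preserved, as the question's edit asks, `transform('mean')` broadcasts the group mean back onto every original row. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 0, 1],
    'B': [0, 0, 1],
    'C': [0, 0, 2],
    'D': [2, 4, 5],
})
# transform keeps the original shape: every row of a duplicated
# (A, B, C) group receives that group's mean of D
df['D'] = df.groupby(['A', 'B', 'C'])['D'].transform('mean')
print(df['D'].tolist())  # [3.0, 3.0, 5.0]
```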
This is a complement to AT_asks's answer, which only gave the first part of the solution. Once we have df.groupby(['A', 'B', 'C'])['D'].mean(), we can use it to change the value of column 'D' in a copy of the original dataframe, provided we use a dataframe sharing the same index. The global solution is then:
res = df.set_index(['A', 'B', 'C']).assign(
    D=df.groupby(['A', 'B', 'C'])['D'].mean()).reset_index()
This will contain the same rows (even if in a different order) as the res dataframe from OP's question.
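The trick works through index alignment: after set_index(['A', 'B', 'C']) the copy shares the MultiIndex of the grouped result, so assign gives each row its group's mean even when the group appears several times. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 0, 1],
    'B': [0, 0, 1],
    'C': [0, 0, 2],
    'D': [2, 4, 5],
})
# set_index(['A', 'B', 'C']) gives the copy the same MultiIndex as the
# groupby result, so assign aligns each row with its group mean
res = df.set_index(['A', 'B', 'C']).assign(
    D=df.groupby(['A', 'B', 'C'])['D'].mean()).reset_index()
print(res['D'].tolist())  # [3.0, 3.0, 5.0]
```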
Here's a solution I found: using groupby as suggested by AT, then merging back to the original df and dropping the original ['D', 'E'] columns. Nice speedup!
from datetime import timedelta
from timeit import default_timer as timer

import numpy as np
import pandas as pd

def time_mean_rows():
    # Generate some random data
    np.random.seed(seed=42)
    A = np.random.randint(0, 10, 10000)
    B = np.random.randint(0, 10, 10000)
    C = np.random.randint(0, 10, 10000)
    D = np.random.randint(0, 10, 10000)
    E = np.random.randint(0, 10, 10000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D, E]).T
    df.columns = ['A', 'B', 'C', 'D', 'E']

    tstart_grpby = timer()
    cols = ['D', 'E']
    group_df = df.groupby(['A', 'B', 'C'])[cols].mean().reset_index()

    # Merge group means back onto df
    df = pd.merge(df, group_df, how='left', on=['A', 'B', 'C'], suffixes=('_left', ''))

    # Get left columns (have not been normalised) and drop
    drop_cols = [x for x in df.columns if x.endswith('_left')]
    df.drop(drop_cols, inplace=True, axis='columns')
    tend_grpby = timer()

    time_elapsed_grpby = timedelta(seconds=tend_grpby - tstart_grpby).total_seconds()
    print(time_elapsed_grpby)
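As an aside, the merge-and-drop steps above can be collapsed into a single `transform('mean')` call, which broadcasts the group means directly onto the original rows. A sketch (with hypothetical random data) checking that both approaches agree:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 10, (10000, 5)),
                  columns=list('ABCDE'))

# merge-based normalisation, as in the answer above
group_df = df.groupby(['A', 'B', 'C'])[['D', 'E']].mean().reset_index()
merged = df.merge(group_df, how='left', on=['A', 'B', 'C'],
                  suffixes=('_left', ''))
merged = merged.drop(columns=[c for c in merged.columns
                              if c.endswith('_left')])

# transform does the same in one step, with no merge/drop dance
alt = df.copy()
alt[['D', 'E']] = df.groupby(['A', 'B', 'C'])[['D', 'E']].transform('mean')

print(merged[['D', 'E']].equals(alt[['D', 'E']]))  # True
```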