Pandas: create a new dataframe by querying other dataframes without using iterrows

Question

I have two huge dataframes that share the same id field. I want to build a simple summary dataframe that shows the maximum of specific columns. I understand iterrows() is frowned upon, so are there a couple of one-liners that would do this? I don't understand lambda/apply very well, but maybe they would work here.

Stand-alone example

import pandas as pd

myid = [1,1,2,3,4,4,5]
name =['A','A','B','C','D','D','E']
x = [15,12,3,3,1,4,8]
df1 = pd.DataFrame(list(zip(myid, name, x)), 
                  columns=['myid', 'name', 'x'])
display(df1)

myid = [1,2,2,2,3,4,5,5]
name =['A','B','B','B','C','D','E','E']
y = [9,6,3,4,6,2,8,2]
df2 = pd.DataFrame(list(zip(myid, name, y)), 
                  columns=['myid', 'name', 'y'])
display(df2)

mylist = df1['myid'].unique()
df_summary = pd.DataFrame(mylist, columns=['MY_ID'])
## do work here...

Desired output

Answer 1

You can try concat + groupby.max:

out = (pd.concat((df1,df2),sort=False).groupby(['myid','name']).max()
         .add_prefix("Max_").reset_index())

   myid name  Max_x  Max_y
0     1    A   15.0    9.0
1     2    B    3.0    6.0
2     3    C    3.0    6.0
3     4    D    4.0    2.0
4     5    E    8.0    8.0
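The maxima above print as floats because stacking the two frames leaves NaN in whichever column a row did not come from. The sketch below rebuilds the question's data and shows that intermediate step; the final `astype` cast back to integers is an optional addition, not part of the original answer, and it only works when every (myid, name) key has a value in both frames.

```python
import pandas as pd

df1 = pd.DataFrame({
    "myid": [1, 1, 2, 3, 4, 4, 5],
    "name": ["A", "A", "B", "C", "D", "D", "E"],
    "x": [15, 12, 3, 3, 1, 4, 8],
})
df2 = pd.DataFrame({
    "myid": [1, 2, 2, 2, 3, 4, 5, 5],
    "name": ["A", "B", "B", "B", "C", "D", "E", "E"],
    "y": [9, 6, 3, 4, 6, 2, 8, 2],
})

# Stack the frames: rows from df1 get NaN in y, rows from df2 get NaN in x.
stacked = pd.concat((df1, df2), sort=False)

# max() skips NaN, so each group keeps the largest real value per column.
out = (stacked.groupby(["myid", "name"]).max()
              .add_prefix("Max_").reset_index())

# Optional (assumption: no key is missing from either frame): restore ints.
out_int = out.astype({"Max_x": "int64", "Max_y": "int64"})
print(out_int)
```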

Answer 2

Use merge() followed by named aggregations:

df1.merge(df2, on=["myid","name"], how="outer")\
.groupby(["myid","name"], as_index=False).agg(MAX_X=("x","max"),MAX_Y=("y","max"))

   myid name  MAX_X  MAX_Y
0     1    A     15      9
1     2    B      3      6
2     3    C      3      6
3     4    D      4      2
4     5    E      8      8

Update

You noted that your dataframes are large and that the solution above runs out of memory (OOM). Aggregating each dataframe first and merging afterwards should use less memory, because the merge then sees only one row per (myid, name) key instead of every pairing of duplicate rows.

pd.merge(
    df1.groupby(["myid","name"],as_index=False).agg(MAX_X=("x","max")),
    df2.groupby(["myid","name"],as_index=False).agg(MAX_Y=("y","max")),
    on=["myid","name"]
)
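As a rough check, the aggregate-first approach can be run end to end on the question's sample data and compared with the merge-first version. This is a sketch, not the answer's exact code: passing `how="outer"` is an assumption added here so that ids present in only one frame would be kept (the snippet above uses the default inner join), and the `keys` variable is just a convenience name.

```python
import pandas as pd

df1 = pd.DataFrame({
    "myid": [1, 1, 2, 3, 4, 4, 5],
    "name": ["A", "A", "B", "C", "D", "D", "E"],
    "x": [15, 12, 3, 3, 1, 4, 8],
})
df2 = pd.DataFrame({
    "myid": [1, 2, 2, 2, 3, 4, 5, 5],
    "name": ["A", "B", "B", "B", "C", "D", "E", "E"],
    "y": [9, 6, 3, 4, 6, 2, 8, 2],
})

keys = ["myid", "name"]

# Reduce each frame to one row per key before merging.
max_x = df1.groupby(keys, as_index=False).agg(MAX_X=("x", "max"))
max_y = df2.groupby(keys, as_index=False).agg(MAX_Y=("y", "max"))

# Assumption: how="outer" keeps keys that appear in only one frame.
df_summary = pd.merge(max_x, max_y, on=keys, how="outer")

# Merge-first version for comparison; its intermediate frame is larger
# because duplicate keys are paired row by row before aggregation.
merged = df1.merge(df2, on=keys, how="outer")
merge_first = merged.groupby(keys, as_index=False).agg(
    MAX_X=("x", "max"), MAX_Y=("y", "max")
)

print(df_summary)
print(len(merged), "rows in the merge-first intermediate frame")
```

On this sample both versions give the same maxima; the difference only shows up in the size of the intermediate frame, which grows with the number of duplicate keys.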
