
Pandas create new dataframe by querying other dataframes without using iterrows

I have two huge dataframes that both have the same id field. I want to make a simple summary dataframe where I show the maximum of specific columns. I understand iterrows() is frowned upon, so are there a couple of one-liners to do this? I don't understand lambda/apply very well, but maybe this would work here.

Stand-alone example

import pandas as pd

myid = [1,1,2,3,4,4,5]
name =['A','A','B','C','D','D','E']
x = [15,12,3,3,1,4,8]
df1 = pd.DataFrame(list(zip(myid, name, x)), 
                  columns=['myid', 'name', 'x'])
display(df1)

myid = [1,2,2,2,3,4,5,5]
name =['A','B','B','B','C','D','E','E']
y = [9,6,3,4,6,2,8,2]
df2 = pd.DataFrame(list(zip(myid, name, y)), 
                  columns=['myid', 'name', 'y'])
display(df2)

mylist = df1['myid'].unique()
df_summary = pd.DataFrame(mylist, columns=['MY_ID'])
## do work here...


Desired output (image not preserved): one row per myid, showing the maximum of x and the maximum of y.

You can try concat + groupby.max:

out = (pd.concat((df1,df2),sort=False).groupby(['myid','name']).max()
         .add_prefix("Max_").reset_index())

   myid name  Max_x  Max_y
0     1    A   15.0    9.0
1     2    B    3.0    6.0
2     3    C    3.0    6.0
3     4    D    4.0    2.0
4     5    E    8.0    8.0
  • merge()
  • named aggregations
df1.merge(df2, on=["myid","name"], how="outer")\
.groupby(["myid","name"], as_index=False).agg(MAX_X=("x","max"),MAX_Y=("y","max"))

   myid name  MAX_X  MAX_Y
0     1    A     15      9
1     2    B      3      6
2     3    C      3      6
3     4    D      4      2
4     5    E      8      8

Update

  • You have noted that your data frames are large and that this solution gives you an out-of-memory (OOM) error.
  • Logically, aggregating first and then merging will use less memory.
pd.merge(
    df1.groupby(["myid","name"],as_index=False).agg(MAX_X=("x","max")),
    df2.groupby(["myid","name"],as_index=False).agg(MAX_Y=("y","max")),
    on=["myid","name"]
)
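
To see why the order matters, here is a minimal sketch using the sample data from the question. Merging first pairs every x row with every y row that shares a key, so the intermediate frame grows with the product of the per-key row counts. Aggregating first reduces each side to one row per key before the join. The `how="outer"` argument and the final `sort_values` call are my own additions, not part of the answer above: they keep ids that appear in only one frame and make the row order explicit.

```python
import pandas as pd

df1 = pd.DataFrame({
    "myid": [1, 1, 2, 3, 4, 4, 5],
    "name": ["A", "A", "B", "C", "D", "D", "E"],
    "x": [15, 12, 3, 3, 1, 4, 8],
})
df2 = pd.DataFrame({
    "myid": [1, 2, 2, 2, 3, 4, 5, 5],
    "name": ["A", "B", "B", "B", "C", "D", "E", "E"],
    "y": [9, 6, 3, 4, 6, 2, 8, 2],
})
keys = ["myid", "name"]

# Merge first: each x row is paired with every y row that has the same key,
# so the intermediate frame can be much larger than either input.
merged_first = df1.merge(df2, on=keys, how="outer")
via_merge = merged_first.groupby(keys, as_index=False).agg(
    MAX_X=("x", "max"), MAX_Y=("y", "max")
)

# Aggregate first: each side shrinks to one row per key before the merge.
max_x = df1.groupby(keys, as_index=False).agg(MAX_X=("x", "max"))
max_y = df2.groupby(keys, as_index=False).agg(MAX_Y=("y", "max"))
agg_first = (
    pd.merge(max_x, max_y, on=keys, how="outer")  # how="outer" is an assumption
    .sort_values(keys, ignore_index=True)
)

# Compare the size of the frame that has to be held in memory for the join.
print("rows after merge-first:    ", len(merged_first))
print("rows after aggregate-first:", len(agg_first))
print(agg_first)
```

On real data the gap between the two row counts depends on how many duplicate rows each id has in both frames, so it is worth checking on a sample before running either version on the full data.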
