Pandas: create a new dataframe by querying other dataframes without using iterrows
I have two huge dataframes that both have the same id field. I want to make a simple summary dataframe that shows the maximum of specific columns. I understand iterrows() is frowned upon, so are there a couple of one-liners to do this? I don't understand lambda/apply very well, but maybe they would work here.
Stand-alone example
import pandas as pd
myid = [1,1,2,3,4,4,5]
name =['A','A','B','C','D','D','E']
x = [15,12,3,3,1,4,8]
df1 = pd.DataFrame(list(zip(myid, name, x)),
columns=['myid', 'name', 'x'])
display(df1)
myid = [1,2,2,2,3,4,5,5]
name =['A','B','B','B','C','D','E','E']
y = [9,6,3,4,6,2,8,2]
df2 = pd.DataFrame(list(zip(myid, name, y)),
columns=['myid', 'name', 'y'])
display(df2)
mylist = df1['myid'].unique()  # unique ids from df1
df_summary = pd.DataFrame(mylist, columns=['MY_ID'])
## do work here...
Desired output
You can try concat + groupby.max:
out = (pd.concat((df1,df2),sort=False).groupby(['myid','name']).max()
.add_prefix("Max_").reset_index())
   myid name  Max_x  Max_y
0     1    A   15.0    9.0
1     2    B    3.0    6.0
2     3    C    3.0    6.0
3     4    D    4.0    2.0
4     5    E    8.0    8.0
merge()
df1.merge(df2, on=["myid","name"], how="outer")\
.groupby(["myid","name"], as_index=False).agg(MAX_X=("x","max"),MAX_Y=("y","max"))
 | myid | name | MAX_X | MAX_Y |
---|---|---|---|---|
0 | 1 | A | 15 | 9 |
1 | 2 | B | 3 | 6 |
2 | 3 | C | 3 | 6 |
3 | 4 | D | 4 | 2 |
4 | 5 | E | 8 | 8 |
pd.merge(
df1.groupby(["myid","name"],as_index=False).agg(MAX_X=("x","max")),
df2.groupby(["myid","name"],as_index=False).agg(MAX_Y=("y","max")),
on=["myid","name"]
)
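For comparison, here is a minimal self-contained sketch that runs the concat approach and the aggregate-then-merge approach on the sample data from the question. The names `keys`, `via_concat` and `via_merge` are only illustrative. The `how="outer"` argument is an assumption on my part: the snippet above uses the default inner join, which would drop any id that appears in only one of the two frames.

```python
import pandas as pd

# Same sample data as in the question.
df1 = pd.DataFrame({
    "myid": [1, 1, 2, 3, 4, 4, 5],
    "name": ["A", "A", "B", "C", "D", "D", "E"],
    "x": [15, 12, 3, 3, 1, 4, 8],
})
df2 = pd.DataFrame({
    "myid": [1, 2, 2, 2, 3, 4, 5, 5],
    "name": ["A", "B", "B", "B", "C", "D", "E", "E"],
    "y": [9, 6, 3, 4, 6, 2, 8, 2],
})

keys = ["myid", "name"]  # illustrative helper name

# Stack both frames, then take the per-group maximum of every column.
via_concat = (
    pd.concat([df1, df2], sort=False)
    .groupby(keys)
    .max()
    .add_prefix("Max_")
    .reset_index()
)

# Aggregate each frame on its own, then merge the two small results.
via_merge = pd.merge(
    df1.groupby(keys, as_index=False).agg(Max_x=("x", "max")),
    df2.groupby(keys, as_index=False).agg(Max_y=("y", "max")),
    on=keys,
    how="outer",  # assumption: keep ids that occur in only one frame
)

print(via_concat)
print(via_merge)
```

Both give the same maxima here. The concat version returns float columns because stacking pads the missing x/y cells with NaN, while aggregating before merging keeps the intermediate frames small, which may matter for huge inputs.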