
Pandas: combine two dataframes with same columns by picking values

I have two dataframes:

The first:

id  time_begin  time_end
0   1938    1946
1   1991    1991
2   1359    1991
4   1804    1937
6   1368    1949
... ... ...

Second:

id  time_begin  time_end
1   1946    1946
3   1940    1954
5   1804    1925
6   1978    1978
7   1912    1949

Now I want to combine the two dataframes so that I get all rows from both. Since an id is sometimes present in both dataframes (e.g. ids 1 and 6), for those rows I want to pick the minimum time_begin of the two and the maximum time_end of the two. My expected result:

id  time_begin  time_end
0   1938    1946
1   1946    1991
2   1359    1991
3   1940    1954
5   1804    1925
4   1804    1937
6   1368    1978
7   1912    1949
... ... ...

How can I achieve this? Normal join/combine operations do not allow for this as far as I can tell.

You could first merge the dataframes and then use groupby with agg to pick min(time_begin) and max(time_end) for each id:

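import pandas as pd

df1 = pd.DataFrame({'id': [0, 1, 2, 4, 6],
                    'time_begin': [1938, 1991, 1359, 1804, 1368],
                    'time_end': [1946, 1991, 1991, 1937, 1949]})
df2 = pd.DataFrame({'id': [1, 3, 5, 6, 7],
                    'time_begin': [1946, 1940, 1804, 1978, 1912],
                    'time_end': [1946, 1954, 1925, 1978, 1949]})

# Outer merge on all shared columns: keeps every row from both frames
df = df1.merge(df2, how='outer')
# Per id, take the earliest time_begin and the latest time_end
df = df.groupby('id').agg({'time_begin': 'min', 'time_end': 'max'})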

Output:

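    time_begin  time_end
id
0         1938      1946
1         1946      1991
2         1359      1991
3         1940      1954
4         1804      1937
5         1804      1925
6         1368      1978
7         1912      1949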
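It may help to see why the outer merge keeps both versions of ids 1 and 6. The following is a minimal sketch, not part of the original answer: the variable name `merged` and the use of `indicator=True` are additions for illustration. With no `on=` argument, `merge` joins on every column the two frames share (here all three), so only rows that are identical in both frames would pair up.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [0, 1, 2, 4, 6],
                    'time_begin': [1938, 1991, 1359, 1804, 1368],
                    'time_end': [1946, 1991, 1991, 1937, 1949]})
df2 = pd.DataFrame({'id': [1, 3, 5, 6, 7],
                    'time_begin': [1946, 1940, 1804, 1978, 1912],
                    'time_end': [1946, 1954, 1925, 1978, 1949]})

# indicator=True adds a `_merge` column recording where each row came from:
# 'left_only', 'right_only', or 'both'.
merged = df1.merge(df2, how='outer', indicator=True)

# Show how many rows came from each side, and the total row count.
print(merged['_merge'].value_counts())
print(len(merged))
```

In this sample data no row is identical in both frames, so all ten rows survive the merge and the later groupby sees two rows each for ids 1 and 6. If you would rather join on id alone, you would need `on='id'`, which produces suffixed columns instead of stacked rows and so needs a different follow-up step.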

The trick is to define a different aggregation function for each column:

pd.concat([df1, df2]).groupby('id').agg({'time_begin':'min', 'time_end':'max'})
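For completeness, here is a self-contained sketch of the concat approach using the sample data from the question. The name `combined` and the trailing `reset_index()` are additions for illustration. The reset turns id back into an ordinary column so the result has the same three-column shape as the expected table.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [0, 1, 2, 4, 6],
                    'time_begin': [1938, 1991, 1359, 1804, 1368],
                    'time_end': [1946, 1991, 1991, 1937, 1949]})
df2 = pd.DataFrame({'id': [1, 3, 5, 6, 7],
                    'time_begin': [1946, 1940, 1804, 1978, 1912],
                    'time_end': [1946, 1954, 1925, 1978, 1949]})

combined = (
    pd.concat([df1, df2], ignore_index=True)   # stack all rows from both frames
      .groupby('id')                           # one group per id
      .agg({'time_begin': 'min',               # earliest begin within the group
            'time_end': 'max'})                # latest end within the group
      .reset_index()                           # make id a regular column again
)

print(combined)
```

Unlike the outer merge, concat does not try to match rows at all, so it behaves the same whether or not a row appears identically in both frames. The min and max aggregations make such duplicates harmless either way.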
