Merge two dataframes and keep the common values while retaining values based on another column

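When I merge two dataframes, the result keeps the columns from both the left and the right dataframe, with _x and _y appended to their names. I want a single column instead, with the values of the two columns "merged" so that:

  1. When the values are the same, only one value is kept.
  2. When the values differ, the value to keep is chosen using another column named "date": the value with the latest date wins.

I also tried concat. That does combine the two columns into one, but it appears to simply append the rows of one dataframe to the other.

For example, in the code below I would like to obtain the output dataframe df_desired. How can I get that?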

import pandas as pd
import numpy as np

np.random.seed(30)

company1 = ('comA','comB','comC','comD')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[100,200,300,400]
df1['date'] = [20191231,20191231,20191001,20190931]  # note: 20190931 is not a valid calendar date (September has 30 days)
print("\ndf1:")
print(df1)

company2 = ('comC','comD','comE','comF')
df2 = pd.DataFrame(columns=None)
df2['company'] = company2
df2['clv']=[300,450,500,600]
df2['date'] = [20191231,20191231,20191231,20191231]

print("\ndf2:")
print(df2)

df_desired = pd.DataFrame(columns=None)
df_desired['company'] = ('comA','comB','comC','comD','comE','comF')
df_desired['clv']=[100,200,300,450,500,600]
df_desired['date'] = [20191231,20191231,20191231,20191231,20191231,20191231]
print("\ndf_desired:")
print(df_desired)

df_merge = pd.merge(df1,df2,left_on = 'company',
        right_on = 'company',how='outer')
print("\ndf_merge:")
print(df_merge)
# alternately
df_concat = pd.concat([df1, df2], ignore_index=True, sort=False)
print("\ndf_concat:")
print(df_concat)

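One approach is to concatenate the two dataframes, sort the concatenated frame by date in ascending order, and then drop duplicate entries per company while keeping the last (that is, the latest) one. Values that cannot be parsed as dates become NaT because of errors='coerce', and na_position='first' sorts them to the top so they never win over a row with a valid date: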

df = pd.concat([df1, df2])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df = df.sort_values('date', na_position='first').drop_duplicates('company', keep='last', ignore_index=True)

Result:

  company  clv       date
0    comA  100 2019-12-31
1    comB  200 2019-12-31
2    comC  300 2019-12-31
3    comD  450 2019-12-31
4    comE  500 2019-12-31
5    comF  600 2019-12-31
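
If you would rather stay with pd.merge, as in the question, the _x/_y columns can be collapsed afterwards. The following is only a sketch under a few assumptions that are not in the original post: the frames are rebuilt with real datetime values, the invalid 20190931 is replaced by 2019-09-30, a tie on date is resolved in favour of the right-hand frame, and the names use_right and result are illustrative.

```python
import pandas as pd

# Same data as in the question, but with proper datetimes.
# Assumption: the invalid 20190931 is replaced by 2019-09-30.
df1 = pd.DataFrame({
    "company": ["comA", "comB", "comC", "comD"],
    "clv": [100, 200, 300, 400],
    "date": pd.to_datetime(["2019-12-31", "2019-12-31", "2019-10-01", "2019-09-30"]),
})
df2 = pd.DataFrame({
    "company": ["comC", "comD", "comE", "comF"],
    "clv": [300, 450, 500, 600],
    "date": pd.to_datetime(["2019-12-31"] * 4),
})

# The outer merge keeps every company and yields clv_x/clv_y and date_x/date_y.
merged = pd.merge(df1, df2, on="company", how="outer", suffixes=("_x", "_y"))

# Take the right-hand values when they exist and are at least as recent as
# the left-hand ones, or when the left-hand side is missing entirely.
use_right = merged["date_y"].notna() & (
    merged["date_x"].isna() | (merged["date_y"] >= merged["date_x"])
)

result = pd.DataFrame({
    "company": merged["company"],
    # The outer merge turns clv into float (NaN for missing rows); cast back.
    "clv": merged["clv_y"].where(use_right, merged["clv_x"]).astype("int64"),
    "date": merged["date_y"].where(use_right, merged["date_x"]),
})
result = result.sort_values("company", ignore_index=True)
print(result)
```

Compared with the concat approach, this keeps both candidate values side by side until the choice is made, which can be handy if you also want to inspect which side won for each company.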

