[英]How to use pandas groupby on one column with agg - max one col, min another col - without producing multi-level columns
我有以下 pandas DataFrame:
account_number = [1234, 5678, 9012, 1234.0, 5678, 9012, 1234.0, 5678, 9012, 1234.0, 5678, 9012]
client_name = ["Ford", "GM", "Honda", "Ford", "GM", "Honda", "Ford", "GM", "Honda", "Ford", "GM", "Honda"]
database = ["DB_Ford", "DB_GM", "DB_Honda", "DB_Ford", "DB_GM", "DB_Honda", "DB_Ford", "DB_GM", "DB_Honda", "DB_Ford", "DB_GM", "DB_Honda"]
server = ["L01SQL04", "L01SQL08", "L01SQL12", "L01SQL04", "L01SQL08", "L01SQL12", "L01SQL04", "L01SQL08", "L01SQL12", "L01SQL04", "L01SQL08", "L01SQL12"]
order_num = [2145479, 2145506, 2145534, 2145603, 2145658, 2429513, 2145489, 2145516, 2145544, 2145499, 2145526, 2145554]
customer_dob = ["1967-12-01", "1963-07-09", "1986-12-05", "1967-11-01", None, "1986-12-05", "1967-12-01", "1963-07-09", "1986-12-05", "1967-12-01", "1963-07-09", "1986-12-04"]
purchase_date = ["2022-06-18", "2022-04-11", "2021-01-18", "2022-06-20", "2022-04-11", "2021-01-18", "2022-06-22", "2022-04-13", "2021-01-18", "2022-06-24", "2022-04-18", "2021-01-18"]
d = {
"account_number": account_number,
"client_name" : client_name,
"database" : database,
"server" : server,
"order_num" : order_num,
"customer_dob" : customer_dob,
"purchase_date" : purchase_date,
}
df = pd.DataFrame(data=d)
dates = ["customer_dob", "purchase_date"]
for date in dates:
df[date] = pd.to_datetime(df[date])
每个 account_number 客户的出生日期 (DOB) 和购买日期 (PD) 应该是相同的,但是由于任何一个都可能存在数据输入错误,我想对 account_number 执行 groupby 并获得 DOB 的最大值,以及 PD 上的最小值。 如果除了 account_number 之外我想要的只是这两列,这很容易做到:
result = df.groupby("account_number").agg({"customer_dob": "max", "purchase_date": "min"}).reset_index()
result
但是,我也想要其他列,因为它们保证对于每个 account_number 都是相同的。 问题是,当我尝试包含其他列时,我得到了我不想要的多级列。 第一次尝试不仅产生了多级列,而且我什至看不到 DOB 和 PD 的实际值
result = df.groupby("account_number")["client_name", "database", "server", "order_num"].agg({"customer_dob": "max", "purchase_date": "min"}).reset_index()
result
第二次尝试包括 DOB 和 PD,但现在每个帐号两次,同时仍生成多级列:
result = df.groupby("account_number")["client_name", "database", "server", "order_num", "customer_dob", "purchase_date"].agg(
{"patient_dob": "max", "insert_date": "min"}).reset_index()
result
我只希望最终结果看起来像这样:
所以,这是我对你们所有 Python 专家的问题:我需要做什么才能做到这一点?
根据您上面的评论,留下订单号。 如果一个帐户的订单号相同,则将订单号添加到合并中的列列表中
result = df.groupby("account_number").agg({"customer_dob": "max", "purchase_date": "min"}).reset_index()
result.merge(df[['account_number','client_name','database','server' ]] ,
how='left',
on='account_number').drop_duplicates()
account_number customer_dob purchase_date client_name database server
0 1234.0 1967-12-01 2022-06-18 Ford DB_Ford L01SQL04
4 5678.0 1963-07-09 2022-04-11 GM DB_GM L01SQL08
8 9012.0 1986-12-05 2021-01-18 Honda DB_Honda L01SQL12
您可以使用:
agg(new_col_1=(col_1, 'sum'), new_col_2=(col_2, 'min'))
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.