Pandas: how to merge duplicate rows into one without modifying string columns
I ran into a problem while processing data in an xlsx file. My leader told me to first merge rows with the same uid, and then to get the sums of "content post volume", "views" and "exploration volume" according to the "Account type" column, which consists of the types "content" and "social".
For example, there are two rows recording the data of uid 1680260000; they should be merged into one. The last three columns should be added together, becoming 5364, 3710029, and 300408, but the first three columns must not be modified. The user_level should stay F2 (levels are ordered F1, F2, F3, F4, ...; prefer the front value).
uid         user_level  Account type  content post volum  exploration volum  Views
1680260000  F2          content       112                 934318             118
2209220000  F1          social        628                 147623896          20160351
1680260000  F5          content       5252                2775711            300290
5390800000  F3          content       127                 1235530            8554
6017200000  F2          social        142                 649046             43144
7054610000  F2          social        23                  1226520            232074
1682390000  F2          content       1162                18025639           722
3136670000  F2          content       6                   2123571            189379
3136670000  F6          content       0                   6                  0
5393860000  F2          social        60                  3246476            17
6017200000  F3          content       677                 8855471            277229
6017200000  F2          social        737                 11854463           0
1685250000  F2          content       96                  2211002            5942
The expected result:
uid         user_level  Account type  content post volum  exploration volum  Views
1680260000  F2          content       5364                3710029            300408
2209220000  F1          social        628                 147623896          20160351
5390800000  F3          content       127                 1235530            8554
6017200000  F2          social        1556                21358980           320373
7054610000  F2          social        23                  1226520            232074
1682390000  F2          content       1162                18025639           722
3136670000  F2          content       6                   2123577            189379
5393860000  F2          social        60                  3246476            17
1685250000  F2          content       96                  2211002            5942
Now the problem is that if I use df.groupby("uid").sum(), the "Account type" column (a string column) gets added together too. This isn't what I want, because later I need to extract data based on it. For example, after merging the rows with duplicated uid, I need to get the rows whose "Account type" is in ["F0", "F1", "F2"]. But groupby turns cell values into "F1F3", "F4F1", which are hard to distinguish. I did try to split the string when extracting, such as
file[file.Account_social_type.str.split("F").isin(["1", "2", "3"])]
PS: after .str.split("F"), "F1F3" turns into ["", "1", "3"]
but somehow .str.split("F") here doesn't behave per cell the way I expected, it acts on the whole column!
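For reference, here is a small repro of the filtering problem: .str.split returns a list per cell, and Series.isin then compares against whole values rather than the elements inside each list, so the attempt above can never match. A per-cell set intersection does what I wanted (a sketch; the sample values are made up for the demo):

```python
import pandas as pd

# toy column of merged level strings, like the ones groupby().sum() produces
s = pd.Series(["F2", "F1F3", "F4F1", "F5"])

targets = {"F0", "F1", "F2"}

# extract the individual levels per cell, then test for any overlap
mask = s.str.findall(r"F\d+").apply(lambda lv: bool(targets & set(lv)))
print(mask.tolist())  # [True, True, True, False]
```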
So in the end, I used a clumsy method. First I use
taruid = file[file.uid.duplicated(keep="first")].uid.to_list()
# somehow this statement still leaves repeated values in the list ^
taruid = list(set(taruid))
to get all the repeated uids. Then I use
def changeOne(rows: pd.DataFrame):
    rows = rows.sort_values(by="F_level")
    first = rows.index[0]
    # write the column sums into the first (lowest-level) row;
    # .loc avoids the chained-assignment pitfall of rows.col.iloc[0] = ...
    rows.loc[first, "content_post_volume"] = rows["content_post_volume"].sum() / 92
    rows.loc[first, "views"] = rows["views"].sum() / 92
    rows.loc[first, "exploration_volume"] = rows["exploration_volume"].sum() / 92
    return rows
replaceOne: pd.DataFrame = pd.DataFrame()
for item in taruid:
    goals = file[file.uid == item]
    # DataFrame.append is deprecated; concatenate the summed first row instead
    replaceOne = pd.concat([replaceOne, changeOne(goals.copy()).iloc[[0]]],
                           ignore_index=True)
to get the sums of the specified columns and store them in the first row. Finally I use
file = file.drop_duplicates(subset = "uid", keep=False)
# drop all repeated rows
file = pd.concat([file, replaceOne], axis=0, ignore_index=True)
to get the final integrated data. The flaws are very significant: nearly 1500 rows cost about 3 s. There must be a much easier and more efficient way to solve this with groupby or some advanced pandas function.
What I want to ask is: how can we merge/sum duplicated rows without modifying the string columns, or the columns I specify?
I spent half a day trying to optimize this but failed. I'd really appreciate it if you could figure it out.
The expected format is unclear, but you can use different functions to aggregate the data.
Let's form comma-separated strings of the unique values for "user_level" and "Account type":
string_agg = lambda s: ','.join(dict.fromkeys(s))
out = (df.groupby('uid', as_index=False)
.agg({'user_level': string_agg, 'Account type': string_agg,
'content post volum': 'sum',
'exploration volum': 'sum', 'Views': 'sum'})
)
Output:
          uid user_level    Account type  content post volum  exploration volum     Views
0  1680260000      F2,F5         content                5364            3710029    300408
1  1682390000         F2         content                1162           18025639       722
2  1685250000         F2         content                  96            2211002      5942
3  2209220000         F1          social                 628          147623896  20160351
4  3136670000      F2,F6         content                   6            2123577    189379
5  5390800000         F3         content                 127            1235530      8554
6  5393860000         F2          social                  60            3246476        17
7  6017200000      F2,F3  social,content                1556           21358980    320373
8  7054610000         F2          social                  23            1226520    232074
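With this comma-separated form you can still filter rows by level later, e.g. keeping the rows whose user_level contains any of F0/F1/F2 (a sketch on a toy frame mimicking the output above):

```python
import pandas as pd

# toy aggregated frame mimicking the output above
out = pd.DataFrame({
    "uid": [1680260000, 6017200000, 5390800000],
    "user_level": ["F2,F5", "F2,F3", "F3"],
})

# split each cell back into its levels and keep rows with any overlap
wanted = out[out["user_level"].str.split(",")
                .apply(lambda lv: bool(set(lv) & {"F0", "F1", "F2"}))]
print(wanted["uid"].tolist())  # [1680260000, 6017200000]
```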
For the demo, here is an alternative aggregation function that counts the duplicates:
from collections import Counter
string_agg = lambda s: ','.join([f'{k}({v})' if v>1 else k for k,v in Counter(s).items()])
Output:
          uid user_level       Account type  content post volum  exploration volum     Views
0  1680260000      F2,F5         content(2)                5364            3710029    300408
1  1682390000         F2            content                1162           18025639       722
2  1685250000         F2            content                  96            2211002      5942
3  2209220000         F1             social                 628          147623896  20160351
4  3136670000      F2,F6         content(2)                   6            2123577    189379
5  5390800000         F3            content                 127            1235530      8554
6  5393860000         F2             social                  60            3246476        17
7  6017200000   F2(2),F3  social(2),content                1556           21358980    320373
8  7054610000         F2             social                  23            1226520    232074
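If, as in the question's expected output, you want to keep only the front value (F2 rather than F2,F5), one sketch is to sort by user_level first and aggregate the string columns with 'first'. Note this relies on lexicographic order, so it assumes single-digit levels (F10 would sort before F2):

```python
import pandas as pd

# toy frame using the question's column names
df = pd.DataFrame({
    "uid": [1680260000, 1680260000, 2209220000],
    "user_level": ["F2", "F5", "F1"],
    "Account type": ["content", "content", "social"],
    "content post volum": [112, 5252, 628],
    "exploration volum": [934318, 2775711, 147623896],
    "Views": [118, 300290, 20160351],
})

out = (df.sort_values("user_level")          # F1 < F2 < ... for single-digit levels
         .groupby("uid", as_index=False)
         .agg({"user_level": "first",        # keep the front value only
               "Account type": "first",
               "content post volum": "sum",
               "exploration volum": "sum",
               "Views": "sum"}))
print(out)
```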
df = df.groupby(['uid', 'Account type', 'user_level'])[['content post volum', 'exploration volum', 'Views']].sum()
df2 = df.groupby(['uid', 'Account type'])[['content post volum', 'exploration volum', 'Views']].sum()
df = df.reset_index()
df = df.drop_duplicates('uid')
df = df[['uid', 'Account type', 'user_level']]
df = df.merge(df2.reset_index(), on=['uid'])
df