
Pandas how to merge duplicate rows into one but without modifying string columns

I ran into a problem while processing data in an xlsx file. My lead told me first to merge the values of rows that share the same uid, and then to compute the sums of "content post volume", "views" and "exploration volume" grouped by the "Account type" column, which contains the types "content" and "social".

For example, there are two rows recording the data of uid 1680260000; they should be merged into one. The last three columns should be added together, becoming 5364, 3710029 and 300408, but the first three columns must not be modified. The user_level should stay F2 (levels are ordered F1, F2, F3, F4, ...; prefer the earlier value).

uid          user_level    Account type   content post volum  exploration volum    Views
1680260000  F2                content          112                   934318            118
2209220000  F1                social           628                  147623896     20160351
1680260000  F5                content          5252                  2775711        300290
5390800000  F3                content          127                   1235530          8554
6017200000  F2                social           142                    649046         43144
7054610000  F2                social           23                    1226520        232074
1682390000  F2                content          1162                 18025639           722
3136670000  F2                content          6                    2123571         189379
3136670000  F6                content          0                         6              0
5393860000  F2                social           60                    3246476           17
6017200000  F3                content          677                   8855471        277229
6017200000  F2                social           737                  11854463             0
1685250000  F2                content          96                    2211002          5942

The expected result:
uid          user_level    Account type   content post volum    exploration volum    Views
1680260000  F2                content         5364                  3710029         300408
2209220000  F1                social           628                  147623896     20160351
5390800000  F3                content          127                   1235530          8554
6017200000  F2                social           1556                  21358980       320373
7054610000  F2                social           23                    1226520        232074
1682390000  F2                content          1162                 18025639          722
3136670000  F2                content          6                    2123577         189379
5393860000  F2                social           60                    3246476           17
1685250000  F2                content          96                    2211002          5942

Now the problem is that if I use df.groupby("uid").sum(), the string column "Account type" gets concatenated too. This isn't what I want, because later I need to extract data depending on it. For example, after merging the rows with duplicated uid, I need to get the rows whose "Account type" is in ["F0", "F1", "F2"]. But groupby turns the cell values into "F1F3", "F4F1", which are hard to distinguish. I did try to split the string when extracting, such as

file[file.Account_social_type.str.split("F").isin(["1", "2", "3"])]
ps: after .str.split("F"), "F1F3" turns into ["", "1", "3"]

but for some reason, at this point .str.split("F") seems to act on the whole column rather than on each cell, and the filter doesn't behave as expected!
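(In fact `.str.split("F")` does work element-wise; what fails is `.isin`, which asks whether each *entire list* such as `["", "1", "3"]` equals one of `"1"`, `"2"`, `"3"` and is therefore always False. A per-cell membership test can be sketched with `apply` — a toy example assuming the concatenated values look like "F1F3":)

```python
import pandas as pd

# hypothetical merged values, as described in the question
df = pd.DataFrame({"Account_social_type": ["F1F3", "F4F1", "F5F6"]})

# .str.split("F") yields a Series of lists: ["", "1", "3"], ["", "4", "1"], ...
# so the membership check has to be done per cell with apply():
mask = df.Account_social_type.str.split("F").apply(
    lambda parts: any(p in ["1", "2", "3"] for p in parts)
)
# mask is [True, True, False]
```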
So in the end, I used a clumsy method. First I use

taruid = file[file.uid.duplicated(keep="first")].uid.to_list()
# duplicated(keep="first") marks every occurrence after the first,
# so a uid that appears three times still shows up twice in the list
taruid = list(set(taruid))

to get all the repeated uids. Then I use

def changeOne(rows: pd.DataFrame):
    rows = rows.sort_values(by="F_level")
    # assign through .loc with an explicit label to avoid chained indexing
    first = rows.index[0]
    rows.loc[first, "content_post_volume"] = rows["content_post_volume"].sum()
    rows.loc[first, "views"] = rows["views"].sum()
    rows.loc[first, "exploration_volume"] = rows["exploration_volume"].sum()
    return rows

replaceOne: pd.DataFrame = pd.DataFrame()
for item in taruid:
    goals = file[file.uid == item]
    # DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
    replaceOne = pd.concat([replaceOne, changeOne(goals.copy()).iloc[[0]]])

to compute the summed values of the specified columns and store them in the first row of each group. Finally I use

file = file.drop_duplicates(subset="uid", keep=False)
# keep=False drops every copy of a repeated uid
file = pd.concat([file, replaceOne], axis=0, ignore_index=True)

to get the final merged data. The flaws are significant: nearly 1500 rows take about 3 seconds. There must be a much easier and more efficient way to solve this with groupby or some other advanced pandas function.
What I want to ask is: how can we merge/sum duplicated rows without modifying the string columns, or any columns I specify?
I spent half a day trying to optimize this but failed. I'd really appreciate it if you can figure it out.

The expected format is unclear, but you can use different functions to aggregate the data.

Let's form comma-separated strings of the unique values for "user_level" and "Account type":

string_agg = lambda s: ','.join(dict.fromkeys(s))

out = (df.groupby('uid', as_index=False)
         .agg({'user_level': string_agg, 'Account type': string_agg,
               'content post volum': 'sum',
               'exploration volum': 'sum', 'Views': 'sum'})
      )

Output:

          uid user_level    Account type  content post volum  exploration volum     Views
0  1680260000      F2,F5         content                5364            3710029    300408
1  1682390000         F2         content                1162           18025639       722
2  1685250000         F2         content                  96            2211002      5942
3  2209220000         F1          social                 628          147623896  20160351
4  3136670000      F2,F6         content                   6            2123577    189379
5  5390800000         F3         content                 127            1235530      8554
6  5393860000         F2          social                  60            3246476        17
7  6017200000      F2,F3  social,content                1556           21358980    320373
8  7054610000         F2          social                  23            1226520    232074

For the demo, here is an alternative aggregation function to count the duplicates:

from collections import Counter
string_agg = lambda s: ','.join([f'{k}({v})' if v>1 else k for k,v in Counter(s).items()])

Output:

          uid user_level       Account type  content post volum  exploration volum     Views
0  1680260000      F2,F5         content(2)                5364            3710029    300408
1  1682390000         F2            content                1162           18025639       722
2  1685250000         F2            content                  96            2211002      5942
3  2209220000         F1             social                 628          147623896  20160351
4  3136670000      F2,F6         content(2)                   6            2123577    189379
5  5390800000         F3            content                 127            1235530      8554
6  5393860000         F2             social                  60            3246476        17
7  6017200000   F2(2),F3  social(2),content                1556           21358980    320373
8  7054610000         F2             social                  23            1226520    232074
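If, instead of joining the strings, the goal is to keep only the front user_level as the question asks, the same groupby can take 'first' after sorting. A minimal sketch, assuming the levels compare correctly as plain strings (F1 < F2 < ...; a level like F10 would need numeric handling):

```python
import pandas as pd

# a few sample rows from the question's table
df = pd.DataFrame({
    "uid": [1680260000, 1680260000, 3136670000, 3136670000],
    "user_level": ["F2", "F5", "F2", "F6"],
    "Account type": ["content", "content", "content", "content"],
    "content post volum": [112, 5252, 6, 0],
    "exploration volum": [934318, 2775711, 2123571, 6],
    "Views": [118, 300290, 189379, 0],
})

# sort so the lowest level comes first in each group, then take 'first'
# for the string columns and 'sum' for the numeric ones
out = (df.sort_values("user_level")
         .groupby("uid", as_index=False)
         .agg({"user_level": "first", "Account type": "first",
               "content post volum": "sum",
               "exploration volum": "sum", "Views": "sum"}))
# uid 1680260000 becomes F2 / content / 5364 / 3710029 / 300408
```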
cols = ['content post volum', 'exploration volum', 'Views']
# group the totals by uid alone so the final merge yields one row per uid
df2 = df.groupby('uid', as_index=False)[cols].sum()
df = df.groupby(['uid', 'Account type', 'user_level'])[cols].sum()
df = df.reset_index()
df = df.drop_duplicates('uid')
df = df[['uid', 'Account type', 'user_level']]
df = df.merge(df2, on=['uid'])
df

Output: (screenshot of the resulting dataframe; image not preserved)
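A small runnable check of that two-step idea, on the rows for uid 6017200000 from the question's table (note that which 'Account type'/'user_level' survives drop_duplicates is whichever sorts first in the groupby, not necessarily the front F-level):

```python
import pandas as pd

# rows for uid 6017200000 from the question's table
df = pd.DataFrame({
    "uid": [6017200000, 6017200000, 6017200000],
    "Account type": ["social", "content", "social"],
    "user_level": ["F2", "F3", "F2"],
    "content post volum": [142, 677, 737],
    "exploration volum": [649046, 8855471, 11854463],
    "Views": [43144, 277229, 0],
})

cols = ["content post volum", "exploration volum", "Views"]

# per-uid totals, plus the label columns from the first row of each
# uid group, stitched back together with merge
totals = df.groupby("uid", as_index=False)[cols].sum()
labels = (df.groupby(["uid", "Account type", "user_level"])[cols].sum()
            .reset_index()
            .drop_duplicates("uid")[["uid", "Account type", "user_level"]])
out = labels.merge(totals, on="uid")
# one row: totals 1556 / 21358980 / 320373
```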


 