
pandas: Group by splitting string value in all rows (a column) and aggregation function

If I have a dataset like this:

id   person_name                       salary
0    [alexander, william, smith]       45000
1    [smith, robert, gates]            65000
2    [bob, alexander]                  56000
3    [robert, william]                 80000
4    [alexander, gates]                70000

If we sum the salary column, we get 316000.

I want to know how much each distinct person (alexander, smith, etc.) makes in total: for each name, sum the salaries of every row whose person_name list contains that name.

Desired output:

group               sum_salary
alexander           171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william             125000 #sum from id 0 + 3
smith               110000 #sum from id 0 + 1
robert              145000 #sum from id 1 + 3
gates               135000 #sum from id 1 + 4
bob                  56000 #sum from id 2

As we can see, the total of the sum_salary column is not the same as the total in the initial dataset, because this aggregation counts each salary once for every name in its row (double counting).

It seems similar to counting strings, but what confuses me is how to apply the aggregation function. I tried creating a list of the distinct values in the person_name column, but then I got stuck.

Any help is appreciated, thank you very much.

Solutions working with lists in column person_name:

#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')

print (type(df.loc[0, 'person_name']))
<class 'list'>
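
For readers who want to run the answers below, here is a minimal sketch that rebuilds the sample data from the question. It assumes person_name holds real Python lists. The second frame, df_str, is a hypothetical name used only for illustration: it stores the same column as plain strings such as '[alexander, william, smith]' and then applies the commented-out conversion shown above.

```python
import pandas as pd

# Sample data from the question, with person_name stored as Python lists
df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'person_name': [['alexander', 'william', 'smith'],
                    ['smith', 'robert', 'gates'],
                    ['bob', 'alexander'],
                    ['robert', 'william'],
                    ['alexander', 'gates']],
    'salary': [45000, 65000, 56000, 80000, 70000],
})

# Hypothetical variant: the same column stored as plain strings
df_str = df.assign(
    person_name=df['person_name'].apply(lambda names: '[' + ', '.join(names) + ']')
)

# Strip the surrounding brackets and split on ', ' to get lists back
df_str['person_name'] = df_str['person_name'].str.strip('[]').str.split(', ')

print(type(df_str.loc[0, 'person_name']))
```

If the type printed is list, the solutions below can be applied directly.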

The first idea is to use a defaultdict to store the summed values in a loop:

from collections import defaultdict

d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
    for x in p:
        d[x] += int(s)

print (d)
defaultdict(<class 'int'>, {'alexander': 171000, 
                            'william': 125000, 
                            'smith': 110000, 
                            'robert': 145000, 
                            'gates': 135000, 
                            'bob': 56000})

And then build a DataFrame from the dictionary:

df1 = pd.DataFrame({'group':list(d.keys()),
                    'sum_salary':list(d.values())})
print (df1)
       group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000

Another solution repeats each salary by the length of its list and then aggregates with sum:

from itertools import chain

df1 = pd.DataFrame({
    'group' : list(chain.from_iterable(df['person_name'].tolist())), 
    'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})

df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
       group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000
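
As a sketch of an alternative that is not part of the original answer: in pandas 0.25 and later, DataFrame.explode performs the same flatten-and-repeat step in one call. The sample frame is rebuilt here so the block runs on its own, and the names exploded and df3 are chosen only for illustration.

```python
import pandas as pd

# Sample data from the question, with person_name stored as Python lists
df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'person_name': [['alexander', 'william', 'smith'],
                    ['smith', 'robert', 'gates'],
                    ['bob', 'alexander'],
                    ['robert', 'william'],
                    ['alexander', 'gates']],
    'salary': [45000, 65000, 56000, 80000, 70000],
})

# explode() gives each list element its own row and repeats the salary next to it
exploded = df.explode('person_name')

# Sum per name, keeping names in order of first appearance (sort=False)
df3 = (exploded.groupby('person_name', sort=False)['salary']
               .sum()
               .rename_axis('group')
               .reset_index(name='sum_salary'))
print(df3)
```

The exploded frame has one row per (row, name) pair, so its salary total is 742000 rather than 316000: each salary is counted once for every name in its list.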

Another solution:

import numpy as np

df_new=(pd.DataFrame({'person_name':np.concatenate(df.person_name.values),
                  'salary':df.salary.repeat(df.person_name.str.len())}))
print(df_new.groupby('person_name')['salary'].sum().reset_index())


  person_name  salary
0   alexander  171000
1         bob   56000
2       gates  135000
3      robert  145000
4       smith  110000
5     william  125000

This can be done concisely with dummies, though performance will suffer because of all the .str methods:

df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()

#alexander    171000
#bob           56000
#gates        135000
#robert       145000
#smith        110000
#william      125000
#dtype: int64
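
To make the one-liner easier to follow, here is the same chain split into steps. This is only a sketch: the intermediate names joined, indicators, weighted and result are introduced for illustration, and the sample frame is rebuilt so the block runs on its own.

```python
import pandas as pd

# Sample data from the question, with person_name stored as Python lists
df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'person_name': [['alexander', 'william', 'smith'],
                    ['smith', 'robert', 'gates'],
                    ['bob', 'alexander'],
                    ['robert', 'william'],
                    ['alexander', 'gates']],
    'salary': [45000, 65000, 56000, 80000, 70000],
})

# Join each list into one delimited string, e.g. 'alexander*william*smith'
joined = df['person_name'].str.join('*')

# One 0/1 indicator column per distinct name, one row per original row
indicators = joined.str.get_dummies('*')

# Multiply every row by that row's salary (axis=0 aligns on the row index)
weighted = indicators.multiply(df['salary'], axis=0)

# Column sums are the per-name salary totals
result = weighted.sum()
print(result)
```

The '*' separator is safe here only because no name contains an asterisk; any character that cannot occur in the names would do.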

I parsed this as strings of lists, by copying the OP's data and using pandas.read_clipboard(). If that is indeed the case (a Series of strings that look like lists), this solution would work:

df = df.merge(df.person_name.str.split(',', expand=True), left_index=True, right_index=True)
df = df[[0, 1, 2, 'salary']].melt(id_vars = 'salary').drop(columns='variable')

# Some cleaning up, then a simple groupby
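# regex=False treats '[' and ']' as literal characters rather than regex syntax
df.value = df.value.str.replace('[', '', regex=False)
df.value = df.value.str.replace(']', '', regex=False)
df.value = df.value.str.replace(' ', '', regex=False)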
df.groupby('value')['salary'].sum()

Output:

value
alexander    171000
bob           56000
gates        135000
robert       145000
smith        110000
william      125000

Another way you can do this is with iterrows(). This will not be as fast as jezrael's solution, but it works:

ids = []
names = []
salarys = []

# Iterate over the rows and extract the names from the lists in person_name column
for ix, row in df.iterrows():
    for name in row['person_name']:
        ids.append(row['id'])
        names.append(name)
        salarys.append(row['salary'])

# Create a new 'unnested' dataframe
df_new = pd.DataFrame({'id':ids,
                       'names':names,
                       'salary':salarys})

# Group by names and get the sum
print(df_new.groupby('names').salary.sum().reset_index())

Output:

       names  salary
0  alexander  171000
1        bob   56000
2      gates  135000
3     robert  145000
4      smith  110000
5    william  125000
