How to pass a list as input to the groupby function in a pandas DataFrame

Question

Suppose a subset of a dataset comprises these 2 columns:

     attacker_king              attacker_commander
0   Joffrey/Tommen Baratheon    Jaime Lannister
1   Joffrey/Tommen Baratheon    Gregor Clegane
2   Joffrey/Tommen Baratheon    Jaime Lannister, Andros Brax
3   Robb Stark                  Roose Bolton, Wylis Manderly, Medger Cerwyn
4   Robb Stark                  Robb Stark, Brynden Tully
5   Robb Stark                  Robb Stark, Tytos Blackwood, Brynden Tully

My objective is to get the set of commanders that each king deploys, according to the dataset.

[x for x in battles['attacker_commander'].dropna().str.split(',').sum()]

The above command only produces a single list of commanders, split on commas. But if I instead use the following groupby and sum,

battles[['attacker_commander','attacker_king']].groupby('attacker_king').sum()

I get this output:

attacker_king                      attacker_commander   
Balon/Euron Greyjoy         Victarion GreyjoyAsha GreyjoyTheon GreyjoyTheo...
Joffrey/Tommen Baratheon    Jaime LannisterGregor CleganeJaime Lannister, ...
Robb Stark                  Roose Bolton, Wylis Manderly, Medger Cerwyn, H...
Stannis Baratheon           Stannis Baratheon, Davos SeaworthStannis Barat...

The problem with this approach is that when a row has just 1 commander and it is summed with the next row, the output can look like 'Victarion GreyjoyAsha Greyjoy' instead of 'Victarion Greyjoy,Asha Greyjoy'. So does it make sense to use the list created using

[x for x in battles['attacker_commander'].dropna().str.split(',').sum()]

and feed it to a groupby('attacker_king'), or what approach do you folks suggest?

Answer 1

I think you first need apply with the join function:

battles.groupby('attacker_king')['attacker_commander'].apply(','.join)
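To see why join fixes the missing separator, here is a minimal sketch. The `battles` frame below is a stand-in built only from the six sample rows shown in the question, not the full dataset:

```python
import pandas as pd

# Stand-in for the question's data: only the six sample rows shown above.
battles = pd.DataFrame({
    "attacker_king": ["Joffrey/Tommen Baratheon"] * 3 + ["Robb Stark"] * 3,
    "attacker_commander": [
        "Jaime Lannister",
        "Gregor Clegane",
        "Jaime Lannister, Andros Brax",
        "Roose Bolton, Wylis Manderly, Medger Cerwyn",
        "Robb Stark, Brynden Tully",
        "Robb Stark, Tytos Blackwood, Brynden Tully",
    ],
})

# ','.join places a comma between the rows of each group, so names from
# adjacent rows no longer run together as they do with sum().
joined = battles.groupby("attacker_king")["attacker_commander"].apply(",".join)
print(joined["Joffrey/Tommen Baratheon"])
# Jaime Lannister,Gregor Clegane,Jaime Lannister, Andros Brax
```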

If you need to remove NaN:

battles.groupby('attacker_king')['attacker_commander'].apply(lambda x: ','.join(x.dropna()))
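The dropna step matters because a missing value is stored as a float NaN, and str.join only accepts strings. A small sketch on a made-up Series (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative group with one missing commander entry.
s = pd.Series(["Jaime Lannister", np.nan, "Gregor Clegane"])

error = None
try:
    ",".join(s)  # NaN is a float, so join refuses it
except TypeError as exc:
    error = exc

# Dropping NaN first leaves only strings, which join accepts.
cleaned = ",".join(s.dropna())
print(type(error).__name__)
print(cleaned)
```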

Then split and use set to get the unique values:

df = (battles.groupby('attacker_king')['attacker_commander']
             .apply(lambda x: list(set(','.join(x.dropna()).split(',')))))
print (df)

The best way to debug is to use a custom function first and then rewrite the code as a lambda:

def f(x):
    #Series by attacker_commander per group
    print (x)
    #first remove NaN
    print (x.dropna())
    #join by ,
    print (','.join(x.dropna()))
    #create list by split
    print (','.join(x.dropna()).split(','))
    #convert to set - unique values
    print (set(','.join(x.dropna()).split(',')))
    #set convert to list
    print (list(set(','.join(x.dropna()).split(','))))
    return list(set(','.join(x.dropna()).split(',')))

df = battles.groupby('attacker_king')['attacker_commander'].apply(f)
print (df)

Another possible solution is to first remove the rows with NaN in the column using DataFrame.dropna:

def f(x):
    return list(set(','.join(x).split(',')))

df = battles.dropna(subset=['attacker_commander']).groupby('attacker_king')['attacker_commander'].apply(f)
print (df)
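Two caveats about the set-based functions above: a set has no defined order, so the resulting lists can come out in a different order from run to run, and splitting on ',' keeps the space after each comma, so 'Andros Brax' and ' Andros Brax' would count as different names. A possible variant that strips and sorts the names is sketched below; `unique_commanders` is a hypothetical helper name and the frame is a stand-in built from the question's sample rows:

```python
import pandas as pd

# Stand-in for the question's data: only the six sample rows shown above.
battles = pd.DataFrame({
    "attacker_king": ["Joffrey/Tommen Baratheon"] * 3 + ["Robb Stark"] * 3,
    "attacker_commander": [
        "Jaime Lannister",
        "Gregor Clegane",
        "Jaime Lannister, Andros Brax",
        "Roose Bolton, Wylis Manderly, Medger Cerwyn",
        "Robb Stark, Brynden Tully",
        "Robb Stark, Tytos Blackwood, Brynden Tully",
    ],
})

def unique_commanders(x):
    # Hypothetical helper: join the group's rows, split into names,
    # strip surrounding whitespace, de-duplicate, and sort for a stable order.
    names = ",".join(x.dropna()).split(",")
    return sorted({name.strip() for name in names})

result = battles.groupby("attacker_king")["attacker_commander"].apply(unique_commanders)
print(result["Joffrey/Tommen Baratheon"])
# ['Andros Brax', 'Gregor Clegane', 'Jaime Lannister']
```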

Answer 2

You want to join the strings by group, then split and find the unique values.

df.groupby(
    'attacker_king'
).attacker_commander.apply(','.join).str.split(',').apply(pd.unique)

attacker_king
Joffrey/Tommen Baratheon      [Jaime Lannister, Gregor Clegane,  Andros Brax]
Robb Stark                  [Roose Bolton,  Wylis Manderly,  Medger Cerwyn...
Name: attacker_commander, dtype: object
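Note the leading spaces in the output above (for example ' Andros Brax'), which come from splitting on ',' alone. As an alternative sketch, assuming pandas 0.25 or newer (for explode) and names separated by a comma with optional whitespace, each row can be split on a regex, flattened to one commander per row, and then de-duplicated per king. The frame is a stand-in built from the question's sample rows:

```python
import pandas as pd

# Stand-in for the question's data: only the six sample rows shown above.
battles = pd.DataFrame({
    "attacker_king": ["Joffrey/Tommen Baratheon"] * 3 + ["Robb Stark"] * 3,
    "attacker_commander": [
        "Jaime Lannister",
        "Gregor Clegane",
        "Jaime Lannister, Andros Brax",
        "Roose Bolton, Wylis Manderly, Medger Cerwyn",
        "Robb Stark, Brynden Tully",
        "Robb Stark, Tytos Blackwood, Brynden Tully",
    ],
})

commanders = (
    battles.dropna(subset=["attacker_commander"])
    # Split on a comma plus any surrounding whitespace, giving a list per row.
    .assign(attacker_commander=lambda d: d["attacker_commander"].str.split(r"\s*,\s*"))
    # One row per (king, commander) pair.
    .explode("attacker_commander")
    .groupby("attacker_king")["attacker_commander"]
    # Unique names per king, in order of first appearance.
    .unique()
)
print(list(commanders["Joffrey/Tommen Baratheon"]))
# ['Jaime Lannister', 'Gregor Clegane', 'Andros Brax']
```

Unlike the set-based versions, this keeps the names in order of first appearance.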

2 answers

Answer 1
3 2017-02-13 07:06:26

Answer 2
1 2017-02-13 07:21:00
