How to feed a list as an input to a groupby function in pandas dataframe
Suppose a subset of the dataset contains these two columns:
   attacker_king             attacker_commander
0  Joffrey/Tommen Baratheon  Jaime Lannister
1  Joffrey/Tommen Baratheon  Gregor Clegane
2  Joffrey/Tommen Baratheon  Jaime Lannister, Andros Brax
3  Robb Stark                Roose Bolton, Wylis Manderly, Medger Cerwyn
4  Robb Stark                Robb Stark, Brynden Tully
5  Robb Stark                Robb Stark, Tytos Blackwood, Brynden Tully
My goal is to get, from this dataset, the commanders that each king deployed.
[x for x in battles['attacker_commander'].dropna().str.split(',').sum()]
The command above only returns a flat list of the comma-separated commanders. If I instead use the following groupby,
battles[['attacker_commander','attacker_king']].groupby('attacker_king').sum()
I get this output:
attacker_king             attacker_commander
Balon/Euron Greyjoy       Victarion GreyjoyAsha GreyjoyTheon GreyjoyTheo...
Joffrey/Tommen Baratheon  Jaime LannisterGregor CleganeJaime Lannister, ...
Robb Stark                Roose Bolton, Wylis Manderly, Medger Cerwyn, H...
Stannis Baratheon         Stannis Baratheon, Davos SeaworthStannis Barat...
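The problem with this approach is that when a row holds only one commander, summing it with the next row produces "Victarion GreyjoyAsha Greyjoy" instead of "Victarion Greyjoy, Asha Greyjoy", because sum concatenates the strings with no separator. So does it make sense to take the list created with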
[x for x in battles['attacker_commander'].dropna().str.split(',').sum()]
and feed it to groupby('attacker_king'), or which approach would you suggest?
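For readers who want to reproduce the problem, here is a minimal sketch. The `battles` frame below is an assumption: it is rebuilt by hand from the six rows shown in the question, not loaded from the real dataset, and the column is forced to `object` dtype so that `sum` concatenates strings as in the question's output.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: only the six rows shown above.
battles = pd.DataFrame(
    {
        "attacker_king": [
            "Joffrey/Tommen Baratheon",
            "Joffrey/Tommen Baratheon",
            "Joffrey/Tommen Baratheon",
            "Robb Stark",
            "Robb Stark",
            "Robb Stark",
        ],
        "attacker_commander": [
            "Jaime Lannister",
            "Gregor Clegane",
            "Jaime Lannister, Andros Brax",
            "Roose Bolton, Wylis Manderly, Medger Cerwyn",
            "Robb Stark, Brynden Tully",
            "Robb Stark, Tytos Blackwood, Brynden Tully",
        ],
    },
    dtype=object,
)

grouped = battles.groupby("attacker_king")["attacker_commander"]

# sum() concatenates the strings of each group with no separator at all.
summed = grouped.sum()
print(summed["Joffrey/Tommen Baratheon"])

# Joining with an explicit separator keeps the names apart.
joined = grouped.apply(", ".join)
print(joined["Joffrey/Tommen Baratheon"])
```

The first print shows the run-together names from the question; the second shows the same group with a separator between rows.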
I think you first need apply with the join function:
battles.groupby('attacker_king')['attacker_commander'].apply(','.join)
If you need to remove NaN values:
battles.groupby('attacker_king')['attacker_commander'].apply(lambda x: ','.join(x.dropna()))
Then split and use set to keep only the unique values:
# Parentheses let the chained call continue on the next line.
df = (battles.groupby('attacker_king')['attacker_commander']
             .apply(lambda x: list(set(','.join(x.dropna()).split(',')))))
print (df)
The best way to debug this is to write a custom function first, and only then rewrite the code as a lambda:
def f(x):
    # Series of attacker_commander values for one group
    print(x)
    # first remove NaN
    print(x.dropna())
    # join by ,
    print(','.join(x.dropna()))
    # create list by split
    print(','.join(x.dropna()).split(','))
    # convert to set - unique values
    print(set(','.join(x.dropna()).split(',')))
    # convert set to list
    print(list(set(','.join(x.dropna()).split(','))))
    return list(set(','.join(x.dropna()).split(',')))

df = battles.groupby('attacker_king')['attacker_commander'].apply(f)
print(df)
Another possible solution is to first remove the rows that have NaN in that column with DataFrame.dropna:
def f(x):
    return list(set(','.join(x).split(',')))

df = battles.dropna(subset=['attacker_commander']).groupby('attacker_king')['attacker_commander'].apply(f)
print (df)
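One detail worth checking in the sample data: the names are separated by a comma followed by a space, so splitting on ',' alone leaves a leading space on every name after the first, and set would then treat 'Robb Stark' and ' Robb Stark' as different values. The sketch below is one way to handle that, not part of the original answer. The helper name `unique_commanders` and the hand-built `battles` frame (the question's six rows plus one invented row with a missing commander) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: the question's rows plus one row with a missing value.
battles = pd.DataFrame(
    {
        "attacker_king": [
            "Joffrey/Tommen Baratheon",
            "Joffrey/Tommen Baratheon",
            "Joffrey/Tommen Baratheon",
            "Robb Stark",
            "Robb Stark",
            "Robb Stark",
            "Robb Stark",
        ],
        "attacker_commander": [
            "Jaime Lannister",
            "Gregor Clegane",
            "Jaime Lannister, Andros Brax",
            "Roose Bolton, Wylis Manderly, Medger Cerwyn",
            "Robb Stark, Brynden Tully",
            "Robb Stark, Tytos Blackwood, Brynden Tully",
            None,
        ],
    },
    dtype=object,
)


def unique_commanders(names):
    """Join one group's rows, split on commas, strip spaces, deduplicate."""
    parts = ",".join(names.dropna()).split(",")
    # A set removes duplicates; sorting makes the result deterministic.
    return sorted({part.strip() for part in parts})


result = battles.groupby("attacker_king")["attacker_commander"].apply(unique_commanders)
print(result["Robb Stark"])
```

Sorting is a design choice here: a plain set has no guaranteed order, so the list could otherwise differ between runs.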
You want to join the strings per group, then split them and find the unique values.
df.groupby(
'attacker_king'
).attacker_commander.apply(','.join).str.split(',').apply(pd.unique)
attacker_king
Joffrey/Tommen Baratheon    [Jaime Lannister, Gregor Clegane, Andros Brax]
Robb Stark                  [Roose Bolton, Wylis Manderly, Medger Cerwyn...
Name: attacker_commander, dtype: object
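The same join, split, unique idea can also be written by splitting first and giving each name its own row. This is an alternative sketch, not the answerer's code: it assumes a pandas version that has DataFrame.explode (0.25 or later), and the `battles` frame is again rebuilt by hand from the question's six rows.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: only the six rows shown above.
battles = pd.DataFrame(
    {
        "attacker_king": [
            "Joffrey/Tommen Baratheon",
            "Joffrey/Tommen Baratheon",
            "Joffrey/Tommen Baratheon",
            "Robb Stark",
            "Robb Stark",
            "Robb Stark",
        ],
        "attacker_commander": [
            "Jaime Lannister",
            "Gregor Clegane",
            "Jaime Lannister, Andros Brax",
            "Roose Bolton, Wylis Manderly, Medger Cerwyn",
            "Robb Stark, Brynden Tully",
            "Robb Stark, Tytos Blackwood, Brynden Tully",
        ],
    },
    dtype=object,
)

exploded = (
    battles.dropna(subset=["attacker_commander"])
    # Turn each comma-separated string into a list of names.
    .assign(attacker_commander=lambda d: d["attacker_commander"].str.split(","))
    # One row per (king, commander) pair.
    .explode("attacker_commander")
)
# Remove the space that followed each comma.
exploded["attacker_commander"] = exploded["attacker_commander"].str.strip()

# unique() keeps the first occurrence of each name, in order of appearance.
result = exploded.groupby("attacker_king")["attacker_commander"].unique()
print(result)
```

Because every commander ends up on a separate row, follow-up questions such as counting battles per commander become ordinary groupby operations on `exploded`.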