Pandas: pivoting a column into a (conditional) aggregated string

Let's say I have the following data set, turned into a dataframe:

import datetime
import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019, 6, 10), 'Tom', 'Tom']
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

This yields a dataframe that looks like:

     Job        Date Employee Manager
0  Job 1  2019-06-09      Jim     Tom
1  Job 1  2019-06-09     Bill     Tom
2  Job 1  2019-06-09      Tom     Tom
3  Job 1  2019-06-10     Bill    None
4  Job 2  2019-06-10      Tom     Tom

What I am trying to generate is a pivot on each unique Job/Date combination, with one column for the Manager and one column holding a comma-separated string of the non-manager employees. A couple of things to assume:

  1. All employee names are unique (I'll actually be using unique employee ids rather than names), and Managers are also "employees", so there will never be a case where an employee and a manager share the same name/id but are different individuals.
  2. A work crew can have a manager, or not (see the row with id 3 for an example without one).
  3. A manager will always also be listed as an employee (see the rows with id 2 or 4).
  4. A job could have a manager with no additional employees (see the row with id 4).

I'd like the resulting dataframe to look like:

     Job        Date  Manager     Employees
0  Job 1  2019-06-09      Tom     Jim, Bill
1  Job 1  2019-06-10     None          Bill
2  Job 2  2019-06-10      Tom          None

Which leads to my questions:

  1. Is there a way to do a ','.join-like aggregation in a pandas pivot?
  2. Is there a way to make this aggregation conditional (exclude the name/id that appears in the Manager column)?

I suspect 1) is possible, and 2) might be more difficult. If the answer to 2) is no, I can work around it in other ways later in my code.

The tricky part here is removing the Manager from the Employee column.


u = df.melt(['Job', 'Date'])
f = u[~u.duplicated(['Job', 'Date', 'value'], keep='last')].astype(str)

f.pivot_table(
    index=['Job', 'Date'],
    columns='variable', values='value',
    aggfunc=','.join
).rename_axis(None, axis=1)

                  Employee Manager
Job   Date
Job 1 2019-06-09  Jim,Bill     Tom
      2019-06-10      Bill    None
Job 2 2019-06-10       NaN     Tom
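
To see why the melt-based approach removes the manager from the Employee column, it can help to inspect the duplicate mask on the melted frame. The following is a small sketch on the question's sample data; the names `dup` and `dropped` are only illustrative and do not appear in the answer above.

```python
import datetime
import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019, 6, 10), 'Tom', 'Tom'],
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

# melt stacks the Employee rows first, then the Manager rows
u = df.melt(['Job', 'Date'])

# keep='last' flags every occurrence except the last one, so an Employee
# entry that reappears later as the Manager of the same Job/Date is flagged
dup = u.duplicated(['Job', 'Date', 'value'], keep='last')

# rows with dropped=True are the ones the ~u.duplicated(...) filter removes
print(u.assign(dropped=dup))
```

Because the Manager rows come after the Employee rows in the melted frame, the surviving copy of a manager's name is always a Manager row, which is what leaves the Employee column free of managers.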

Group to aggregate, then fix the Employees by removing the Manager and setting the value to None where appropriate. Since the employees are unique, sets work nicely here for removing the Manager.

s = df.groupby(['Job', 'Date']).agg({'Manager': 'first', 'Employee': lambda x: set(x)})
s['Employee'] = [', '.join(x.difference({y})) for x,y in zip(s.Employee, s.Manager)]
s['Employee'] = s.Employee.replace({'': None})

                 Manager   Employee
Job   Date                         
Job 1 2019-06-09     Tom  Jim, Bill
      2019-06-10    None       Bill
Job 2 2019-06-10     Tom       None
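
Python sets do not guarantee any particular order, so the names in the joined string may not come out in row order. If the original order matters, one possible variant (a sketch, not the approach used above) is to filter out the manager rows before grouping and then join the result back onto the per-group managers. The names `keys`, `non_mgr`, `managers`, `employees` and `result` are only illustrative.

```python
import datetime
import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019, 6, 10), 'Tom', 'Tom'],
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

keys = ['Job', 'Date']

# keep only rows where the employee is not the manager;
# a missing Manager never equals an employee, so those rows are kept
non_mgr = df[df['Employee'] != df['Manager']]

# groupby keeps the original row order within each group
employees = non_mgr.groupby(keys)['Employee'].agg(', '.join)

# one manager per Job/Date group (missing if the group has none)
managers = df.groupby(keys)['Manager'].first()

# left join so that groups with no non-manager employees are kept
result = managers.to_frame().join(employees.rename('Employees')).reset_index()
print(result)
```

Groups without any non-manager employee end up with a missing value in `Employees`, rather than an empty string.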

I'm partial to building up a dictionary with the desired results and reconstructing the dataframe from it.

d = {}
for t in df.itertuples():
    d_ = d.setdefault((t.Job, t.Date), {})
    d_['Manager'] = t.Manager
    d_.setdefault('Employees', set()).add(t.Employee)

for k, v in d.items():
    v['Employees'] -= {v['Manager']}
    v['Employees'] = ', '.join(v['Employees'])

pd.DataFrame(d.values(), d).rename_axis(['Job', 'Date']).reset_index()

     Job       Date  Employees Manager
0  Job 1 2019-06-09  Bill, Jim     Tom
1  Job 1 2019-06-10       Bill    None
2  Job 2 2019-06-10                Tom

In your case, try transform + drop_duplicates, without using a lambda:

df['Employee']=df['Employee'].mask(df['Employee'].eq(df.Manager)).dropna().groupby([df['Job'], df['Date']]).transform('unique').str.join(',')
df=df.drop_duplicates(['Job','Date'])
df
Out[745]: 
     Job        Date  Employee Manager
0  Job 1  2019-06-09  Jim,Bill     Tom
3  Job 1  2019-06-10      Bill    None
4  Job 2  2019-06-10       NaN     Tom

How about:

df.groupby(["Job","Date","Manager"]).apply( lambda x: ",".join(x.Employee))

This finds every unique combination of Job, Date and Manager and joins the employees of each combination with "," into one string.
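
Two caveats apply when Manager is used as a grouping key: by default, groupby drops groups whose key is missing, so the crew without a manager disappears, and the manager's own name stays in the joined string. The sketch below compares the default behaviour with `dropna=False`, assuming a pandas version that supports that parameter (1.1 or later); `default` and `keep_na` are illustrative names.

```python
import datetime
import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019, 6, 10), 'Tom', 'Tom'],
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

keys = ['Job', 'Date', 'Manager']

# default: the group whose Manager is missing is dropped
default = df.groupby(keys)['Employee'].agg(','.join)

# dropna=False keeps the missing-manager group as its own key
keep_na = df.groupby(keys, dropna=False)['Employee'].agg(','.join)

print(default)
print(keep_na)
```

In both cases the manager still appears among the joined names, so this would need to be combined with one of the filtering steps from the other answers to match the requested output.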
