How to transpose data based on multiple lookup criteria in pandas - python
I am a payroll specialist and I often get very strange reports in which one employee's name and ID number appear on multiple rows (columns A and B), with the corresponding data spread across many columns. Example:
id#, Name, PTO Code, Accrued Amount $, Accrued Hours, Used Amount, Used Hours, LeftAmount, LeftHours
101, Empl1, NY Sick, 0, 112, 0, 56, 0, 56
101, Empl1, Plan1, TO Am, 3600, 0, 1500, 0, 2100, 0
101, Empl1, Plan1, PTO Hrs, 0, 240, 0, 100, 0, 140
101, Empl1, Plan2, PTO Am, 6000, 0, 6000, 0, 0, 0
101, Empl1, Plan2, PTO Hrs, 0, 400, 0, 400, 0, 0
201, Empl2,
etc...
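For reference, here is a minimal sketch that loads the sample above into pandas. It assumes the stray commas in rows like `Plan1, TO Am` are really part of a single PTO code (e.g. `Plan1 TO Am`), which is a guess about the report layout:

```python
import pandas as pd

# Sample rows mirroring the report above. The PTO codes with embedded
# commas are ASSUMED to be single strings like 'Plan1 TO Am'.
df = pd.DataFrame(
    [
        [101, "Empl1", "NY Sick",       0,    112, 0,    56,  0,    56],
        [101, "Empl1", "Plan1 TO Am",   3600, 0,   1500, 0,   2100, 0],
        [101, "Empl1", "Plan1 PTO Hrs", 0,    240, 0,    100, 0,    140],
        [101, "Empl1", "Plan2 PTO Am",  6000, 0,   6000, 0,   0,    0],
        [101, "Empl1", "Plan2 PTO Hrs", 0,    400, 0,    400, 0,    0],
    ],
    columns=["id#", "Name", "PTO Code", "Accrued Amount $", "Accrued Hours",
             "Used Amount", "Used Hours", "LeftAmount", "LeftHours"],
)
print(df.shape)  # (5, 9)
```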
Reports like this are a headache... I wrote some code (please don't laugh when you see it) to organize the data. This is the output:
id#, Name, NYC Sick Hours Accrued, NYC Sick Hours Used, NYC Sick Hours Left, Plan1 PTO Amount Accrued, Plan1 PTO Amount Used, Plan1 PTO Amount Left, Plan1 PTO Hours Accrued
101, Empl1, 112, 56, 56, 3600, 1500, 2100, 240
201 Empl2
etc...
My goal is accomplished, but it would be great to see some first-class (DRY) code that performs the same task. Please see my code below.
import pandas as pd
df = pd.read_excel('PTO report.xlsx')
# Create a new dataframe; keep only the ID and Name columns and drop duplicates. This will be the output file
df_new = df[['ID#', 'Name']].drop_duplicates()
# Add the new columns
df_new[[
'NYC Sick Hours Accrued', 'NYC Sick Hours Used',
'NYC Sick Hours Left', 'Plan1 PTO Amount Accrued',
'Plan1 PTO Amount Used', 'Plan1 PTO Amount Left', 'Plan1 PTO Hours Accrued',
'Plan1 PTO Hours Used', 'Plan1 PTO Hours Left', 'Plan2 PTO Amount Accrued',
'Plan2 PTO Amount Used', 'Plan2 PTO Amount Left', 'Plan2 PTO Hours Accrued',
'Plan2 PTO Hours Used', 'Plan2 PTO Hours Left']] = 0
# Loop over each employee and perform a vlookup-like operation. In my opinion, this part of the code needs improvement.
for i in df_new['ID#']:
    # NY Sick Hours Accrued ****************************************
    filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'NYC Sick Hours Accrued'] = int(df.loc[filt, 'Accrued Hours'])
    # NY Sick Hours Used
    filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'NYC Sick Hours Used'] = int(df.loc[filt, 'Used Hours'])
    # NY Sick Hours Left
    filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'NYC Sick Hours Left'] = int(df.loc[filt, 'Ending Balance Hours'])
    # Plan1 PTO Amount Accrued *************************************
    filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan1 PTO Amount Accrued'] = int(df.loc[filt, 'Accrued Amount $'])
    # Plan1 PTO Amount Used
    filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan1 PTO Amount Used'] = int(df.loc[filt, 'Used Amount $'])
    # Plan1 PTO Amount Left
    filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan1 PTO Amount Left'] = int(df.loc[filt, 'Ending Balance Amount $'])
    # Plan1 PTO Hours Accrued
    filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan1 PTO Hours Accrued'] = int(df.loc[filt, 'Accrued Hours'])
    # Plan1 PTO Hours Used
    filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan1 PTO Hours Used'] = int(df.loc[filt, 'Used Hours'])
    # Plan1 PTO Hours Left
    filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan1 PTO Hours Left'] = int(df.loc[filt, 'Ending Balance Hours'])
    # Plan2 PTO Amount Accrued **************************************
    filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan2 PTO Amount Accrued'] = int(df.loc[filt, 'Accrued Amount $'])
    # Plan2 PTO Amount Used
    filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan2 PTO Amount Used'] = int(df.loc[filt, 'Used Amount $'])
    # Plan2 PTO Amount Left
    filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan2 PTO Amount Left'] = int(df.loc[filt, 'Ending Balance Amount $'])
    # Plan2 PTO Hours Accrued
    filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan2 PTO Hours Accrued'] = int(df.loc[filt, 'Accrued Hours'])
    # Plan2 PTO Hours Used
    filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan2 PTO Hours Used'] = int(df.loc[filt, 'Used Hours'])
    # Plan2 PTO Hours Left
    filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i
    df_new.loc[filt2, 'Plan2 PTO Hours Left'] = int(df.loc[filt, 'Ending Balance Hours'])
# Export to a new file
df_new.to_excel('PTO report transposed.xlsx', index = False)
I wrote a macro in Excel VBA that does the same job; there I used a class object instead of "vlookup". There must be a simpler solution. This is my first post, so please let me know if my question is unclear or the title is wrong.
Thank you for your time!
Use DataFrame.set_index and DataFrame.stack to reshape into a MultiIndex Series, then remove the 0 values, reshape back with Series.unstack, and finally flatten the MultiIndex in columns:
s = df.set_index(['id#','Name','PTO Code']).stack()
df1 = s[s.ne(0)].unstack([-2,-1])
#alternatives if duplicates
#df1 = s[s.ne(0)].groupby(level=[0,1,2,3]).sum().unstack([-2,-1])
df1.columns = df1.columns.map(lambda x: f'{x[0]} {x[1]}')
df1 = df1.reset_index()
print (df1)
id# Name NY Sick Accrued Hours NY Sick Used Hours NY Sick LeftHours \
0 101 Empl1 112 56 56
Plan1 TO Am Accrued Amount $ Plan1 TO Am Used Amount \
0 3600 1500
Plan1 TO Am LeftAmount Plan1 PTO Hrs Accrued Hours \
0 2100 240
Plan1 PTO Hrs Used Hours Plan1 PTO Hrs LeftHours \
0 100 140
Plan2 PTO Am Accrued Amount $ Plan2 PTO Am Used Amount \
0 6000 6000
Plan2 PTO Hrs Accrued Hours Plan2 PTO Hrs Used Hours
0 400 400
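As a self-contained check, the steps above can be run end to end on a small reconstruction of the sample data (the comma-split PTO codes are assumed to be single strings such as `Plan1 TO Am`):

```python
import pandas as pd

# Two sample rows reconstructed from the question (PTO codes assumed)
df = pd.DataFrame(
    [
        [101, "Empl1", "NY Sick",     0,    112, 0,    56, 0,    56],
        [101, "Empl1", "Plan1 TO Am", 3600, 0,   1500, 0,  2100, 0],
    ],
    columns=["id#", "Name", "PTO Code", "Accrued Amount $", "Accrued Hours",
             "Used Amount", "Used Hours", "LeftAmount", "LeftHours"],
)

# Reshape to a long MultiIndex Series, drop the zeros, pivot back wide
s = df.set_index(["id#", "Name", "PTO Code"]).stack()
df1 = s[s.ne(0)].unstack([-2, -1])

# Flatten the two-level columns into single strings like 'NY Sick Accrued Hours'
df1.columns = df1.columns.map(lambda x: f"{x[0]} {x[1]}")
df1 = df1.reset_index()
print(df1)
```

Each nonzero cell becomes its own `"<PTO Code> <original column>"` column, so one employee collapses to one row.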
It's not entirely clear what you want, especially since there are potential errors in your columns (stray commas in the PTO Code column?).
In any case, assuming the PTO code is a single column, this is essentially a pivot with the zeros/NaN dropped:
(df.replace(0, float('nan'))
.pivot(index=['id#','Name'], columns='PTO Code')
.dropna(how='all', axis=1)
)
Output:
Accrued Amount $ Accrued Hours Used Amount Used Hours LeftAmount LeftHours
PTO Code Plan1 TO Am Plan2 PTO Am NY Sick Plan PTO Hrs Plan2 PTO Hrs Plan1 TO Am Plan2 PTO Am NY Sick Plan PTO Hrs Plan2 PTO Hrs Plan1 TO Am NY Sick Plan PTO Hrs
id# Name
101 Empl1 3600.0 6000.0 112.0 240.0 400.0 1500.0 6000.0 56.0 100.0 400.0 2100.0 56.0 140.0
201 Empl2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
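The same idea as a runnable sketch on reconstructed sample data (column names and the single-column PTO codes are assumptions taken from the question; `pivot` with a list `index` needs pandas >= 1.1):

```python
import pandas as pd

# Two sample rows reconstructed from the question (PTO codes assumed)
df = pd.DataFrame(
    [
        [101, "Empl1", "NY Sick",     0,    112, 0,    56, 0,    56],
        [101, "Empl1", "Plan1 TO Am", 3600, 0,   1500, 0,  2100, 0],
    ],
    columns=["id#", "Name", "PTO Code", "Accrued Amount $", "Accrued Hours",
             "Used Amount", "Used Hours", "LeftAmount", "LeftHours"],
)

# Replace zeros with NaN, pivot on the PTO code, then drop all-NaN columns
out = (
    df.replace(0, float("nan"))
      .pivot(index=["id#", "Name"], columns="PTO Code")
      .dropna(how="all", axis=1)
)
print(out)
```

The result has one row per (id#, Name) and a two-level column index of (original column, PTO code), keeping only combinations that actually held data.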