
How to transpose data based on multiple lookup criteria in pandas - python

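I am a payroll specialist and often come across very odd reports in which one employee's name and ID number appear on multiple rows (columns A and B), with the corresponding data spread across many columns. Example: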

id#, Name, PTO Code, Accrued Amount $, Accrued Hours, Used Amount, Used Hours, LeftAmount, LeftHours
101, Empl1, NY Sick,         0,            112,          0,         56,          0,           56
101, Empl1, Plan1, TO Am,    3600,          0,         1500,          0,         2100,        0               
101, Empl1, Plan1, PTO Hrs,   0,           240,          0,         100,           0,         140
101, Empl1, Plan2, PTO Am,   6000,          0,         6000,         0,            0,          0
101, Empl1, Plan2, PTO Hrs,    0,         400,           0,           400,          0,         0
201, Empl2, 

And so on...

This kind of report is a headache... I wrote some code (please don't laugh when you see it) to organize the data. Here is the output:

id#, Name, NYC Sick,       NYC Sick     NYC Sick    Plan1 PTO Am  Plan1 PTOAm  Plan1 PTOAm  Plan1PTO 
           Hours Accrued  Hours Used   Hours Left      Accrued       Used         Left    HrsAccrued
101  Empl1    112              56          56            3600         1500         2100       240
201  Empl2 

And so on...

My goal is accomplished, but it would be great to see some first-class (DRY) code that performs the same task. Please see my code below.

import pandas as pd
df = pd.read_excel('PTO report.xlsx')

Create a new DataFrame; keep only the ID and Name columns; drop duplicate values. This will be the output file.

df_new = df[['ID#', 'Name']].drop_duplicates()

Add the new columns

df_new[[
    'NYC Sick Hours Accrued', 'NYC Sick Hours Used', 
    'NYC Sick Hours Left', 'Plan1 PTO Amount Accrued', 
    'Plan1 PTO Amount Used', 'Plan1 PTO Amount Left', 'Plan1 PTO Hours Accrued', 
    'Plan1 PTO Hours Used', 'Plan1 PTO Hours Left', 'Plan2 PTO Amount Accrued', 
    'Plan2 PTO Amount Used', 'Plan2 PTO Amount Left', 'Plan2 PTO Hours Accrued', 
    'Plan2 PTO Hours Used', 'Plan2 PTO Hours Left']] = 0

Loop over each employee and perform a vlookup-like operation. In my opinion, this part of the code must be improved.

for i in df_new['ID#']:

    #NY Sick Hours Accrued ****************************************
    filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i )
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'NYC Sick Hours Accrued'] = int(df.loc[filt,'Accrued Hours'])
       
    #NY Sick Hours Used
    filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'NYC Sick Hours Used'] = int(df.loc[filt,'Used Hours'])
    
    #NY Sick Hours Left
    filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'NYC Sick Hours Left'] = int(df.loc[filt,'Ending Balance Hours'])
        
    #Plan1 PTO Amount Accrued *************************************
    filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Amount Accrued'] = int(df.loc[filt,'Accrued Amount $'])
    
    #Plan1 PTO Amount Used
    filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Amount Used'] = int(df.loc[filt,'Used Amount $'])
    
    #Plan1 PTO Amount Left 
    filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Amount Left'] = int(df.loc[filt,'Ending Balance Amount $'])
    
    
    #Plan1 PTO Hours Accrued 
    filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Hours Accrued'] = int(df.loc[filt,'Accrued Hours']) 
    
    #Plan1 PTO Hours Used
    filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Hours Used'] = int(df.loc[filt,'Used Hours']) 
    
    #Plan1 PTO Hours Left
    filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Hours Left'] = int(df.loc[filt,'Ending Balance Hours'])    

       
    #Plan2 PTO Amount Accrued **************************************
    filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Amount Accrued'] = int(df.loc[filt,'Accrued Amount $'])
    
    #Plan2 PTO Amount Used
    filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Amount Used'] = int(df.loc[filt,'Used Amount $'])
    
    #Plan2 PTO Amount Left 
    filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Amount Left'] = int(df.loc[filt,'Ending Balance Amount $'])
     
    #Plan2 PTO Hours Accrued 
    filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Hours Accrued'] = int(df.loc[filt,'Accrued Hours'])
    
    #Plan2 PTO Hours Used
    filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Hours Used'] = int(df.loc[filt,'Used Hours'])
    
    #Plan2 PTO Hours Left
    filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Hours Left'] = int(df.loc[filt,'Ending Balance Hours']) 

Export to a new file

df_new.to_excel('PTO report transposed.xlsx', index = False)

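I wrote a macro in Excel VBA that does the same job. I used a class object instead of "vlookup". There must be a simple solution. This is my first post, so please let me know if my question is unclear or the title is wrong.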

Thank you for your time!

Use DataFrame.set_index with DataFrame.stack to reshape into a MultiIndex Series, then remove the 0 rows and reshape with Series.unstack, and finally flatten the MultiIndex in the columns:

s = df.set_index(['id#','Name','PTO Code']).stack()
df1 = s[s.ne(0)].unstack([-2,-1])
#alternatives if duplicates
#df1 = s[s.ne(0)].groupby(level=[0,1,2,3]).sum().unstack([-2,-1])
df1.columns = df1.columns.map(lambda x: f'{x[0]} {x[1]}')
df1 = df1.reset_index()

print (df1)
   id#   Name  NY Sick Accrued Hours  NY Sick Used Hours  NY Sick LeftHours  \
0  101  Empl1                    112                  56                 56   

   Plan1 TO Am Accrued Amount $  Plan1 TO Am Used Amount  \
0                          3600                     1500   

   Plan1 TO Am LeftAmount  Plan1 PTO Hrs Accrued Hours  \
0                    2100                          240   

   Plan1 PTO Hrs Used Hours  Plan1 PTO Hrs LeftHours  \
0                       100                      140   

   Plan2 PTO Am Accrued Amount $  Plan2 PTO Am Used Amount  \
0                           6000                      6000   

   Plan2 PTO Hrs Accrued Hours  Plan2 PTO Hrs Used Hours  
0                          400                       400  
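
As a self-contained illustration of the same stack/unstack idea, here is a minimal sketch that can be run without the Excel file. The sample frame is hypothetical: it assumes 'PTO Code' is a single column (no embedded commas), and the values for employee 201 are invented purely to show how missing plans appear as NaN.

```python
import pandas as pd

# Hypothetical sample shaped like the question's report.
# The rows for id# 201 are made up for illustration only.
df = pd.DataFrame({
    'id#': [101, 101, 101, 201],
    'Name': ['Empl1', 'Empl1', 'Empl1', 'Empl2'],
    'PTO Code': ['NY Sick', 'Plan1 PTO Am', 'Plan1 PTO Hrs', 'NY Sick'],
    'Accrued Amount $': [0, 3600, 0, 0],
    'Accrued Hours': [112, 0, 240, 80],
    'Used Amount': [0, 1500, 0, 0],
    'Used Hours': [56, 0, 100, 8],
    'LeftAmount': [0, 2100, 0, 0],
    'LeftHours': [56, 0, 140, 72],
})

# Long format: one value per (id#, Name, PTO Code, measure).
s = df.set_index(['id#', 'Name', 'PTO Code']).stack()

# Drop the zero entries, then move PTO Code and measure into the columns.
wide = s[s.ne(0)].unstack([-2, -1])

# Flatten the two-level column index into 'PTO Code measure' labels.
wide.columns = [f'{code} {measure}' for code, measure in wide.columns]
wide = wide.reset_index()

print(wide)
```

Because employee 201 has no Plan1 rows in this sample, the Plan1 columns hold NaN for that employee, which also turns those columns into floats.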

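It is not entirely clear what you want, especially as there are potential errors in your columns (commas in the PTO column that look the same as the separator?).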

Anyway, assuming PTO Code is a single column, this is essentially a pivot with the zeros/NaN removed:

(df.replace(0, float('nan'))
   .pivot(index=['id#','Name'], columns='PTO Code')
   .dropna(how='all', axis=1)
 )

Output:

          Accrued Amount $              Accrued Hours                            Used Amount              Used Hours                             LeftAmount LeftHours             
PTO Code       Plan1 TO Am Plan2 PTO Am       NY Sick Plan PTO Hrs Plan2 PTO Hrs Plan1 TO Am Plan2 PTO Am    NY Sick Plan PTO Hrs Plan2 PTO Hrs Plan1 TO Am   NY Sick Plan PTO Hrs
id# Name                                                                                                                                                                          
101 Empl1           3600.0       6000.0         112.0        240.0         400.0      1500.0       6000.0       56.0        100.0         400.0      2100.0      56.0        140.0
201 Empl2              NaN          NaN           NaN          NaN           NaN         NaN          NaN        NaN          NaN           NaN         NaN       NaN          NaN
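
If flat column names like those in the question's desired output are preferred, the two column levels produced by the pivot can be joined afterwards. The following is only a sketch on a hypothetical three-row sample (again assuming 'PTO Code' is a single column); passing a list to `index` in `pivot` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical sample for one employee, with 'PTO Code' as a single column.
df = pd.DataFrame({
    'id#': [101, 101, 101],
    'Name': ['Empl1', 'Empl1', 'Empl1'],
    'PTO Code': ['NY Sick', 'Plan1 PTO Am', 'Plan1 PTO Hrs'],
    'Accrued Amount $': [0, 3600, 0],
    'Accrued Hours': [112, 0, 240],
    'Used Amount': [0, 1500, 0],
    'Used Hours': [56, 0, 100],
    'LeftAmount': [0, 2100, 0],
    'LeftHours': [56, 0, 140],
})

# Turn zeros into NaN, pivot PTO Code into the columns,
# and drop the columns that contain no data at all.
wide = (df.replace(0, float('nan'))
          .pivot(index=['id#', 'Name'], columns='PTO Code')
          .dropna(how='all', axis=1))

# pivot puts the measure on the first column level and PTO Code on the second;
# join them as 'PTO Code measure' to get a single header row.
wide.columns = [f'{code} {measure}' for measure, code in wide.columns]
wide = wide.reset_index()

print(wide)
```

The values come back as floats because NaN is used as the placeholder for the removed zeros.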
