Pandas groupby + resample/TimeGrouper for change over months from start

I have a dataframe of employee salary data (sample below), where 'Date' is the date on which the employee's salary became effective:

Employee    Date        Salary
PersonA     1/1/2016    $50000 
PersonB     3/5/2014    $65000 
PersonB     3/1/2015    $75000 
PersonB     3/1/2016    $100000 
PersonC     5/15/2010   $75000 
PersonC     6/3/2011    $100000 
PersonC     3/10/2012   $110000 
PersonC     9/5/2012    $130000 
PersonC     3/1/2013    $150000 
PersonC     3/1/2014    $200000 

In this example, PersonA started this year at $50,000, while PersonC has been with the company for a while and has received several increases since starting on 5/15/2010.

I need to convert the Date column to Months from Start, on a per-employee basis, where Months from Start is in increments of m months (specified by me). For example, for PersonB, assuming m=12, the result would be:

Employee    Months From Start   Salary
PersonB     0                   $65000 
PersonB     12                  $65000 
PersonB     24                  $75000 

This means that at month 0 (employment start), PersonB had a salary of $65,000; 12 months later his salary was $65,000, and 24 months later his salary was $75,000. Note that the next increment (36 months) would NOT appear on the transformed dataframe for PersonB because that duration exceeds the duration of PersonB's employment (it would be in the future).

Note again that I want to be able to adjust m to any month increment. If I wanted increments of 6 months (m=6), the result would be:

Employee    Months From Start   Salary
PersonB     0                   $65000 
PersonB     6                   $65000 
PersonB     12                  $65000 
PersonB     18                  $75000 
PersonB     24                  $100000 
PersonB     30                  $100000 

As a final step, I would also like to include the employee's salary as of today on the transformed dataframe. Using PersonB again, and assuming m=6, the result would be:

Employee    Months From Start   Salary
PersonB     0                   $65000 
PersonB     6                   $65000 
PersonB     12                  $65000 
PersonB     18                  $75000 
PersonB     24                  $100000 
PersonB     30                  $100000 
PersonB     32.92               $100000 <--added (today is 32.92 months from start)

Question: is there a programmatic way (I assume using at least one of groupby, resample, or TimeGrouper) to achieve the desired dataframe described above?

Note: you can assume all employees are active (have not left the company).

You can combine groupby and resample to do it. To use resample, you need to have the date as the index.

df.index = pd.to_datetime(df.Date)
df.drop('Date',axis = 1, inplace = True)

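Then:

# Forward-fill each employee's salary across 6-month bins.
# '6M' is the month-end alias (pandas 2.2+ spells it '6ME');
# ffill() replaces the older pad(), which was removed in pandas 2.0.
df.groupby('Employee').resample('6M').ffill()

In this case, I'm using 6-month periods. Note that resample labels each period with the last day of a month; I hope that is not a problem. You will then have: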

    Employee   Date      Salary
0   PersonA 2016-01-31   $50000
1   PersonB 2014-03-31   $65000
2   PersonB 2014-09-30   $65000
3   PersonB 2015-03-31   $75000
4   PersonB 2015-09-30   $75000
5   PersonB 2016-03-31  $100000
6   PersonC 2010-05-31   $75000
7   PersonC 2010-11-30   $75000
8   PersonC 2011-05-31   $75000
9   PersonC 2011-11-30  $100000
10  PersonC 2012-05-31  $110000
11  PersonC 2012-11-30  $130000
12  PersonC 2013-05-31  $150000
13  PersonC 2013-11-30  $150000
14  PersonC 2014-05-31  $200000

Now you can create the "months since started" column on the resampled frame. cumcount numbers the rows within each group in order of appearance, starting at 0, so remember to multiply it by the number of months in each period (in this case, 6):

df['Months since started'] = df.groupby('Employee').cumcount()*6

     Employee   Date      Salary     Months since started
0   PersonA 2016-01-31   $50000                  0
1   PersonB 2014-03-31   $65000                  0
2   PersonB 2014-09-30   $65000                  6
3   PersonB 2015-03-31   $75000                 12
4   PersonB 2015-09-30   $75000                 18
5   PersonB 2016-03-31  $100000                 24
6   PersonC 2010-05-31   $75000                  0
7   PersonC 2010-11-30   $75000                  6
8   PersonC 2011-05-31   $75000                 12
9   PersonC 2011-11-30  $100000                 18
10  PersonC 2012-05-31  $110000                 24
11  PersonC 2012-11-30  $130000                 30
12  PersonC 2013-05-31  $150000                 36
13  PersonC 2013-11-30  $150000                 42
14  PersonC 2014-05-31  $200000                 48
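To make the cumcount step concrete, here is a minimal sketch on a small hand-made frame. The frame `periods` and the variable `m` are illustrative names, not part of the answer above, and the sketch assumes there is exactly one row per period for each employee.

```python
import pandas as pd

# Toy frame (hypothetical data): one row per 6-month period per employee.
periods = pd.DataFrame({
    "Employee": ["PersonA", "PersonB", "PersonB", "PersonB"],
    "Salary": [50000, 65000, 65000, 75000],
})

m = 6  # period length in months; assumed to match the resample frequency

# cumcount() numbers the rows 0, 1, 2, ... within each group, in row order,
# so multiplying by m turns the row number into elapsed months.
periods["Months since started"] = periods.groupby("Employee").cumcount() * m

print(periods["Months since started"].tolist())  # [0, 0, 6, 12]
```

This only works if no period is missing for an employee, since cumcount counts rows rather than measuring dates.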

Hope it helped!

You can use the groupby and merge functionality of DataFrames:

>>> import pandas as pd
>>> df = pd.DataFrame([['PersonC','5/15/2010',75000],['PersonC','7/3/2011',100000],['PersonB','3/5/2014',65000],['PersonB','3/1/2015',75000],['PersonB','3/1/2016',100000]],columns=['Employee','Date','Salary'])
>>> df['Date']= pd.to_datetime(df['Date'])
>>> df
  Employee       Date  Salary
0  PersonC 2010-05-15   75000
1  PersonC 2011-07-03  100000
2  PersonB 2014-03-05   65000
3  PersonB 2015-03-01   75000
4  PersonB 2016-03-01  100000
>>> # Earliest date per employee, as a two-column frame (Employee, Start Date)
>>> start_date = df.groupby('Employee')['Date'].min().rename('Start Date').reset_index()
>>> df = df.merge(start_date, how='left', on='Employee')
>>> # Days since start, then whole 30-day "months" floored to a multiple of 6
>>> df['Months From Start'] = (df['Date'] - df['Start Date']).dt.days
>>> df['Months From Start'] = df['Months From Start'].apply(lambda x: (x // 30) - (x // 30) % 6)
>>> df
  Employee       Date  Salary Start Date  Months From Start
0  PersonC 2010-05-15   75000 2010-05-15                  0
1  PersonC 2011-07-03  100000 2010-05-15                 12
2  PersonB 2014-03-05   65000 2014-03-05                  0
3  PersonB 2015-03-01   75000 2014-03-05                 12
4  PersonB 2016-03-01  100000 2014-03-05                 24

Here you can replace 6 with a variable called m and assign any month increment to it.
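As a rough sketch of that parameterisation, the steps can be wrapped in a function that takes m as an argument. The function name `add_months_from_start` is made up for illustration, and the 30-day month is the same approximation used above.

```python
import pandas as pd

def add_months_from_start(df, m):
    """Return a copy of df with 'Months From Start' floored to a multiple of m.

    Assumes columns 'Employee' and 'Date' (datetime64) and a 30-day month.
    """
    out = df.copy()
    # Each row gets its own employee's earliest date, without a merge.
    start = out.groupby("Employee")["Date"].transform("min")
    approx_months = (out["Date"] - start).dt.days // 30
    out["Months From Start"] = approx_months - approx_months % m
    return out

df = pd.DataFrame({
    "Employee": ["PersonB", "PersonB", "PersonB"],
    "Date": pd.to_datetime(["3/5/2014", "3/1/2015", "3/1/2016"]),
    "Salary": [65000, 75000, 100000],
})

print(add_months_from_start(df, 6)["Months From Start"].tolist())   # [0, 12, 24]
print(add_months_from_start(df, 12)["Months From Start"].tolist())  # [0, 12, 24]
```

Note that this labels only the rows that already exist; it does not create rows for increments in which no salary change happened.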

OK, so for the first part of the answer I would do something like this...

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Employee': ['PersonA', 'PersonB', 'PersonB', 'PersonB', 'PersonC', 'PersonC', 'PersonC', 'PersonC', 'PersonC', 'PersonC'], 
    'Date': ['1/1/2016', '3/5/2014', '3/1/2015', '3/1/2016', '5/15/2010', '6/3/2011', '3/10/2012', '9/5/2012', '3/1/2013', '3/1/2014'], 
    'Salary': [50000 , 65000 , 75000 , 100000 , 75000 , 100000 , 110000 , 130000 , 150000 , 200000]
})

df.Date = pd.to_datetime(df.Date)

m = 6
emp_groups = df.groupby('Employee')
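# Time elapsed since each employee's first date
df['months_from_start'] = df.Date - emp_groups.Date.transform('min')
# Whole 30-day "months", floored to a multiple of m
df['months_from_start'] = df.months_from_start.dt.days // 30 // m * m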

m can be whatever you want it to be. I am calculating the number of days since each employee's min date, dividing by the approximate number of days in a month, and then using integer division to round down to the window size you want.
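The arithmetic can be checked by hand for a single row. The sketch below uses PersonC's start date and the 3/10/2012 salary date from the sample data; the variable names are only for illustration.

```python
from datetime import date

m = 6
start = date(2010, 5, 15)    # PersonC's first salary date
change = date(2012, 3, 10)   # one of PersonC's later salary dates

days = (change - start).days      # 665 days between the two dates
approx_months = days // 30        # 22 whole 30-day "months"
bucket = approx_months // m * m   # 18, the largest multiple of m not above 22

print(days, approx_months, bucket)  # 665 22 18
```

Because a month is treated as exactly 30 days, the count slowly drifts ahead of calendar months over long periods.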

This will give you something like this...

        Date Employee  Salary  months_from_start
0 2016-01-01  PersonA   50000                  0
1 2014-03-05  PersonB   65000                  0
2 2015-03-01  PersonB   75000                 12
3 2016-03-01  PersonB  100000                 24
4 2010-05-15  PersonC   75000                  0
5 2011-06-03  PersonC  100000                 12
6 2012-03-10  PersonC  110000                 18
7 2012-09-05  PersonC  130000                 24
8 2013-03-01  PersonC  150000                 30
9 2014-03-01  PersonC  200000                 42

The second part is a little tricky. I would create a new df and concat it to the first...

# Regroup so the new months_from_start column is included
emp_groups = df.groupby('Employee')
last_date_df = emp_groups.last()
# Note: this measures up to the last salary date, not up to today
last_date_df['months_from_start'] = (last_date_df.Date - emp_groups.first().Date).dt.days / 30
last_date_df.reset_index(inplace=True)

pd.concat([df, last_date_df], axis=0)

getting you...

        Date Employee  Salary  months_from_start
0 2016-01-01  PersonA   50000           0.000000
1 2014-03-05  PersonB   65000           0.000000
2 2015-03-01  PersonB   75000          12.000000
3 2016-03-01  PersonB  100000          24.000000
4 2010-05-15  PersonC   75000           0.000000
5 2011-06-03  PersonC  100000          12.000000
6 2012-03-10  PersonC  110000          18.000000
7 2012-09-05  PersonC  130000          24.000000
8 2013-03-01  PersonC  150000          30.000000
9 2014-03-01  PersonC  200000          42.000000
0 2016-01-01  PersonA   50000           0.000000
1 2016-03-01  PersonB  100000          24.233333
2 2014-03-01  PersonC  200000          46.200000

Many thanks for the answers provided. Unfortunately, all of them are a little 'off' and didn't quite achieve the goal. I ended up nesting two for loops within list comprehensions to achieve it.
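The loop-based code itself was not posted. As a rough sketch of what such a solution could look like (this is not the asker's actual code), the function below walks each employee's timeline in steps of m calendar months and looks up the salary in effect at each step. The function name `salary_by_months_from_start`, the `today` argument, the example date 2016-12-01, and the 30.4375-day average month used for the final fractional row are all assumptions made for illustration. It also assumes every employee started on or before `today`.

```python
import pandas as pd

def salary_by_months_from_start(df, m, today):
    """Salary at 0, m, 2m, ... calendar months after each employee's start,
    plus one final row for `today`.

    Assumes columns 'Employee', 'Date' (datetime64) and 'Salary'.
    """
    today = pd.Timestamp(today)
    rows = []
    for employee, grp in df.sort_values("Date").groupby("Employee"):
        start = grp["Date"].iloc[0]

        def salary_on(day):
            # Latest salary that became effective on or before `day`
            return grp.loc[grp["Date"] <= day, "Salary"].iloc[-1]

        k = 0
        checkpoint = start
        while checkpoint <= today:
            rows.append((employee, k, salary_on(checkpoint)))
            k += m
            checkpoint = start + pd.DateOffset(months=k)

        # Final row for today; 30.4375 days is an assumed average month length
        months_today = round((today - start).days / 30.4375, 2)
        rows.append((employee, months_today, salary_on(today)))
    return pd.DataFrame(rows, columns=["Employee", "Months From Start", "Salary"])

df = pd.DataFrame({
    "Employee": ["PersonB", "PersonB", "PersonB"],
    "Date": pd.to_datetime(["3/5/2014", "3/1/2015", "3/1/2016"]),
    "Salary": [65000, 75000, 100000],
})

result = salary_by_months_from_start(df, m=6, today="2016-12-01")
print(result)
```

With calendar months, PersonB's 12-month row shows 75000 rather than the 65000 listed in the question's tables, because the raise took effect on 3/1/2015, four days before the 12-month mark on 3/5/2015. If `today` happens to fall exactly on a checkpoint, the final row duplicates it, so a real implementation might want to drop one of the two.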
