简体   繁体   中英

Pandas dataframe applying logic to columns calculations

Hi I have a huge dataframe with the following structure:

    ticker  calendar-date     last-update   Assets    Ebitda  .....
0   a       2001-06-30        2001-09-14    110       1000    .....
1   a       2001-09-30        2002-01-22    0         -8      .....
2   a       2001-09-30        2002-02-01    0         800     .....
3   a       2001-12-30        2002-03-06    120       0       .....
4   b       2001-06-30        2001-09-18    110       0       .....
5   b       2001-06-30        2001-09-27    110       30      .....
6   b       2001-09-30        2002-01-08    140       35      .....
7   b       2001-12-30        2002-03-08    120       40      .....
..

What I want is for each ticker: create new columns with % change in Assets and Ebitda from last calendar-date (t-1) and last calendar-date(t-2) for each row.

But here comes the problems:

1) As you can see calendar-date (by ticker) are not always uniques values since there can be more last-update for the same calendar-date but I always want the change since last calendar-date and not from last last-update.

2)there are rows with 0 values in that case I want to use the last observed value to calculate the %change. If I only had one stock that would be easy, I would just ffill the values, but since I have many tickers I cannot perform this operation safely since I could pad the value from ticker 'a' to ticker 'b' and that is not what I want

I guess this could be solved creating a function with if statements to handle data exceptions or maybe there is a good way to handle this inside pandas... maybe multi indexing?? the truth is that I have no idea on how to approach this task, anybody can help?

Thanks

Step 1
sort_values to ensure proper ordering for later manipulation

icols = ['ticker', 'calendar-date', 'last-update']
df.sort_values(icols, inplace=True)

Step 2
groupby 'ticker' and replace zeros and forward fill

vcols = ['Assets', 'Ebitda']
temp = df.groupby('ticker')[vcols].apply(lambda x: x.replace(0, np.nan).ffill())
d1 = df.assign(**temp.to_dict('list'))
d1

  ticker calendar-date last-update  Assets  Ebitda
0      a    2001-06-30  2001-09-14   110.0  1000.0
1      a    2001-09-30  2002-01-22   110.0    -8.0
2      a    2001-09-30  2002-02-01   110.0   800.0
3      a    2001-12-30  2002-03-06   120.0   800.0
4      b    2001-06-30  2001-09-18   110.0     NaN
5      b    2001-06-30  2001-09-27   110.0    30.0
6      b    2001-09-30  2002-01-08   140.0    35.0
7      b    2001-12-30  2002-03-08   120.0    40.0

NOTE: The first 'Ebitda' for 'b' is NaN because there was nothing to forward fill from.

Step 3
groupby ['ticker', 'calendar-date'] and grab the last column. Because we sorted above, the last row will be the most recently updated row.

d2 = d1.groupby(icols[:2])[vcols].last()

Step 4
groupby again, this time just by 'ticker' which is in the index of d2 , and take the pct_change

d3 = d2.groupby(level='ticker').pct_change()

Step 5
join back with df

df.join(d3, on=icols[:2], rsuffix='_pct')

  ticker calendar-date last-update  Assets  Ebitda  Assets_pct  Ebitda_pct
0      a    2001-06-30  2001-09-14     110    1000         NaN         NaN
1      a    2001-09-30  2002-01-22       0      -8    0.000000   -0.200000
2      a    2001-09-30  2002-02-01       0     800    0.000000   -0.200000
3      a    2001-12-30  2002-03-06     120       0    0.090909    0.000000
4      b    2001-06-30  2001-09-18     110       0         NaN         NaN
5      b    2001-06-30  2001-09-27     110      30         NaN         NaN
6      b    2001-09-30  2002-01-08     140      35    0.272727    0.166667
7      b    2001-12-30  2002-03-08     120      40   -0.142857    0.142857

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM