
Pandas: assigning columns with multiple conditions and date thresholds


I have a financial portfolio in a pandas dataframe df, where the index is the date and there are multiple stocks per date.

Example dataframe:

Date    Stock   Weight  Percentile  Final weight
1/1/2000    Apple   0.010   0.75    0.010
1/1/2000    IBM    0.011    0.4     0
1/1/2000    Google  0.012   0.45    0
1/1/2000    Nokia   0.022   0.81    0.022
2/1/2000    Apple   0.014   0.56    0
2/1/2000    Google  0.015   0.45    0
2/1/2000    Nokia   0.016   0.55    0
3/1/2000    Apple   0.020   0.52    0
3/1/2000    Google  0.030   0.51    0
3/1/2000    Nokia   0.040   0.47    0

I created Final_weight by assigning the value of Weight whenever Percentile is greater than 0.7.

Now I want this to be a bit more sophisticated. I still want Weight to be assigned to Final_weight when Percentile is > 0.7. However, after that date (at any point in the future), rather than dropping to 0 when a stock's Percentile is no longer > 0.7, the stock should still get a weight as long as its Percentile stays above 0.5 (i.e. the position is held for longer than just one day).

Then, if the stock's Percentile falls below 0.5, Final_weight becomes 0.

Example of the modified dataframe from above:

Date    Stock   Weight  Percentile  Final weight
1/1/2000    Apple   0.010   0.75    0.010
1/1/2000    IBM     0.011   0.4     0
1/1/2000    Google  0.012   0.45    0
1/1/2000    Nokia   0.022   0.81    0.022
2/1/2000    Apple   0.014   0.56    0.014
2/1/2000    Google  0.015   0.45    0
2/1/2000    Nokia   0.016   0.55    0.016
3/1/2000    Apple   0.020   0.52    0.020
3/1/2000    Google  0.030   0.51    0
3/1/2000    Nokia   0.040   0.47    0

The portfolios differ from day to day and do not always contain the same stocks as the day before.

This solution is more explicit and less pandas-esque, but it makes only a single pass through the rows without creating lots of temporary columns, and may therefore be faster. It needs an additional piece of state (the current weight per stock), which I wrapped in a closure to avoid writing a class.

def closure():
    cur_weight = {}  # per-stock state: the last assigned final weight
    def func(x):
        if x["Percentile"] > 0.7:
            next_weight = x["Weight"]  # enter (or refresh) the position
        elif x["Percentile"] < 0.5:
            next_weight = 0  # exit the position
        else:
            # between 0.5 and 0.7: keep the weight only if the stock is already held
            next_weight = x["Weight"] if cur_weight.get(x["Stock"], 0) > 0 else 0
        cur_weight[x["Stock"]] = next_weight
        return next_weight
    return func

df["FinalWeight"] = df.apply(closure(), axis=1)
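As a quick sanity check, the closure can be exercised on a few sample rows (the closure is repeated here so the snippet is self-contained; the state dict is updated in row order, so the frame must be sorted by date):

```python
import pandas as pd

def closure():
    cur_weight = {}  # per-stock state: the last assigned final weight
    def func(x):
        if x["Percentile"] > 0.7:
            next_weight = x["Weight"]  # enter (or refresh) the position
        elif x["Percentile"] < 0.5:
            next_weight = 0  # exit the position
        else:
            # between 0.5 and 0.7: keep the weight only if already held
            next_weight = x["Weight"] if cur_weight.get(x["Stock"], 0) > 0 else 0
        cur_weight[x["Stock"]] = next_weight
        return next_weight
    return func

# a few rows from the question's data, already sorted by date
df = pd.DataFrame({
    "Stock": ["Apple", "IBM", "Apple", "Apple"],
    "Weight": [0.010, 0.011, 0.014, 0.020],
    "Percentile": [0.75, 0.40, 0.56, 0.45],
})
df["FinalWeight"] = df.apply(closure(), axis=1)
# Apple: bought at 0.75, held at 0.56, sold at 0.45
print(df["FinalWeight"].tolist())  # [0.01, 0.0, 0.014, 0.0]
```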
  • I'd first put 'Stock' into the index
  • Then unstack to put them into the columns
  • I'd then split w for weights and p for percentiles
  • Then manipulate with a series of where

d1 = df.set_index('Stock', append=True)

d2 = d1.unstack()

w, p = d2.Weight, d2.Percentile

d1.join(w.where(p > .7, w.where((p.shift() > .7) & (p > .5), 0)).stack().rename('Final Weight'))

                   Weight  Percentile  Final Weight
Date       Stock                                   
2000-01-01 Apple    0.010        0.75         0.010
           IBM      0.011        0.40         0.000
           Google   0.012        0.45         0.000
           Nokia    0.022        0.81         0.022
2000-02-01 Apple    0.014        0.56         0.014
           Google   0.015        0.45         0.000
           Nokia    0.016        0.55         0.016

Here is one method that avoids loops and limited lookback periods.

Using your example:

import pandas as pd
import numpy as np


>>>df = pd.DataFrame([['1/1/2000',    'Apple',   0.010,   0.75],
                      ['1/1/2000',    'IBM',     0.011,    0.4],
                      ['1/1/2000',    'Google',  0.012,   0.45],
                      ['1/1/2000',    'Nokia',   0.022,   0.81],
                      ['2/1/2000',    'Apple',   0.014,   0.56],
                      ['2/1/2000',    'Google',  0.015,   0.45],
                      ['2/1/2000',    'Nokia',   0.016,   0.55],
                      ['3/1/2000',    'Apple',   0.020,   0.52],
                      ['3/1/2000',    'Google',  0.030,   0.51],
                      ['3/1/2000',    'Nokia',   0.040,   0.47]],
                     columns=['Date', 'Stock', 'Weight', 'Percentile'])

First, identify when stocks would start or stop being tracked in final weight:

>>>df['bought'] = np.where(df['Percentile'] >= 0.7, 1, np.nan)
>>>df['bought or sold'] = np.where(df['Percentile'] < 0.5, 0, df['bought'])

With '1' indicating a stock to buy, and '0' one to sell, if owned.

From this, you can identify whether the stock is owned. Note that this requires the dataframe to already be sorted chronologically if you use it on a dataframe without a date index:

>>>df['own'] = df.groupby('Stock')['bought or sold'].ffill().fillna(0)

'ffill' is forward fill, propagating ownership status forward from the buy and sell dates. .fillna(0) catches any stocks that remained between .5 and .7 for the entirety of the dataframe. Then calculate Final Weight:

>>>df['Final Weight'] = df['own']*df['Weight']

Multiplication, with df['own'] being 1 or 0, is a little faster than another np.where and gives the same result.
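As a quick check that the two forms agree (toy arrays, not the answer's data):

```python
import numpy as np

own = np.array([1.0, 0.0, 1.0, 0.0])          # ownership indicator (1 = owned)
weight = np.array([0.010, 0.015, 0.016, 0.040])

# multiplying by a 0/1 indicator zeroes out the unowned rows,
# exactly like selecting with np.where
a = own * weight
b = np.where(own == 1, weight, 0)
print(np.array_equal(a, b))  # True
```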

Edit:

Since speed is a concern, doing everything in one column, as suggested by @cronos, does provide a speed boost: around a 37% improvement at 20 rows in my tests, and 18% at 2,000,000 rows. I could imagine the latter growing larger if storing the intermediate columns crossed some memory-usage threshold, or with other system specifics I didn't encounter.

This would look like:

>>>df['Final Weight'] = np.where(df['Percentile'] >= 0.7, 1, np.nan)
>>>df['Final Weight'] = np.where(df['Percentile'] < 0.5, 0, df['Final Weight'])
>>>df['Final Weight'] = df.groupby('Stock')['Final Weight'].ffill().fillna(0)
>>>df['Final Weight'] = df['Final Weight']*df['Weight']

Either using this method or deleting the intermediate fields gives this result:

>>>df 
       Date   Stock  Weight  Percentile  Final Weight
0  1/1/2000   Apple   0.010        0.75         0.010
1  1/1/2000     IBM   0.011        0.40         0.000
2  1/1/2000  Google   0.012        0.45         0.000
3  1/1/2000   Nokia   0.022        0.81         0.022
4  2/1/2000   Apple   0.014        0.56         0.014
5  2/1/2000  Google   0.015        0.45         0.000
6  2/1/2000   Nokia   0.016        0.55         0.016
7  3/1/2000   Apple   0.020        0.52         0.020
8  3/1/2000  Google   0.030        0.51         0.000
9  3/1/2000   Nokia   0.040        0.47         0.000

For further improvement, I'd look at adding a way to set an initial condition in which stocks are already owned, followed by breaking the dataframe down into smaller timeframes. This could be done by adding an initial condition for the time period covered by one of these smaller dataframes, then changing

>>>df['Final Weight'] = np.where(df['Percentile'] >= 0.7, 1, np.nan)

to something like

>>>df['Final Weight'] = np.where((df['Percentile'] >= 0.7) | (df['Final Weight'] != 0), 1, np.nan)

to allow that to be recognized and propagate.
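A sketch of that chunked idea, under the assumption that each block's final ownership state is carried into the next block as its initial condition (the `final_weights_chunked` helper and its `carried` dict are hypothetical, not part of the answer above):

```python
import numpy as np
import pandas as pd

def final_weights_chunked(df, chunk_size=2):
    """Hypothetical helper: process consecutive blocks of dates,
    carrying each stock's ownership state into the next block."""
    carried = {}   # stock -> 0.0/1.0 ownership at the end of the previous block
    pieces = []
    dates = df['Date'].unique()
    for i in range(0, len(dates), chunk_size):
        chunk = df[df['Date'].isin(dates[i:i + chunk_size])].copy()
        own = np.where(chunk['Percentile'] >= 0.7, 1.0, np.nan)   # buy signal
        own = np.where(chunk['Percentile'] < 0.5, 0.0, own)       # sell signal
        chunk['own'] = own
        chunk['own'] = chunk.groupby('Stock')['own'].ffill()
        # rows with no buy/sell signal yet in this block inherit the
        # ownership state carried over from earlier blocks
        chunk['own'] = chunk['own'].fillna(chunk['Stock'].map(carried)).fillna(0.0)
        carried.update(chunk.groupby('Stock')['own'].last().to_dict())
        pieces.append(chunk)
    out = pd.concat(pieces)
    out['Final Weight'] = out['own'] * out['Weight']
    return out

sample = pd.DataFrame(
    [['1/1/2000', 'Apple', 0.010, 0.75], ['1/1/2000', 'IBM', 0.011, 0.40],
     ['1/1/2000', 'Google', 0.012, 0.45], ['1/1/2000', 'Nokia', 0.022, 0.81],
     ['2/1/2000', 'Apple', 0.014, 0.56], ['2/1/2000', 'Google', 0.015, 0.45],
     ['2/1/2000', 'Nokia', 0.016, 0.55], ['3/1/2000', 'Apple', 0.020, 0.52],
     ['3/1/2000', 'Google', 0.030, 0.51], ['3/1/2000', 'Nokia', 0.040, 0.47]],
    columns=['Date', 'Stock', 'Weight', 'Percentile'])

res = final_weights_chunked(sample, chunk_size=2)
print(res['Final Weight'].tolist())
# [0.01, 0.0, 0.0, 0.022, 0.014, 0.0, 0.016, 0.02, 0.0, 0.0]
```

Because the carried state seeds each block, the result is the same regardless of the chunk size, which is what makes processing very large frames in pieces viable.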

Setup

Dataframe:

             Stock  Weight  Percentile  Finalweight
Date                                               
2000-01-01   Apple   0.010        0.75            0
2000-01-01     IBM   0.011        0.40            0
2000-01-01  Google   0.012        0.45            0
2000-01-01   Nokia   0.022        0.81            0
2000-02-01   Apple   0.014        0.56            0
2000-02-01  Google   0.015        0.45            0
2000-02-01   Nokia   0.016        0.55            0
2000-03-01   Apple   0.020        0.52            0
2000-03-01  Google   0.030        0.51            0
2000-03-01   Nokia   0.040        0.57            0

Solution

df = df.reset_index()
#find the historical max percentile for a Stock, up to and including the current row
#(restricting to labels <= x.name avoids looking ahead at future rows)
df['max_percentile'] = df.apply(lambda x: df[df.Stock==x.Stock].loc[:x.name].Percentile.max(), axis=1)
#set weight according to max_percentile and the current percentile
df['Finalweight'] = df.apply(lambda x: x.Weight if (x.Percentile>0.7) or (x.Percentile>0.5 and x.max_percentile>0.7) else 0, axis=1)

Out[1041]: 
        Date   Stock  Weight  Percentile  Finalweight  max_percentile
0 2000-01-01   Apple   0.010        0.75        0.010            0.75
1 2000-01-01     IBM   0.011        0.40        0.000            0.40
2 2000-01-01  Google   0.012        0.45        0.000            0.45
3 2000-01-01   Nokia   0.022        0.81        0.022            0.81
4 2000-02-01   Apple   0.014        0.56        0.014            0.75
5 2000-02-01  Google   0.015        0.45        0.000            0.45
6 2000-02-01   Nokia   0.016        0.55        0.016            0.81
7 2000-03-01   Apple   0.020        0.52        0.020            0.75
8 2000-03-01  Google   0.030        0.51        0.000            0.51
9 2000-03-01   Nokia   0.040        0.57        0.040            0.81

Note

In the last row of your example data, Nokia's Percentile is 0.57, while in your results it becomes 0.47. I used 0.57 in this example, so the output differs from yours in the last row.

I think you may want to use the pandas.Series rolling window method.

Perhaps something like this:

import numpy as np
import pandas as pd

grouped = df.groupby('Stock')

df['MaxPercentileToDate'] = np.nan
df.index = df['Date']

for name, group in grouped:
    df.loc[df.Stock==name, 'MaxPercentileToDate'] = group['Percentile'].rolling(min_periods=0, window=4).max()

# Mask selects rows whose percentile has ever exceeded 0.7 (including the current row in the max)
# and is currently greater than 0.5
mask = ((df['MaxPercentileToDate'] > 0.7) & (df['Percentile'] > 0.5))
df.loc[mask, 'Finalweight'] = df.loc[mask, 'Weight']

I believe this assumes the values are sorted by date (which your initial dataset seems to be), and you would also have to adjust the window parameter to be at least the maximum number of entries per stock.
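If picking a window size is awkward, a per-stock running maximum avoids the limited lookback entirely. This uses groupby cummax as a swapped-in alternative to the rolling window above (same caveat as the max-percentile approach: a stock that once exceeded 0.7 is treated as owned again whenever its percentile is back above 0.5, even after a sell):

```python
import numpy as np
import pandas as pd

# small illustrative frame, sorted by date within each stock
df = pd.DataFrame({
    'Stock': ['Apple', 'Apple', 'Apple', 'Google'],
    'Weight': [0.010, 0.014, 0.020, 0.030],
    'Percentile': [0.75, 0.56, 0.45, 0.51],
})

# running max percentile per stock, with no window limit
df['MaxPercentileToDate'] = df.groupby('Stock')['Percentile'].cummax()
mask = (df['MaxPercentileToDate'] > 0.7) & (df['Percentile'] > 0.5)
df['Finalweight'] = np.where(mask, df['Weight'], 0)
print(df['Finalweight'].tolist())  # [0.01, 0.014, 0.0, 0.0]
```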
