I have a financial portfolio in a pandas dataframe df, where the index is the date and I have multiple financial stocks per date.
Eg dataframe:
Date      Stock   Weight  Percentile  Final weight
1/1/2000  Apple   0.010   0.75        0.010
1/1/2000  IBM     0.011   0.40        0
1/1/2000  Google  0.012   0.45        0
1/1/2000  Nokia   0.022   0.81        0.022
2/1/2000  Apple   0.014   0.56        0
2/1/2000  Google  0.015   0.45        0
2/1/2000  Nokia   0.016   0.55        0
3/1/2000  Apple   0.020   0.52        0
3/1/2000  Google  0.030   0.51        0
3/1/2000  Nokia   0.040   0.47        0
I created Final_weight by assigning the value of Weight whenever Percentile is greater than 0.7.

Now I want this to be a bit more sophisticated. I still want Weight to be assigned to Final_weight when Percentile is > 0.7. However, after that date (at any point in the future), rather than becoming 0 when a stock's Percentile is no longer > 0.7, the stock should keep its weight as long as its Percentile stays above 0.5 (i.e. the position is held for longer than just one day). Then, if the stock's Percentile drops below 0.5, Final_weight becomes 0.
Eg modified dataframe from above:
Date      Stock   Weight  Percentile  Final weight
1/1/2000  Apple   0.010   0.75        0.010
1/1/2000  IBM     0.011   0.40        0
1/1/2000  Google  0.012   0.45        0
1/1/2000  Nokia   0.022   0.81        0.022
2/1/2000  Apple   0.014   0.56        0.014
2/1/2000  Google  0.015   0.45        0
2/1/2000  Nokia   0.016   0.55        0.016
3/1/2000  Apple   0.020   0.52        0.020
3/1/2000  Google  0.030   0.51        0
3/1/2000  Nokia   0.040   0.47        0
The portfolios differ from day to day, so a given date does not always contain the same stocks as the day before.
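In other words, the rule I'm after, per stock and per day, is the following (a minimal sketch; the function name and parameters are just for illustration):

```python
def next_weight(percentile, weight, currently_held):
    """Buy/hold/sell rule for a single stock on a single day (sketch)."""
    if percentile > 0.7:        # buy (or keep) at full weight
        return weight
    if percentile > 0.5 and currently_held:  # hold through the 0.5-0.7 band
        return weight
    return 0                    # otherwise the position is closed
```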
This solution is more explicit and less pandas-esque, but it involves only a single pass through all rows without creating tons of temporary columns, and is therefore possibly faster. It needs an additional state variable, which I wrapped in a closure to avoid having to write a class.
def closure():
    cur_weight = {}

    def func(x):
        if x["Percentile"] > 0.7:
            next_weight = x["Weight"]
        elif x["Percentile"] < 0.5:
            next_weight = 0
        else:
            next_weight = x["Weight"] if cur_weight.get(x["Stock"], 0) > 0 else 0
        cur_weight[x["Stock"]] = next_weight
        return next_weight

    return func

df["FinalWeight"] = df.apply(closure(), axis=1)
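For reference, here is a self-contained run of the closure on the question's data (the DataFrame construction is my own scaffolding):

```python
import pandas as pd

df = pd.DataFrame(
    [['1/1/2000', 'Apple', 0.010, 0.75], ['1/1/2000', 'IBM', 0.011, 0.40],
     ['1/1/2000', 'Google', 0.012, 0.45], ['1/1/2000', 'Nokia', 0.022, 0.81],
     ['2/1/2000', 'Apple', 0.014, 0.56], ['2/1/2000', 'Google', 0.015, 0.45],
     ['2/1/2000', 'Nokia', 0.016, 0.55], ['3/1/2000', 'Apple', 0.020, 0.52],
     ['3/1/2000', 'Google', 0.030, 0.51], ['3/1/2000', 'Nokia', 0.040, 0.47]],
    columns=['Date', 'Stock', 'Weight', 'Percentile'])

def closure():
    cur_weight = {}  # last final weight per stock: the closure's state

    def func(x):
        if x["Percentile"] > 0.7:
            next_weight = x["Weight"]
        elif x["Percentile"] < 0.5:
            next_weight = 0
        else:  # 0.5 <= Percentile <= 0.7: hold only if already held
            next_weight = x["Weight"] if cur_weight.get(x["Stock"], 0) > 0 else 0
        cur_weight[x["Stock"]] = next_weight
        return next_weight

    return func

df["FinalWeight"] = df.apply(closure(), axis=1)
```

This reproduces the desired output column from the question.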
Move 'Stock' into the index, then unstack to put the stocks into the columns, with w for the weights and p for the percentiles, where
d1 = df.set_index('Stock', append=True)
d2 = d1.unstack()
w, p = d2.Weight, d2.Percentile
Then join the recomputed final weights back:
d1.join(w.where(p > .7, w.where((p.shift() > .7) & (p > .5), 0)).stack().rename('Final Weight'))
Note that p.shift() looks back only one period, so a position survives just one day past the last time its percentile exceeded 0.7.
                   Weight  Percentile  Final Weight
Date       Stock
2000-01-01 Apple    0.010        0.75         0.010
           IBM      0.011        0.40         0.000
           Google   0.012        0.45         0.000
           Nokia    0.022        0.81         0.022
2000-02-01 Apple    0.014        0.56         0.014
           Google   0.015        0.45         0.000
           Nokia    0.016        0.55         0.016
Here is one method that avoids loops and limited lookback periods.
Using your example:
import pandas as pd
import numpy as np
>>>df = pd.DataFrame([['1/1/2000', 'Apple', 0.010, 0.75],
['1/1/2000', 'IBM', 0.011, 0.4],
['1/1/2000', 'Google', 0.012, 0.45],
['1/1/2000', 'Nokia', 0.022, 0.81],
['2/1/2000', 'Apple', 0.014, 0.56],
['2/1/2000', 'Google', 0.015, 0.45],
['2/1/2000', 'Nokia', 0.016, 0.55],
['3/1/2000', 'Apple', 0.020, 0.52],
['3/1/2000', 'Google', 0.030, 0.51],
['3/1/2000', 'Nokia', 0.040, 0.47]],
columns=['Date', 'Stock', 'Weight', 'Percentile'])
First, identify when stocks would start or stop being tracked in final weight:
>>>df['bought'] = np.where(df['Percentile'] >= 0.7, 1, np.nan)
>>>df['bought or sold'] = np.where(df['Percentile'] < 0.5, 0, df['bought'])
With '1' indicating a stock to buy, and '0' one to sell, if owned.
From this, you can identify whether the stock is owned. Note that this requires the dataframe already be sorted chronologically, if at any point you use it on a dataframe without a date index:
>>>df['own'] = df.groupby('Stock')['bought or sold'].fillna(method='ffill').fillna(0)
'ffill' is forward fill, which propagates ownership status forward from buy and sell dates. The trailing .fillna(0) catches any stocks that remained between 0.5 and 0.7 for the entirety of the dataframe. Then calculate Final Weight:
>>>df['Final Weight'] = df['own']*df['Weight']
Multiplication, with df['own'] acting as either the identity or zero, is a little faster than another np.where and gives the same result.
Edit:
Since speed is a concern, doing everything in one column, as suggested by @cronos, does provide a speed boost, coming in around a 37% improvement at 20 rows in my tests, or 18% at 2,000,000. I could imagine the latter being larger if storing the intermediate columns crossed some memory-usage threshold, or under other system specifics I didn't encounter.
This would look like:
>>>df['Final Weight'] = np.where(df['Percentile'] >= 0.7, 1, np.nan)
>>>df['Final Weight'] = np.where(df['Percentile'] < 0.5, 0, df['Final Weight'])
>>>df['Final Weight'] = df.groupby('Stock')['Final Weight'].fillna(method='ffill').fillna(0)
>>>df['Final Weight'] = df['Final Weight']*df['Weight']
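Put end to end on the question's data, the one-column version looks like this (a sketch; I've used the newer .ffill() spelling, since fillna(method=...) is deprecated in recent pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['1/1/2000', 'Apple', 0.010, 0.75], ['1/1/2000', 'IBM', 0.011, 0.40],
     ['1/1/2000', 'Google', 0.012, 0.45], ['1/1/2000', 'Nokia', 0.022, 0.81],
     ['2/1/2000', 'Apple', 0.014, 0.56], ['2/1/2000', 'Google', 0.015, 0.45],
     ['2/1/2000', 'Nokia', 0.016, 0.55], ['3/1/2000', 'Apple', 0.020, 0.52],
     ['3/1/2000', 'Google', 0.030, 0.51], ['3/1/2000', 'Nokia', 0.040, 0.47]],
    columns=['Date', 'Stock', 'Weight', 'Percentile'])

# 1 = buy signal, 0 = sell signal, NaN = no change in ownership
df['Final Weight'] = np.where(df['Percentile'] >= 0.7, 1, np.nan)
df['Final Weight'] = np.where(df['Percentile'] < 0.5, 0, df['Final Weight'])
# propagate ownership forward per stock; .ffill() replaces fillna(method='ffill')
df['Final Weight'] = df.groupby('Stock')['Final Weight'].ffill().fillna(0)
df['Final Weight'] = df['Final Weight'] * df['Weight']
```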
Either using this method or deleting the intermediate fields would give result:
>>>df
       Date   Stock  Weight  Percentile  Final Weight
0  1/1/2000   Apple   0.010        0.75         0.010
1  1/1/2000     IBM   0.011        0.40         0.000
2  1/1/2000  Google   0.012        0.45         0.000
3  1/1/2000   Nokia   0.022        0.81         0.022
4  2/1/2000   Apple   0.014        0.56         0.014
5  2/1/2000  Google   0.015        0.45         0.000
6  2/1/2000   Nokia   0.016        0.55         0.016
7  3/1/2000   Apple   0.020        0.52         0.020
8  3/1/2000  Google   0.030        0.51         0.000
9  3/1/2000   Nokia   0.040        0.47         0.000
For further improvement, I'd look at adding a way to set an initial condition that has stocks being owned, followed by breaking the dataframe down to look at smaller timeframes. This could be done by adding an initial condition for the time period covered by one of these smaller dataframes, then changing
>>>df['Final Weight'] = np.where(df['Percentile'] >= 0.7, 1, np.nan)
to something like
>>>df['Final Weight'] = np.where((df['Percentile'] >= 0.7) | (df['Final Weight'] != 0), 1, np.nan)
to allow that to be recognized and propagate.
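One way to sketch that chunked variant (my own scaffolding, not from the answer above: the prior_own carry-over dict and the seeding of each stock's first row are hypothetical):

```python
import numpy as np
import pandas as pd

def final_weights(chunk, prior_own=None):
    """Compute Final Weight for one chronological chunk of the portfolio.

    prior_own is a hypothetical carry-over dict {stock: 1.0 or 0.0}
    returned by the previous chunk's call; None for the first chunk.
    """
    chunk = chunk.copy()
    p = chunk['Percentile']
    own = pd.Series(np.nan, index=chunk.index)
    own[p >= 0.7] = 1.0   # buy signal
    own[p < 0.5] = 0.0    # sell signal
    if prior_own:
        # Seed each stock's first row from the previous chunk's ownership,
        # so mid-band stocks (0.5 <= p < 0.7) stay held across the boundary.
        first = ~chunk.duplicated('Stock')
        own = own.where(~(first & own.isna()), chunk['Stock'].map(prior_own))
    own = own.groupby(chunk['Stock']).ffill().fillna(0)
    chunk['Final Weight'] = own * chunk['Weight']
    return chunk, own.groupby(chunk['Stock']).last().to_dict()
```

Each call returns the chunk's final weights plus the ownership state to feed into the next chunk.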
Setup
Dataframe:
            Stock   Weight  Percentile  Finalweight
Date
2000-01-01  Apple    0.010        0.75            0
2000-01-01  IBM      0.011        0.40            0
2000-01-01  Google   0.012        0.45            0
2000-01-01  Nokia    0.022        0.81            0
2000-02-01  Apple    0.014        0.56            0
2000-02-01  Google   0.015        0.45            0
2000-02-01  Nokia    0.016        0.55            0
2000-03-01  Apple    0.020        0.52            0
2000-03-01  Google   0.030        0.51            0
2000-03-01  Nokia    0.040        0.57            0
Solution
df = df.reset_index()
#find historical max percentile for a Stock
df['max_percentile'] = df.apply(lambda x: df[df.Stock==x.Stock].iloc[:x.name].Percentile.max() if x.name>0 else x.Percentile, axis=1)
#set weight according to max_percentile and the current percentile
df['Finalweight'] = df.apply(lambda x: x.Weight if (x.Percentile>0.7) or (x.Percentile>0.5 and x.max_percentile>0.7) else 0, axis=1)
Out[1041]:
        Date   Stock  Weight  Percentile  Finalweight  max_percentile
0 2000-01-01   Apple   0.010        0.75        0.010            0.75
1 2000-01-01     IBM   0.011        0.40        0.000            0.40
2 2000-01-01  Google   0.012        0.45        0.000            0.45
3 2000-01-01   Nokia   0.022        0.81        0.022            0.81
4 2000-02-01   Apple   0.014        0.56        0.014            0.75
5 2000-02-01  Google   0.015        0.45        0.000            0.51
6 2000-02-01   Nokia   0.016        0.55        0.016            0.81
7 2000-03-01   Apple   0.020        0.52        0.020            0.75
8 2000-03-01  Google   0.030        0.51        0.000            0.51
9 2000-03-01   Nokia   0.040        0.57        0.040            0.81
Note
In the last row of your example data, Nokia's Percentile is 0.47, while in my Setup above it is 0.57, so my output differs from yours in the last row.
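The same "max percentile so far" logic can also be vectorized with a per-stock cumulative max, which avoids the O(n²) apply over the whole frame (a sketch using the Setup above with Nokia at 0.57; note cummax includes the current row, which doesn't change the result because rows with Percentile > 0.7 already pass the first condition):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['2000-01-01', 'Apple', 0.010, 0.75], ['2000-01-01', 'IBM', 0.011, 0.40],
     ['2000-01-01', 'Google', 0.012, 0.45], ['2000-01-01', 'Nokia', 0.022, 0.81],
     ['2000-02-01', 'Apple', 0.014, 0.56], ['2000-02-01', 'Google', 0.015, 0.45],
     ['2000-02-01', 'Nokia', 0.016, 0.55], ['2000-03-01', 'Apple', 0.020, 0.52],
     ['2000-03-01', 'Google', 0.030, 0.51], ['2000-03-01', 'Nokia', 0.040, 0.57]],
    columns=['Date', 'Stock', 'Weight', 'Percentile'])

# running max percentile per stock, in date order
m = df.groupby('Stock')['Percentile'].cummax()

df['Finalweight'] = np.where(
    (df['Percentile'] > 0.7) | ((df['Percentile'] > 0.5) & (m > 0.7)),
    df['Weight'], 0)
```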
I think you may want to use the pandas.Series rolling window method.
Perhaps something like this:
import numpy as np
import pandas as pd

grouped = df.groupby('Stock')
df['MaxPercentileToDate'] = np.nan
df.index = df['Date']

for name, group in grouped:
    df.loc[df.Stock==name, 'MaxPercentileToDate'] = group['Percentile'].rolling(min_periods=0, window=4).max()

# Mask selects rows that have ever been greater than 0.75 (including current row in max)
# and are currently greater than 0.5
mask = (df['MaxPercentileToDate'] > 0.75) & (df['Percentile'] > 0.5)
df.loc[mask, 'Finalweight'] = df.loc[mask, 'Weight']
I believe this assumes values are sorted by date (which your initial dataset seems to have), and you would also have to adjust the min_periods
parameter to be the max number of entries per stock.
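If the lookback should cover each stock's entire history, an expanding max sidesteps the window/min_periods tuning entirely. A sketch, using the question's 0.7 threshold rather than the 0.75 above (caveat: unlike the forward-fill approaches, this re-enters a stock that dips below 0.5 and later recovers above 0.5, because the historical max never resets):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['1/1/2000', 'Apple', 0.010, 0.75], ['1/1/2000', 'IBM', 0.011, 0.40],
     ['1/1/2000', 'Google', 0.012, 0.45], ['1/1/2000', 'Nokia', 0.022, 0.81],
     ['2/1/2000', 'Apple', 0.014, 0.56], ['2/1/2000', 'Google', 0.015, 0.45],
     ['2/1/2000', 'Nokia', 0.016, 0.55], ['3/1/2000', 'Apple', 0.020, 0.52],
     ['3/1/2000', 'Google', 0.030, 0.51], ['3/1/2000', 'Nokia', 0.040, 0.47]],
    columns=['Date', 'Stock', 'Weight', 'Percentile'])

# expanding max per stock: no window size or min_periods to tune
df['MaxPercentileToDate'] = df.groupby('Stock')['Percentile'].transform(
    lambda s: s.expanding(min_periods=1).max())

mask = (df['MaxPercentileToDate'] > 0.7) & (df['Percentile'] > 0.5)
df['Finalweight'] = np.where(mask, df['Weight'], 0)
```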