The title describes my situation. I already have a working version of this, but it is very inefficient when scaled to large DataFrames (>1M rows). I was wondering if anyone has a better idea of doing this.
Example with solution and code
Create a new column next_time that holds, for each row, the next value of time at which the price column is greater than the current row's price.
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
   time  price
0    15  10.00
1    30  10.01
2    45  10.00
3    60  10.01
4    75  10.02
5    90   9.99
series_to_concat = []
for price in df['price'].unique():
    # Rows at this price, and the times of all rows with a strictly greater price
    index_equal_to_price = df[df['price'] == price].index
    series_time_greater_than_price = df[df['price'] > price]['time']
    # Align both sets of rows, then back-fill the next greater-price time
    # (.bfill() replaces the deprecated fillna(method='backfill'))
    time_greater_than_price_backfilled = series_time_greater_than_price.reindex(
        index_equal_to_price.union(series_time_greater_than_price.index)).bfill()
    series_to_concat.append(time_greater_than_price_backfilled.reindex(index_equal_to_price))
df['next_time'] = pd.concat(series_to_concat, sort=False)
print(df)
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN
This gets me the desired result. When scaled up to some large dataframes, calculating this can take a few minutes. Does anyone have a better idea of how to approach this?
Edit: Clarification of constraints
We can assume the dataframe is sorted by time. Another way to word this: given any row n (Time_n, Price_n), 0 <= n <= len(df) - 1, find x such that Time_x > Time_n AND Price_x > Price_n AND there is no y such that n < y < x with Price_y > Price_n.
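The definition above can be written down directly as a brute-force reference. This is only a sketch for checking faster approaches on small inputs, not a solution to the performance problem: it is O(n^2), the function name next_time_brute_force is made up for illustration, and it assumes the time column is strictly increasing so that "later row" and "later time" mean the same thing.

```python
import numpy as np
import pandas as pd

def next_time_brute_force(df):
    """Hypothetical reference helper: direct translation of the definition, O(n^2)."""
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    result = np.full(len(df), np.nan)
    for n in range(len(df)):
        # Scan forward and stop at the first later row with a higher price
        for x in range(n + 1, len(df)):
            if prices[x] > prices[n]:
                result[n] = times[x]
                break
    return result

df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90],
                   'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
expected = next_time_brute_force(df)
```

Comparing any faster implementation against this array with np.allclose(..., equal_nan=True) is a cheap way to catch mistakes before scaling up.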
These solutions were faster when I tested with %timeit on this sample, but I tested on a larger dataframe and they were much slower than your solution. It would be interesting to see if any of the 3 solutions are faster on your larger dataframe. I would look into dask or check out: https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
I hope someone else is able to post a more efficient solution. Some different answers below:
Use a next one-liner that loops through both the time and price columns simultaneously with zip. The generator expression passed to next is written just like a list comprehension, but with parentheses instead of brackets, and next returns only the first value that satisfies the conditions. You also need to pass None as the second argument to next, so that rows with no match return None instead of raising StopIteration. Pass axis=1 to apply so that the lambda is called once per row. This should speed up performance, since the iteration stops after returning the first match and moves on to the next row instead of looping through the entire column.
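The behaviour of next with a generator expression can be seen in isolation, outside of apply. The snippet below is a small sketch using plain lists with the sample values; the variable names are chosen for illustration only.

```python
prices = [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]
times = [15, 30, 45, 60, 75, 90]

# Pretend we are looking at the row with time=30, price=10.01
current_price, current_time = 10.01, 30

# Generator expression: nothing is evaluated until next() asks for a value,
# and evaluation stops as soon as the first match has been produced.
first_match = next(
    (t for p, t in zip(prices, times) if p > current_price and t > current_time),
    None,  # default returned when no later, higher price exists
)

# No price exceeds 10.02, so the default is returned instead of StopIteration
no_match = next((t for p, t in zip(prices, times) if p > 10.02), None)
```

Note that the generator still starts from the first row every time, so rows near the end of a long dataframe pay for skipping everything before them.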
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
   time  price
0    15  10.00
1    30  10.01
2    45  10.00
3    60  10.01
4    75  10.02
5    90   9.99
df['next_time'] = df.apply(
    lambda x: next((z for (y, z) in zip(df['price'], df['time'])
                    if y > x['price'] if z > x['time']), None),
    axis=1)
df
Out[1]:
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN
As you can see, a list comprehension returns the same result, but in theory it will be a lot slower, because it always iterates over every row instead of stopping at the first match. The total number of iterations therefore grows significantly with a large dataframe.
df['next_time'] = df.apply(
    lambda x: [z for (y, z) in zip(df['price'], df['time'])
               if y > x['price'] if z > x['time']],
    axis=1).str[0]
df
Out[2]:
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN
Another option is to create a function with some numpy and np.where():
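import numpy as np

def closest(x):
    try:
        # First time at which each running-maximum price level was reached
        lst = df.groupby(df['price'].cummax())['time'].transform('first')
        lst = np.asarray(lst)
        # Keep only times later than x and return the nearest one
        lst = lst[lst>x]
        idx = (np.abs(lst - x)).argmin()
        return lst[idx]
    except ValueError:
        # argmin() on an empty array: there is no later time, so return None
        pass

# Use the following row when its price is higher, otherwise fall back to closest()
df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
                           df['time'].shift(-1),
                           df['time'].apply(lambda x: closest(x)))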
This one processed a variation of your dataframe with 1,000,000 rows and 162,000 unique prices in less than 7 seconds for me. Since you ran yours on 660,000 rows and 12,000 unique prices, I think the increase in speed would be somewhere around 100x-1000x.
The added complexity of your question is the condition that the closest higher price must be at a later time. This answer https://stackoverflow.com/a/53553226/6366770 helped me discover the bisect functions, but it didn't have your added complexity of having to rely on a time column. As such, I had to tackle the problem from a couple of different angles (as you mentioned in a comment regarding my np.where(), breaking it down into a couple of different methods).
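For readers who have not used bisect before, here is a minimal sketch of the underlying idea using the standard library's bisect.bisect_right on the sample prices. The helper name closest_higher is made up for illustration, and the sketch deliberately ignores the time constraint discussed above.

```python
from bisect import bisect_right

# Sample prices, sorted ascending (bisect requires sorted input)
sorted_prices = [9.99, 10.00, 10.00, 10.01, 10.01, 10.02]

def closest_higher(sorted_values, value):
    """Hypothetical helper: smallest element strictly greater than value, or None."""
    # bisect_right returns the insertion point to the right of any equal elements,
    # which is the index of the first strictly greater element.
    idx = bisect_right(sorted_values, value)
    return sorted_values[idx] if idx < len(sorted_values) else None

higher_than_10 = closest_higher(sorted_prices, 10.00)
higher_than_max = closest_higher(sorted_prices, 10.02)
```

Each lookup is a binary search, so it costs O(log n) rather than a full scan of the column.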
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})

def bisect_right(a, x, lo=0, hi=None):
    # Index of the first element of sorted a that is strictly greater than x (len(a) if none)
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo

def get_closest_higher(df, col, val):
    higher_idx = bisect_right(df[col].values, val)
    return higher_idx

# Sort by price so that bisect can locate the closest higher price for each row
df = df.sort_values(['price', 'time']).reset_index(drop=True)
df['next_time'] = df['price'].apply(lambda x: get_closest_higher(df, 'price', x))
df['next_time'] = df['next_time'].map(df['time'])
# Discard matches that occur at or before the row's own time
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])
# Back in time order, use the following row whenever its price is higher
df = df.sort_values('time').reset_index(drop=True)
df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
                           df['time'].shift(-1),
                           df['next_time'])
df['next_time'] = df['next_time'].ffill()
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])
df
Out[1]:
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN
David did come up with a great solution for finding the closest greater price at a later time. However, what I wanted was the very next occurrence of a greater price at a later time. Working with a coworker of mine, we found this solution.
Stack containing tuples (index, price)
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
   time  price
0    15  10.00
1    30  10.01
2    45  10.00
3    60  10.01
4    75  10.02
5    90   9.99
times = df['time'].to_numpy()
prices = df['price'].to_numpy()
stack = []
next_times = np.full(len(df), np.nan)
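for i in range(len(df)):
    # Pop every waiting row whose price is lower than the current price;
    # the current time is the answer for each of them
    while stack and prices[i] > stack[-1][1]:
        stack_time_index, stack_price = stack.pop()
        next_times[stack_time_index] = times[i]
    # The current row now waits for its own next higher price
    stack.append((i, prices[i]))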
df['next_time'] = next_times
print(df)
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN
This solution actually performs very fast. I am not totally sure, but I believe the complexity is O(n): it makes one full pass through the dataframe, and each row is pushed onto the stack once and popped at most once. The reason this performs so well is that the stack is essentially kept sorted, with the largest prices at the bottom and the smallest price at the top, so only the top of the stack ever needs to be compared against the current row.
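One way to convince yourself of the linear bound is to count the stack operations. The sketch below is a variation on the loop above, not the exact code from this answer: it stores only indices on the stack, counts pushes and pops, and checks the ordering of the stack after every step. The function name and the counters are made up for illustration, and the invariant check itself is a debugging aid that should be removed before timing anything.

```python
import numpy as np

def next_greater_with_counts(times, prices):
    """Hypothetical instrumented variant of the stack loop (indices only)."""
    stack = []  # indices of rows still waiting for a higher price
    next_times = np.full(len(prices), np.nan)
    pushes = pops = 0
    for i in range(len(prices)):
        while stack and prices[i] > prices[stack[-1]]:
            next_times[stack.pop()] = times[i]
            pops += 1
        stack.append(i)
        pushes += 1
        # Debug-only check: prices never increase from the bottom to the top
        assert all(prices[a] >= prices[b] for a, b in zip(stack, stack[1:]))
    return next_times, pushes, pops

times = np.array([15, 30, 45, 60, 75, 90])
prices = np.array([10.00, 10.01, 10.00, 10.01, 10.02, 9.99])
next_times, pushes, pops = next_greater_with_counts(times, prices)
```

Every row is pushed exactly once and can only be popped after being pushed, so the total number of pushes plus pops is at most 2n regardless of how the inner while loop is distributed across iterations.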
Here is my test with an actual dataframe in action
print(f'{len(df):,.0f} rows with {len(df["price"].unique()):,.0f} unique prices ranging from ${df["price"].min():,.2f} to ${df["price"].max():,.2f}')
667,037 rows with 11,786 unique prices ranging from $1,857.52 to $2,022.00
def find_next_time_with_greater_price(df):
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    stack = []
    next_times = np.full(len(df), np.nan)
    for i in range(len(df)):
        while stack and prices[i] > stack[-1][1]:
            stack_time_index, stack_price = stack.pop()
            next_times[stack_time_index] = times[i]
        stack.append((i, prices[i]))
    return next_times
%timeit -n10 -r10 df['next_time'] = find_next_time_with_greater_price(df)
434 ms ± 11.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
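The dataframe used for the timing above is not included in the post, so the number cannot be reproduced directly. As a rough sketch, a similar measurement can be run outside IPython with the standard timeit module on synthetic data. Everything about the data here is an assumption made for illustration: the random-walk prices, the starting level of 1900, the step size, the row count, and the seed.

```python
import timeit

import numpy as np
import pandas as pd

def find_next_time_with_greater_price(df):
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    stack = []
    next_times = np.full(len(df), np.nan)
    for i in range(len(df)):
        while stack and prices[i] > stack[-1][1]:
            stack_time_index, stack_price = stack.pop()
            next_times[stack_time_index] = times[i]
        stack.append((i, prices[i]))
    return next_times

# Synthetic stand-in for the real data (assumed shape, not the author's dataset)
rng = np.random.default_rng(0)
n_rows = 100_000
prices = np.round(1900 + rng.normal(0, 0.05, n_rows).cumsum(), 2)
synthetic = pd.DataFrame({'time': np.arange(n_rows), 'price': prices})

# Average over a few calls; timings will differ by machine and data
runs = 3
elapsed = timeit.timeit(lambda: find_next_time_with_greater_price(synthetic), number=runs) / runs
print(f'{elapsed * 1000:.0f} ms per call on {n_rows:,} synthetic rows')
```

Timings from this sketch are not comparable with the 434 ms figure above, since both the data and the hardware differ.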