Pandas DataFrames: Efficiently find next value in one column where another column has a greater value

Question

The title describes my situation. I already have a working version of this, but it is very inefficient when scaled to large DataFrames (>1M rows). I was wondering if anyone has a better idea of doing this.

Example with solution and code

Create a new column next_time that holds the next value of time at which price is greater than the price in the current row.

import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
   time  price
0    15  10.00
1    30  10.01
2    45  10.00
3    60  10.01
4    75  10.02
5    90   9.99

series_to_concat = []
for price in df['price'].unique():
    index_equal_to_price = df[df['price'] == price].index
    series_time_greater_than_price = df[df['price'] > price]['time']
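    # Align the times of greater prices with this price's rows, then backfill so each
    # row picks up the next later time with a greater price (.bfill() replaces the
    # deprecated fillna(method='backfill'))
    time_greater_than_price_backfilled = (
        series_time_greater_than_price
        .reindex(index_equal_to_price.union(series_time_greater_than_price.index))
        .bfill()
    )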

    series_to_concat.append(time_greater_than_price_backfilled.reindex(index_equal_to_price))

df['next_time'] = pd.concat(series_to_concat, sort=False)

print(df)
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN

This gets me the desired result. When scaled up to some large dataframes, calculating this can take a few minutes. Does anyone have a better idea of how to approach this?

Edit: Clarification of constraints

We can assume the dataframe is sorted by time. Another way to word this would be: given any row n (Time_n, Price_n), 0 <= n <= len(df) - 1, find x such that Time_x > Time_n AND Price_x > Price_n AND there is no y such that n < y < x with Price_y > Price_n.
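
The constraint above can be written down directly as a brute-force reference. This is only a sketch for checking faster solutions against, not something to run on 1M rows: it is O(n^2) in the worst case. The function name next_time_brute_force is made up for illustration and assumes the dataframe is already sorted by time.

```python
import numpy as np
import pandas as pd


def next_time_brute_force(df):
    """O(n^2) reference: for each row n, scan forward for the first x with a higher price."""
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    result = np.full(len(df), np.nan)
    for n in range(len(df)):
        for x in range(n + 1, len(df)):
            if prices[x] > prices[n]:
                # First later row with a greater price, so no y in between qualifies
                result[n] = times[x]
                break
    return result


df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90],
                   'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
expected = next_time_brute_force(df)
print(expected)
```

On the sample data this reproduces the next_time column shown in the question.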

Answer 1

These solutions were faster than yours when I tested them with %timeit on this small sample, but when I tested on a larger dataframe they were much slower than your solution. It would be interesting to see whether any of the 3 solutions is faster on your larger dataframe. I would also look into dask or check out: https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html

I hope someone else is able to post a more efficient solution. Some different answers below:

You can achieve this with a one-liner that uses next and loops through the time and price columns simultaneously with zip. The argument to next is a generator expression: it is written like a list comprehension, but with parentheses instead of brackets, and next returns only the first value that satisfies the conditions. You also need to pass None as the second argument to next, so that it returns None instead of raising StopIteration when no row matches.
You need to pass axis=1 to apply, so that the lambda is called once per row.

This should speed things up compared with scanning the whole column for every row, because iteration stops as soon as the first matching value is returned and then moves on to the next row.

import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
   time  price
0    15  10.00
1    30  10.01
2    45  10.00
3    60  10.01
4    75  10.02
5    90   9.99

df['next_time'] = (df.apply(lambda x: next((z for (y, z) in zip(df['price'], df['time'])
                                            if y > x['price'] if z > x['time']), None), axis=1))
df
Out[1]: 
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN

As you can see, a list comprehension returns the same result, but in theory it will be a lot slower: it always builds the full list of matches for every row, so the total number of iterations increases significantly, especially with a large dataframe.

df['next_time'] = (df.apply(lambda x: [z for (y, z) in zip(df['price'], df['time'])
                                       if y > x['price'] if z > x['time']], axis=1)).str[0]
df
Out[2]: 
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN
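
To make the difference between the two versions above concrete, here is a small sketch outside of pandas. The helper first_greater is hypothetical and exists only to record which elements get examined; it shows that next on a generator expression stops at the first match, and that the None default covers the case where nothing matches.

```python
def first_greater(values, threshold):
    """Return the first value above threshold, plus every value that was examined."""
    examined = []

    def check(v):
        examined.append(v)  # record each element the generator actually visits
        return v > threshold

    # Generator expression: nothing is evaluated until next() asks for a value,
    # and evaluation stops as soon as one value passes the check.
    first = next((v for v in values if check(v)), None)
    return first, examined


prices = [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]

print(first_greater(prices, 10.00))  # (10.01, [10.0, 10.01])
print(first_greater(prices, 10.02))  # no match: None, and all six values were examined
```

A list comprehension in the same place would examine all six values in both calls.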

Another option is to create a function that uses some numpy, combined with np.where():

import numpy as np


def closest(x):
    try:
        lst = df.groupby(df['price'].cummax())['time'].transform('first')
        lst = np.asarray(lst)
        lst = lst[lst>x] 
        idx = (np.abs(lst - x)).argmin() 
        return lst[idx]
    except ValueError:
        pass


df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
                            df['time'].shift(-1),
                            df['time'].apply(lambda x: closest(x)))

Answer 2

This one processed a variation of your dataframe with 1,000,000 rows and 162,000 unique prices in less than 7 seconds for me. Since you ran yours on 660,000 rows and 12,000 unique prices, I think the increase in speed would be 100x-1000x.

The added complexity of your question is the condition that the closest higher price must be at a later time. This answer https://stackoverflow.com/a/53553226/6366770 helped me discover the bisect functions, but it didn't have your added complexity of having to rely on a time column. As such, I had to tackle the problem from a couple of different angles and break it down into a couple of different methods, as you mentioned in a comment regarding my np.where() approach.

import numpy as np
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})

def bisect_right(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo


def get_closest_higher(df, col, val):
    higher_idx = bisect_right(df[col].values, val)
    return higher_idx


df = df.sort_values(['price', 'time']).reset_index(drop=True)
df['next_time'] = df['price'].apply(lambda x: get_closest_higher(df, 'price', x))

df['next_time'] = df['next_time'].map(df['time'])
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'] )
df = df.sort_values('time').reset_index(drop=True)
df['next_time'] = np.where((df['price'].shift(-1) > df['price'])
                           ,df['time'].shift(-1),
                           df['next_time'])
df['next_time'] = df['next_time'].ffill()
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])
df

Out[1]: 
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN
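
The hand-written bisect_right above follows the same logic as the standard library's bisect.bisect_right, and on a sorted array np.searchsorted with side='right' should return the same positions in one vectorized call. The sketch below only checks that equivalence on the sample prices after sorting; it is not a drop-in replacement for the full answer, and the variable names are illustrative.

```python
import bisect

import numpy as np

# The sample prices, sorted ascending as in the answer's sort_values step
sorted_prices = np.array([9.99, 10.00, 10.00, 10.01, 10.01, 10.02])

# Position of the first element strictly greater than each price, one lookup at a time
idx_stdlib = [bisect.bisect_right(sorted_prices.tolist(), p) for p in sorted_prices]

# The same positions for all prices in a single vectorized call
idx_numpy = np.searchsorted(sorted_prices, sorted_prices, side='right')

print(idx_stdlib)          # [1, 3, 3, 5, 5, 6]
print(idx_numpy.tolist())  # [1, 3, 3, 5, 5, 6]
```

An index equal to len(sorted_prices) (6 here) means no higher price exists, which is why mapping it back to a time yields NaN for the row with the maximum price.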

Answer 3

David came up with a great solution for finding the closest greater price at a later time. However, what I wanted was the very next occurrence of a greater price at a later time. Working with a coworker of mine, we found this solution.

Keep a stack containing tuples (index, price)

Iterate through all rows (index i)
While the stack is non-empty AND the top of the stack has a lesser price than prices[i], pop it and fill in the popped index with times[i]
Push (i, prices[i]) onto the stack

import numpy as np
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
   time  price
0    15  10.00
1    30  10.01
2    45  10.00
3    60  10.01
4    75  10.02
5    90   9.99

times = df['time'].to_numpy()
prices = df['price'].to_numpy()
stack = []
next_times = np.full(len(df), np.nan)
for i in range(len(df)):
    while stack and prices[i] > stack[-1][1]:
        stack_time_index, stack_price = stack.pop()
        next_times[stack_time_index] = times[i]
    stack.append((i, prices[i]))
df['next_time'] = next_times

print(df)
   time  price  next_time
0    15  10.00       30.0
1    30  10.01       75.0
2    45  10.00       60.0
3    60  10.01       75.0
4    75  10.02        NaN
5    90   9.99        NaN

This solution actually performs very fast. I believe the complexity is O(n): it is one full pass through the entire dataframe, and each row is pushed onto the stack once and popped at most once. The reason this performs so well is that the stack is always sorted by price, with the largest prices at the bottom and the smallest price at the top, so the current row only ever needs to be compared with the top of the stack.
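
One way to convince yourself of the linear behaviour is to count the pops. The sketch below is a variant of the loop above that keeps only row indices on the stack (the price can be looked up from the index) and returns the number of pops. The function name and the pop counter are additions for illustration, not part of the original solution.

```python
import numpy as np


def next_greater_times(times, prices):
    """Monotonic-stack pass; returns the next_time array and the number of pops."""
    next_times = np.full(len(prices), np.nan)
    stack = []  # indices of rows still waiting for a greater price
    pops = 0
    for i in range(len(prices)):
        # Resolve every waiting row whose price is lower than the current price
        while stack and prices[i] > prices[stack[-1]]:
            next_times[stack.pop()] = times[i]
            pops += 1
        stack.append(i)  # each row is pushed exactly once
    return next_times, pops


times = np.array([15, 30, 45, 60, 75, 90])
prices = np.array([10.00, 10.01, 10.00, 10.01, 10.02, 9.99])
result, pops = next_greater_times(times, prices)
print(result)
print(pops)  # 4: one pop per row that received a next_time
```

Since every row is pushed once and popped at most once, the inner while loop runs at most len(df) times in total across the whole pass, regardless of how the prices are ordered.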

Here is my test with an actual dataframe in action

print(f'{len(df):,.0f} rows with {len(df["price"].unique()):,.0f} unique prices ranging from ${df["price"].min():,.2f} to ${df["price"].max():,.2f}')
667,037 rows with 11,786 unique prices ranging from $1,857.52 to $2,022.00

def find_next_time_with_greater_price(df):
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    stack = []
    next_times = np.full(len(df), np.nan)
    for i in range(len(df)):
        while stack and prices[i] > stack[-1][1]:
            stack_time_index, stack_price = stack.pop()
            next_times[stack_time_index] = times[i]
        stack.append((i, prices[i]))
    return next_times

%timeit -n10 -r10 df['next_time'] = find_next_time_with_greater_price(df)
434 ms ± 11.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

3 answers

solution1
1 2020-09-24 22:43:17

solution2
1 2020-09-25 23:37:33

solution3
1 ACCEPTED 2020-09-28 17:48:56
