Apply a row-wise conditional function on a DataFrame in Python

I have a dataframe on which I want to run a function that checks whether the current value is a relative maximum, and whether the previous 'n' values are all lower than the current value.

Having a dataframe 'df_data':

temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63, 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99, 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1]
df_data = pd.DataFrame(temp_list, columns=['high'])

First I create a function that checks the conditions above:

def get_max(high, rolling_max, prev, post):
    if ((high > prev) & (high > post) & (high > rolling_max)):
        return 1
    else:
        return 0
df_data['rolling_max'] = df_data.high.rolling(n).max().shift()

Then I apply the function row-wise:

df_data['ismax'] = df_data.apply(lambda x: get_max(df_data['high'], df_data['rolling_max'],df_data['high'].shift(1),df_data['high'].shift(-1)),axis = 1)

The problem is that I always get the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This comes from applying the boolean condition in the 'get_max' function to a whole Series.

I would love to have a vectorized solution that does not use loops.

Try:

df_data['ismax'] = ((df_data['high'].gt(df_data.high.rolling(n).max().shift())) & (df_data['high'].gt(df_data['high'].shift(1))) & (df_data['high'].gt(df_data['high'].shift(-1)))).astype(int)
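The one-liner above can be hard to read, so here is a minimal sketch of the same logic split into named steps and run on the sample data from the question. The helper name `flag_relative_max` and the choice `n=2` are my own assumptions for illustration; they are not from the original post.

```python
import pandas as pd

def flag_relative_max(high: pd.Series, n: int) -> pd.Series:
    """Return 1 where a value beats both neighbours and the previous n values."""
    # Max of the n values strictly before each row (NaN for the first n rows).
    prev_n_max = high.rolling(n).max().shift()
    is_max = (
        high.gt(prev_n_max)          # higher than the previous n values
        & high.gt(high.shift(1))     # higher than the previous value
        & high.gt(high.shift(-1))    # higher than the next value
    )
    return is_max.astype(int)

temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63,
             131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99,
             138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1]
df_data = pd.DataFrame({'high': temp_list})

df_data['ismax'] = flag_relative_max(df_data['high'], n=2)
peaks = df_data.index[df_data['ismax'] == 1].tolist()
print(peaks)  # [3, 6, 13, 17]
```

Comparisons against NaN evaluate to False, so the first `n` rows and the last row are never flagged.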

The error occurs because you are passing the entire Series (the whole column) to your get_max function on every call, rather than the values of a single row. Creating new columns for the shifted "prev" and "post" values and then calling df.apply(func, axis=1) on a function that reads from the row will work fine here.
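To see the cause in isolation, here is a small sketch that reproduces the same ValueError. The toy Series `s` is a made-up example, not the data from the question.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0])  # hypothetical toy data

# An element-wise comparison returns a boolean Series, not a single bool.
mask = s > s.shift(1)
print(mask.tolist())  # [False, True, False]

try:
    if mask:  # 'if' needs one truth value, but mask holds three
        pass
except ValueError as err:
    message = str(err)
    print(message)
```

Inside get_max the same thing happens: each argument is a full column, so the condition in the `if` statement is a Series and pandas refuses to reduce it to one boolean.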

As you have hinted, though, the row-wise approach is quite inefficient: looping through every row becomes much slower as your dataframe grows.

On my computer, the code below reports the following totals over 100 iterations:

  • LIST_MULTIPLIER = 1, Vectorised code: 0.29s, Row-wise code: 0.38s
  • LIST_MULTIPLIER = 100, Vectorised code: 0.31s, Row-wise code: 13.27s

In general, therefore, it is best to avoid df.apply(..., axis=1), as you can almost always get a better solution using vectorised logical operators.

import pandas as pd
from datetime import datetime

LIST_MULTIPLIER = 100
ITERATIONS = 100

def get_dataframe():
    temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63, 
                 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99, 
                 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1] * LIST_MULTIPLIER
    df = pd.DataFrame(temp_list)
    df.columns = ['high']
    return df

df_original = get_dataframe()

t1 = datetime.now()

for i in range(ITERATIONS):
    df = df_original.copy()
    df['rolling_max'] = df.high.rolling(2).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    
    mask_prev = df['high'] > df['high_prev']
    mask_post = df['high'] > df['high_post']
    mask_rolling = df['high'] > df['rolling_max']
    
    mask_max = mask_prev & mask_post & mask_rolling
    
    df['ismax'] = 0
    df.loc[mask_max, 'ismax'] = 1
    
    
t2 = datetime.now()
print(f"{t2 - t1}")
df_first_method = df.copy()


t3 = datetime.now()

def get_max_rowwise(row):
    if ((row.high > row.high_prev) & 
        (row.high > row.high_post) & 
        (row.high > row.rolling_max)):
        return 1
    else: 
        return 0
    
for i in range(ITERATIONS):
    df = df_original.copy()
    df['rolling_max'] = df.high.rolling(2).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    df['ismax'] = df.apply(get_max_rowwise, axis = 1)

t4 = datetime.now()
print(f"{t4 - t3}")
df_second_method = df.copy()
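The benchmark keeps both results but never compares them. As a rough sanity check, the sketch below rebuilds both versions on the original 20 values and confirms they flag the same rows. It is a simplified restatement, not the benchmark itself: the row-wise function uses `and` instead of `&`, and the window of 2 is taken from the code above.

```python
import pandas as pd

temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63,
             131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99,
             138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1]
df = pd.DataFrame({'high': temp_list})
df['rolling_max'] = df['high'].rolling(2).max().shift()
df['high_prev'] = df['high'].shift(1)
df['high_post'] = df['high'].shift(-1)

# Vectorised: whole-column comparisons combined with &.
vectorised = ((df['high'] > df['high_prev'])
              & (df['high'] > df['high_post'])
              & (df['high'] > df['rolling_max'])).astype(int)

# Row-wise: each call receives one row, so the comparisons are scalar.
def get_max_rowwise(row):
    return int(row.high > row.high_prev
               and row.high > row.high_post
               and row.high > row.rolling_max)

rowwise = df.apply(get_max_rowwise, axis=1)

same = vectorised.tolist() == rowwise.tolist()
print(same)  # True
```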
