
Apply row wise conditional function on dataframe python

I have a dataframe on which I want to run a function that checks whether the current value is a relative maximum, and whether the previous "n" values are all lower than the current value.

Given a dataframe 'df_data':

import pandas as pd

temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63, 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99, 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1]
df_data = pd.DataFrame(temp_list, columns=['high'])  # the code below refers to this column as 'high'

First I create a function that checks the conditions above:

def get_max(high, rolling_max, prev, post):
    if ((high > prev) & (high > post) & (high > rolling_max)):
        return 1
    else:
        return 0

df_data['rolling_max'] = df_data.high.rolling(n).max().shift()

Then I apply that condition row wise:

df_data['ismax'] = df_data.apply(lambda x: get_max(df_data['high'], df_data['rolling_max'],df_data['high'].shift(1),df_data['high'].shift(-1)),axis = 1)

The problem is that I always get the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This happens because the boolean condition in the 'get_max' function is being applied to a whole Series.
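To make the cause of the error concrete, here is a minimal sketch, separate from the original code, that reproduces it on a three-value Series. The threshold of 130 and the variable names are illustrative assumptions only.

```python
import pandas as pd

high = pd.Series([128.71, 130.2242, 131.0])

# An element-wise comparison returns a boolean Series, not a single bool
mask = high > 130

error_message = None
try:
    if mask:  # Python asks for bool(mask); pandas refuses to guess
        pass
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Reducing the Series explicitly is unambiguous
any_above = bool(mask.any())  # at least one value is above the threshold
all_above = bool(mask.all())  # every value is above the threshold
```

The same thing happens inside get_max when whole columns are passed in: each comparison yields a Series, and the `if` statement cannot collapse it to one truth value.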

I would love to have a vectorized function that does not use loops.

Try:

df_data['ismax'] = ((df_data['high'].gt(df_data.high.rolling(n).max().shift())) & (df_data['high'].gt(df_data['high'].shift(1))) & (df_data['high'].gt(df_data['high'].shift(-1)))).astype(int)
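The one-liner above can be easier to follow when split into named steps. The sketch below is one way to do that; the helper name flag_relative_max and the window sizes 2 and 3 are assumptions chosen for illustration, since the question does not fix a value for n.

```python
import pandas as pd

def flag_relative_max(high: pd.Series, n: int) -> pd.Series:
    """Return 1 where a value beats both neighbours and the previous n values."""
    # Max of the n values before the current row (NaN for the first n rows)
    prev_window_max = high.rolling(n).max().shift()
    above_window = high.gt(prev_window_max)
    above_prev = high.gt(high.shift(1))
    above_next = high.gt(high.shift(-1))
    # Comparisons against NaN are False, so edge rows are never flagged
    return (above_window & above_prev & above_next).astype(int)

high = pd.Series(
    [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63, 131.63,
     131.0499, 131.74, 133.6116, 134.74, 135.99, 138.789, 137.34,
     133.46, 132.43, 134.405, 128.31, 129.1],
    name='high',
)

peaks_n2 = flag_relative_max(high, 2)
peaks_n3 = flag_relative_max(high, 3)

print(peaks_n2[peaks_n2 == 1].index.tolist())  # [3, 6, 13, 17]
print(peaks_n3[peaks_n3 == 1].index.tolist())  # [3, 6, 13]
```

With n = 3 the value at index 17 (134.405) is no longer flagged, because 137.34 appears among the three preceding values.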

The error occurs because you are passing the entire series (the entire column) to your get_max function rather than one row at a time. Creating new columns for the shifted "prev" and "post" values and then using df.apply(func, axis = 1) in the usual way will work fine here.
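As a rough sketch of that row-wise fix, keeping the question's get_max signature: the lambda reads values from the row it receives instead of from df_data, so every argument is a scalar. The window size n = 2 and the column names high_prev and high_post are assumptions for this example.

```python
import pandas as pd

n = 2  # assumed look-back window; the question leaves n unspecified

df_data = pd.DataFrame({'high': [
    128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63, 131.63,
    131.0499, 131.74, 133.6116, 134.74, 135.99, 138.789, 137.34,
    133.46, 132.43, 134.405, 128.31, 129.1,
]})

# Precompute the values each row needs from its neighbours
df_data['rolling_max'] = df_data['high'].rolling(n).max().shift()
df_data['high_prev'] = df_data['high'].shift(1)
df_data['high_post'] = df_data['high'].shift(-1)

def get_max(high, rolling_max, prev, post):
    # All arguments are scalars here, so a plain `if` is unambiguous;
    # comparisons against NaN are False, so edge rows return 0
    if (high > prev) and (high > post) and (high > rolling_max):
        return 1
    return 0

df_data['ismax'] = df_data.apply(
    lambda row: get_max(row['high'], row['rolling_max'],
                        row['high_prev'], row['high_post']),
    axis=1,
)

print(df_data.index[df_data['ismax'] == 1].tolist())  # [3, 6, 13, 17]
```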

As you have hinted, this solution is quite inefficient: looping through every row becomes much slower as your dataframe grows.

On my computer, the code below reports the following timings:

  • LIST_MULTIPLIER = 1, Vectorised code: 0.29s, Row-wise code: 0.38s
  • LIST_MULTIPLIER = 100, Vectorised code: 0.31s, Row-wise code: 13.27s

In general, therefore, it is best to avoid df.apply(..., axis = 1), as you can almost always get a better solution using logical operators.

import pandas as pd
from datetime import datetime

LIST_MULTIPLIER = 100
ITERATIONS = 100

def get_dataframe():
    temp_list = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63, 
                 131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99, 
                 138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1] * LIST_MULTIPLIER
    df = pd.DataFrame(temp_list)
    df.columns = ['high']
    return df

df_original = get_dataframe()

t1 = datetime.now()

for i in range(ITERATIONS):
    df = df_original.copy()
    df['rolling_max'] = df.high.rolling(2).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    
    mask_prev = df['high'] > df['high_prev']
    mask_post = df['high'] > df['high_post']
    mask_rolling = df['high'] > df['rolling_max']
    
    mask_max = mask_prev & mask_post & mask_rolling
    
    df['ismax'] = 0
    df.loc[mask_max, 'ismax'] = 1
    
    
t2 = datetime.now()
print(f"{t2 - t1}")
df_first_method = df.copy()


t3 = datetime.now()

def get_max_rowwise(row):
    if ((row.high > row.high_prev) & 
        (row.high > row.high_post) & 
        (row.high > row.rolling_max)):
        return 1
    else: 
        return 0
    
for i in range(ITERATIONS):
    df = df_original.copy()
    df['rolling_max'] = df.high.rolling(2).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    df['ismax'] = df.apply(get_max_rowwise, axis = 1)

t4 = datetime.now()
print(f"{t4 - t3}")
df_second_method = df.copy()
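The benchmark keeps df_first_method and df_second_method but does not compare them. A minimal self-contained sketch of such a check is shown below; the helper names add_helper_columns, ismax_vectorised and ismax_rowwise are hypothetical, and the window of 2 matches the benchmark above.

```python
import pandas as pd

def add_helper_columns(df: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    # Same helper columns as in the benchmark above
    df = df.copy()
    df['rolling_max'] = df['high'].rolling(n).max().shift()
    df['high_prev'] = df['high'].shift(1)
    df['high_post'] = df['high'].shift(-1)
    return df

def ismax_vectorised(df: pd.DataFrame) -> pd.Series:
    # Combine the three column-wise comparisons into one boolean mask
    mask = ((df['high'] > df['high_prev'])
            & (df['high'] > df['high_post'])
            & (df['high'] > df['rolling_max']))
    return mask.astype(int)

def ismax_rowwise(df: pd.DataFrame) -> pd.Series:
    # Evaluate the same condition one row at a time
    def check(row):
        return int((row['high'] > row['high_prev'])
                   and (row['high'] > row['high_post'])
                   and (row['high'] > row['rolling_max']))
    return df.apply(check, axis=1)

values = [128.71, 130.2242, 131.0, 131.45, 129.69, 130.17, 132.63,
          131.63, 131.0499, 131.74, 133.6116, 134.74, 135.99,
          138.789, 137.34, 133.46, 132.43, 134.405, 128.31, 129.1]
df = add_helper_columns(pd.DataFrame({'high': values}))

vectorised = ismax_vectorised(df)
rowwise = ismax_rowwise(df)
methods_agree = bool((vectorised == rowwise).all())
print(methods_agree)  # True
```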
