How to apply a function on a DataFrame Column using multiple rows and columns as input?

I have a sequence of events, and based on some variables (previous command, previous/current code and previous/current status) I need to decide which command is related to that event.

I already have code that works as expected, but it is rather slow. I tried to use df.apply, but I don't think it can take more than the current element as input. (The loop starts at 1 because the first row is always a "begin" command.)

def mark_commands(df):
    for i in range(1, len(df)):
        prev_command = df.loc[i-1, 'Command']
        prev_code, cur_code = df.loc[i-1, 'Code'], df.loc[i, 'Code']
        prev_status, cur_status = df.loc[i-1, 'Status'], df.loc[i, 'Status']

        if (prev_command == "end" and 
            ((cur_code == 810 and cur_status in [10, 15]) or 
            (cur_code == 830 and cur_status == 15))):

            df.loc[i, 'Command'] = "ignore"

        elif ((cur_code == 800 and cur_status in [20, 25]) or 
            (cur_code in [810, 830] and cur_status in [10, 15])):

            df.loc[i, 'Command'] = "end"

        elif ((prev_code != 800) and 
            ((cur_code == 820 and cur_status == 25) or 
            (cur_code == 820 and cur_status == 20 and 
                prev_code in [810, 820] and prev_status == 20) or 
            (cur_code == 830 and cur_status == 25 and 
                prev_code == 820 and prev_status == 20))):

            df.loc[i, 'Command'] = "continue"

        else:

            df.loc[i, 'Command'] = "begin"

    return df

And here is a correctly labeled sample in CSV format (it can also serve as input, since the only difference is that the Command column is empty after the first begin):

Code,Status,Command
810,20,begin
810,10,end
810,25,begin
810,15,end
810,15,ignore
810,20,begin
810,10,end
810,25,begin
810,15,end
810,15,ignore
810,20,begin
800,20,end
810,10,ignore
810,25,begin
820,25,continue
820,25,continue
820,25,continue
820,25,continue
800,25,end

Your code is mostly fine. You could have used df.iterrows() in the for loop, which is more robust if your index is not a plain sequential integer index, but it would not have changed the speed.

After trying extensively to use df.apply, I realized there is a fatal flaw: your "Command" column is updated continuously from one row to the next, while df.apply evaluates each row on its own and cannot see the values it has just produced for earlier rows. The following would not work:

df['Command'] = df.apply(lambda row: mark_commands(row), axis=1)
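
To make the distinction concrete: the previous Code and Status are fixed inputs, so they can be placed next to the current row with shift(). Only Command depends on results computed during the pass itself. A minimal sketch, where the column names prev_code and prev_status are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Code": [810, 810, 820], "Status": [20, 10, 25]})

# shift() moves values down one row, so each row sees its predecessor's value.
# The first row has no predecessor and gets NaN.
df["prev_code"] = df["Code"].shift()
df["prev_status"] = df["Status"].shift()
```

This covers two of the three "previous" inputs, but not prev_command, which does not exist until the preceding row has been labeled.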

You could also insert a continue statement after the assignment in each if/elif branch to go directly to the next iteration, although in an if/elif chain the remaining branches are already skipped once one matches, so any saving is negligible:

if (prev_command == "end" and ....) :
    df.loc[i, 'Command'] = "ignore"
    continue

That being said, your code works great.
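
If speed is still a concern, one option that keeps the same row-by-row logic is to avoid df.loc inside the loop, since scalar lookups on a DataFrame are usually much slower than indexing a plain Python list. The sketch below is not part of the original code: the function name mark_commands_fast is hypothetical, and it assumes the first Command value is already set, as in the question.

```python
import pandas as pd

def mark_commands_fast(df):
    # Pull the columns into plain Python lists once, instead of calling
    # df.loc on every iteration.
    codes = df["Code"].tolist()
    statuses = df["Status"].tolist()
    commands = df["Command"].tolist()  # row 0 keeps its existing value

    for i in range(1, len(codes)):
        prev_command = commands[i - 1]
        prev_code, cur_code = codes[i - 1], codes[i]
        prev_status, cur_status = statuses[i - 1], statuses[i]

        if prev_command == "end" and (
            (cur_code == 810 and cur_status in (10, 15))
            or (cur_code == 830 and cur_status == 15)
        ):
            commands[i] = "ignore"
        elif (cur_code == 800 and cur_status in (20, 25)) or (
            cur_code in (810, 830) and cur_status in (10, 15)
        ):
            commands[i] = "end"
        elif prev_code != 800 and (
            (cur_code == 820 and cur_status == 25)
            or (cur_code == 820 and cur_status == 20
                and prev_code in (810, 820) and prev_status == 20)
            or (cur_code == 830 and cur_status == 25
                and prev_code == 820 and prev_status == 20)
        ):
            commands[i] = "continue"
        else:
            commands[i] = "begin"

    # Write the whole column back in a single assignment, on a copy.
    out = df.copy()
    out["Command"] = commands
    return out
```

Unlike the original function, this version returns a new DataFrame rather than modifying df in place. Because it works by position, it also does not depend on the index being 0, 1, 2, and so on.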
