
Conditional Cumulative Sum of Multiple Rows in Dataframe

I am trying to find the cumulative sum for four consecutive rows in a dataframe based on a condition.

The new column ('veh_time_TOT') is the sum of four consecutive 'veh-time(s)' values, and the condition is that all four rows share the same 'Day_type' (Weekend or Weekday).

Here is how the data is now set up:

    veh-time(s)  distance(m)  Day_type
0            72        379.0   Weekday
1            70        379.0   Weekday
2            50        379.0   Weekday
3            60        379.0   Weekday
4            70        379.0   Weekday
5            65        379.0   Weekday
6            30        379.0   Weekend
7            35        379.0   Weekend
8            30        379.0   Weekend
9            30        379.0   Weekend
10           20        379.0   Weekend

Here is the desired output:

    veh-time(s)  distance(m)  Day_type  veh_time_TOT
0            72        379.0   Weekday             0
1            70        379.0   Weekday             0
2            50        379.0   Weekday             0
3            60        379.0   Weekday           252
4            70        379.0   Weekday           250
5            65        379.0   Weekday           245
6            30        379.0   Weekend             0
7            35        379.0   Weekend             0
8            30        379.0   Weekend             0
9            30        379.0   Weekend           125
10           20        379.0   Weekend           115

I've tried several things, but the only thing I could find is the .cumsum function, which, as far as I can tell, only finds the sum for 2 consecutive rows. The zeros in "veh_time_TOT" are there because fewer than 4 rows with the same 'Day_type' are available at that point to make up the sum.

My thinking is that this would be a combination of .cumsum and a conditional if statement inside a loop.
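For reference, here is a small sketch of the difference on the first six 'veh-time(s)' values: .cumsum() keeps a running total over all rows so far, while .rolling(4).sum() sums a fixed window of four rows. The variable names are only for illustration.

```python
import pandas as pd

# First six 'veh-time(s)' values from the example data
s = pd.Series([72, 70, 50, 60, 70, 65])

# Running total over every row seen so far
running_total = s.cumsum()

# Sum of the current row and the three rows before it;
# the first three entries are NaN because the window is incomplete
window_sum = s.rolling(4).sum()

print(running_total.tolist())
print(window_sum.tolist())
```

Neither call looks at 'Day_type' on its own, so the condition still has to be handled separately.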

What do you guys think? Any help is appreciated.

Here are the steps I took to get the desired column:

  • First, I set up your example DataFrame.

  • Next, I defined the three columns of interest: the column whose values are the basis of the calculation ( col_val ), the column used for comparison ( col_compare ), and the name of the column holding the calculated quantity ( col_name_new ).

  • After that, I found all the rows that are eligible for this calculation (the row and the three rows before it all have the same value for col_compare ).
  • I then iterated over this slice of the original DataFrame, summing col_val over each eligible row and the three rows before it.

  • Lastly, I created the new column with the desired name of col_name_new :

    • Initialize its values to zero
    • Fill the eligible locations with the list generated in the previous step

Here is my code; feel free to ask questions in the comments!

import pandas as pd

# Setup

cols = ['veh-time(s)', 'distance(m)', 'Day_type']

data= [[72,  379.0 ,  'Weekday'],
       [70,  379.0 ,  'Weekday'],
       [50,  379.0 ,  'Weekday'],
       [60,  379.0 ,  'Weekday'],
       [70,  379.0 ,  'Weekday'],
       [65,  379.0 ,  'Weekday'],
       [30,  379.0 ,  'Weekend'],
       [35,  379.0 ,  'Weekend'],
       [30,  379.0 ,  'Weekend'],
       [30,  379.0 ,  'Weekend'],
       [20,  379.0 ,  'Weekend']]


df = pd.DataFrame(data,columns=cols )

# Define columns for potential future generalization

col_val='veh-time(s)'
col_compare='Day_type'
col_name_new = 'veh_time_TOT'

# DataFrame slice of rows eligible for calculation

cut_prev_four =  (df[col_compare].shift(1)==df[col_compare]) \
                &(df[col_compare].shift(2)==df[col_compare].shift(1)) \
                &(df[col_compare].shift(3)==df[col_compare].shift(2))

df_consecutive = df[cut_prev_four]

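# Perform calculation on eligible rows. Store in list.
# Note: i is the index label; using it with iloc relies on the default RangeIndex.

prev_four_list = []
for i, row in df_consecutive.iterrows():
    # Current row plus the three rows before it
    prev_four_vals = df.iloc[i-3:i+1][col_val].values
    print(i, prev_four_vals, sum(prev_four_vals))  # debug output
    prev_four_list.append(sum(prev_four_vals))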

# Set new column to the calculated values

df[col_name_new] = 0
df.loc[cut_prev_four, col_name_new] = prev_four_list
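
A more compact variant of the same idea, offered as a sketch rather than a drop-in replacement: label each run of consecutive rows with the same Day_type, then take a 4-row rolling sum within each run. The name run_id is only for illustration, and the sketch assumes that "consecutive" means adjacent rows in the current row order, as in the loop above.

```python
import pandas as pd

df = pd.DataFrame({
    'veh-time(s)': [72, 70, 50, 60, 70, 65, 30, 35, 30, 30, 20],
    'distance(m)': [379.0] * 11,
    'Day_type': ['Weekday'] * 6 + ['Weekend'] * 5,
})

# Label each block of consecutive rows sharing the same Day_type
# (run_id is an illustrative name, not part of the original code)
run_id = (df['Day_type'] != df['Day_type'].shift()).cumsum()

# 4-row rolling sum inside each block; the first three rows of a block are NaN
rolled = df.groupby(run_id)['veh-time(s)'].transform(lambda s: s.rolling(4).sum())

# Replace the incomplete windows with 0, as in the desired output
df['veh_time_TOT'] = rolled.fillna(0).astype(int)

print(df)
```

Grouping on run_id rather than on Day_type itself keeps a window from spanning two separate Weekday (or Weekend) blocks if the data alternates between them.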
