How to set two different flags based on two thresholds when a column value changes in a pandas DataFrame

Question

I have a DataFrame, say df (the last two columns should be treated as datetime64[ns], not str):

import pandas as pd

data = [['abc', 'abc1', '1_1',    '2021-06-01 06:00:00.035999', '2021-06-02 09:59:59.964000'],
 ['abc',  'abc1',  '1_2',  '2021-06-01 06:00:00.035999', '2021-06-02 09:59:59.964000'],
 ['abc',  'abc2',  '1_1',  '2021-06-01 06:00:00.035999', '2021-06-01 20:59:59.964001'],
 ['abc',  'abc2',  '1_2',  '2021-06-01 06:00:00.035999', '2021-06-01 20:59:59.964001'],
 ['abc',  'abc3',  '1_1',  '2021-06-01 06:00:00.035999', '2021-06-03 06:29:59.964000'],
 ['abc',  'abc3',  '1_2',  '2021-06-01 06:00:00.035999', '2021-06-03 06:29:59.964000'],
 ['abc',  'abc3',  '2_1',  '2021-06-04 06:30:00.000001', '2021-06-04 07:44:59.927999'],
 ['abc',  'abc3',  '2_2',  '2021-06-04 06:30:00.000001', '2021-06-04 07:44:59.927999']]
df = pd.DataFrame(data, columns=['vehicle', 'order', 'work', 'Start', 'Finish'])
# Start and Finish are meant to be datetime64[ns], not str
df[['Start', 'Finish']] = df[['Start', 'Finish']].apply(pd.to_datetime)

I want to find the time between two consecutive works. For example, for vehicle abc and order abc1, I want the time between the finish of work 1_1 and the start of work 1_2. I calculate this separately for each distinct order.

  vehicle  order    work             Start                           Finish
0     abc  abc1     1_1        2021-06-01 06:00:00.035999     2021-06-02 09:59:59.964000
1     abc  abc1     1_2        2021-06-01 06:00:00.035999     2021-06-02 09:59:59.964000
2     abc  abc2     1_1        2021-06-01 06:00:00.035999     2021-06-01 20:59:59.964001
3     abc  abc2     1_2        2021-06-01 06:00:00.035999     2021-06-01 20:59:59.964001
4     abc  abc3     1_1        2021-06-01 06:00:00.035999     2021-06-03 06:29:59.964000
5     abc  abc3     1_2        2021-06-01 06:00:00.035999     2021-06-03 06:29:59.964000
6     abc  abc3     2_1        2021-06-04 06:30:00.000001     2021-06-04 07:44:59.927999
7     abc  abc3     2_2        2021-06-04 06:30:00.000001     2021-06-04 07:44:59.927999

I have written the following code for this, and it works.

import datetime as dt

po_unique = df['order'].unique()
appended_data = []
for pos in po_unique:
    # rows of the current order, re-indexed from 0
    x1 = df.copy()
    x1 = x1.loc[x1['order'] == pos, :]
    x1.reset_index(drop = True, inplace = True)
    # gap in days between each work's Start and the previous work's Finish
    aList = []
    for i in range(len(x1) - 1):
        t = (x1.Start[i + 1] - x1.Finish[i])/ dt.timedelta(hours=24)
        aList.append(t)
    aList.insert(0, 0)  # the first work of an order has no predecessor
    x2 = x1.copy()
    x2['flag'] = aList
    appended_data.append(x2)
appended_data = pd.concat(appended_data)

I would like some feedback on this code. Is there an alternative way to do this? The output of appended_data[['order', 'work', 'flag']] looks like this:

Out[112]: 
  order work      flag
0  abc1  1_1  0.000000
1  abc1  1_2 -1.166666
0  abc2  1_1  0.000000
1  abc2  1_2 -0.624999
0  abc3  1_1  0.000000
1  abc3  1_2 -2.020833
2  abc3  2_1  1.000000
3  abc3  2_2 -0.052082

Now I want to create another column, flag1, that contains 'F' whenever the value in the flag column is greater than some threshold. I can do this with .apply(), like so:

threshold = 0.9
appended_data['flag1'] = appended_data.apply(lambda row: 'F' if row['flag'] > threshold else ' ', axis = 1)

But what should I do if I want to flag against two different thresholds: one for "inside" transitions, where the prefix stays the same (like 1_1 to 1_2), and another for "outside" transitions, where the prefix changes (like 1_2 to 2_1)? Say:

threshold_sameprefix = -1.0
threshold_diffprefix = 0.8

Expected output

    vehicle order  work      flag     flag1
     abc     abc1  1_1     0.000000      
     abc     abc1  1_2     -1.166666      
     abc     abc2  1_1     0.000000      
     abc     abc2  1_2     -0.624999     F1 
     abc     abc3  1_1     0.000000      
     abc     abc3  1_2     -2.020833      
     abc     abc3  2_1     1.000000      F2
     abc     abc3  2_2     -0.052082     F1

Please do not simply take the minimum of the two thresholds and reuse the logic I applied above. I want logic that assigns the flags in an iterative way, so that I can customize it.

Answer 1

Let's approach this in the following steps:

1) Split the work id into two parts, work_prefix and work_suffix:

df[['work_prefix', 'work_suffix']] = df['work'].str.split('_', expand=True)
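
As a small illustration of what this split produces, here is a minimal sketch on a stand-alone Series. The sample values and the variable names work and parts are assumptions made for the example; only the "<prefix>_<suffix>" shape of the ids comes from the question.

```python
import pandas as pd

# Hypothetical sample of work ids in the "<prefix>_<suffix>" shape used above.
work = pd.Series(['1_1', '1_2', '2_1', '2_2'], name='work')

# expand=True returns a DataFrame with one column per piece of the split.
parts = work.str.split('_', expand=True)
parts.columns = ['work_prefix', 'work_suffix']

print(parts)
```

Note that both pieces are strings, not integers, so the later prefix comparisons are string comparisons. If an id could ever contain more than one underscore, passing n=1 to str.split would keep the result at exactly two columns.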

2) Then, define a set of boolean masks corresponding to the conditions. Because the shift is done through .groupby(), each row is compared only with the previous row of the same order, so the masks respect order boundaries:

threshold_sameprefix = -1.0       # given threshold value
threshold_diffprefix = 0.8        # given threshold value

w_ne = df['work'] != df.groupby('order')['work'].shift()          # work id changed
wp_eq = df['work_prefix'] == df.groupby('order')['work_prefix'].shift()   # same work prefix
wp_ne = df['work_prefix'] != df.groupby('order')['work_prefix'].shift()   # different work prefix

m1 = w_ne & wp_eq & (df['flag'] > threshold_sameprefix)       # condition for 'F1'
m2 = w_ne & wp_ne & (df['flag'] > threshold_diffprefix)       # condition for 'F2'
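
One detail of these masks is worth seeing on a tiny frame: the grouped shift yields a missing value on the first row of every order, so that row counts as "different prefix" in wp_ne. With the thresholds given here this is harmless, because flag is 0 on those rows and 0 is not greater than 0.8, but a negative threshold_diffprefix would flag them as F2. The sketch below shows the behaviour and one possible guard; the names prev_prefix, first_in_order and diff_prefix are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical two-order frame with prefixes already split out.
df = pd.DataFrame({
    'order':       ['abc2', 'abc2', 'abc3', 'abc3', 'abc3', 'abc3'],
    'work_prefix': ['1',    '1',    '1',    '1',    '2',    '2'],
})

# Previous prefix within the same order; missing on each order's first row.
prev_prefix = df.groupby('order')['work_prefix'].shift()
first_in_order = prev_prefix.isna()

wp_eq = df['work_prefix'] == prev_prefix   # False on first rows
wp_ne = df['work_prefix'] != prev_prefix   # True on first rows as well

# Guarded variant: a prefix change only counts when a previous row exists.
diff_prefix = wp_ne & ~first_in_order

print(pd.DataFrame({'prev': prev_prefix, 'wp_eq': wp_eq,
                    'wp_ne': wp_ne, 'diff_prefix': diff_prefix}))
```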

3) Finally, use .loc with the boolean masks to set flag1 to F1 or F2:

df['flag1'] = ' '               # init flag1 to blank
df.loc[m1, 'flag1'] = 'F1'
df.loc[m2, 'flag1'] = 'F2'

Input:

  vehicle order work      flag
0     abc  abc1  1_1  0.000000
1     abc  abc1  1_2 -1.166666
2     abc  abc2  1_1  0.000000
3     abc  abc2  1_2 -0.624999
4     abc  abc3  1_1  0.000000
5     abc  abc3  1_2 -2.020833
6     abc  abc3  2_1  1.000000
7     abc  abc3  2_2 -0.052082

Output:

  vehicle order work      flag work_prefix work_suffix flag1
0     abc  abc1  1_1  0.000000           1           1      
1     abc  abc1  1_2 -1.166666           1           2      
2     abc  abc2  1_1  0.000000           1           1      
3     abc  abc2  1_2 -0.624999           1           2    F1
4     abc  abc3  1_1  0.000000           1           1      
5     abc  abc3  1_2 -2.020833           1           2      
6     abc  abc3  2_1  1.000000           2           1    F2
7     abc  abc3  2_2 -0.052082           2           2    F1

Optionally, you can remove the two working columns work_prefix and work_suffix:

df = df.drop(['work_prefix', 'work_suffix'], axis=1)
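
If more flags are likely to be added later, the same rules can be written as a list of conditions and labels. The following is a sketch of that variant using numpy.select rather than the .loc assignments above; it is an alternative, not the method shown in the steps. It assumes the flag values from the Input table, leaves out the w_ne check on the assumption that consecutive rows of an order always have different work ids, and uses the helper names prefix, prev_prefix and has_prev for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'order': ['abc1', 'abc1', 'abc2', 'abc2', 'abc3', 'abc3', 'abc3', 'abc3'],
    'work':  ['1_1', '1_2', '1_1', '1_2', '1_1', '1_2', '2_1', '2_2'],
    'flag':  [0.0, -1.166666, 0.0, -0.624999, 0.0, -2.020833, 1.0, -0.052082],
})
threshold_sameprefix = -1.0
threshold_diffprefix = 0.8

# Prefix of each work id and the prefix of the previous row in the same order.
prefix = df['work'].str.split('_').str[0]
prev_prefix = prefix.groupby(df['order']).shift()
has_prev = prev_prefix.notna()          # False on the first row of each order

# One (condition, label) pair per rule; add more pairs to customize.
conditions = [
    has_prev & (prefix == prev_prefix) & (df['flag'] > threshold_sameprefix),
    has_prev & (prefix != prev_prefix) & (df['flag'] > threshold_diffprefix),
]
labels = ['F1', 'F2']

# The first matching condition wins; rows matching none get a blank.
df['flag1'] = np.select(conditions, labels, default=' ')
print(df)
```

Because the two conditions here are mutually exclusive, the "first match wins" rule of numpy.select gives the same result as the sequential .loc assignments, where the last assignment would win on overlapping masks.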

Bonus code

The first column, flag, can be computed more efficiently without looping. First set up the data and convert Start and Finish to datetimes:

import pandas as pd

data = [['abc', 'abc1', '1_1',    '2021-06-01 06:00:00.035999', '2021-06-02 09:59:59.964000'],
 ['abc',  'abc1',  '1_2',  '2021-06-01 06:00:00.035999', '2021-06-02 09:59:59.964000'],
 ['abc',  'abc2',  '1_1',  '2021-06-01 06:00:00.035999', '2021-06-01 20:59:59.964001'],
 ['abc',  'abc2',  '1_2',  '2021-06-01 06:00:00.035999', '2021-06-01 20:59:59.964001'],
 ['abc',  'abc3',  '1_1',  '2021-06-01 06:00:00.035999', '2021-06-03 06:29:59.964000'],
 ['abc',  'abc3',  '1_2',  '2021-06-01 06:00:00.035999', '2021-06-03 06:29:59.964000'],
 ['abc',  'abc3',  '2_1',  '2021-06-04 06:30:00.000001', '2021-06-04 07:44:59.927999'],
 ['abc',  'abc3',  '2_2',  '2021-06-04 06:30:00.000001', '2021-06-04 07:44:59.927999']]
df = pd.DataFrame(data, columns = ['vehicle', 'order', 'work', 'Start', 'Finish'])

df['Start'] = pd.to_datetime(df['Start'])
df['Finish'] = pd.to_datetime(df['Finish'])

Main code to replace your loop:

df['flag'] = ((df['Start'] - df.groupby('order')['Finish'].shift()) / pd.Timedelta(days=1)).fillna(0)

Result:

print(df)

  vehicle order work                      Start                     Finish      flag
0     abc  abc1  1_1 2021-06-01 06:00:00.035999 2021-06-02 09:59:59.964000  0.000000
1     abc  abc1  1_2 2021-06-01 06:00:00.035999 2021-06-02 09:59:59.964000 -1.166666
2     abc  abc2  1_1 2021-06-01 06:00:00.035999 2021-06-01 20:59:59.964001  0.000000
3     abc  abc2  1_2 2021-06-01 06:00:00.035999 2021-06-01 20:59:59.964001 -0.624999
4     abc  abc3  1_1 2021-06-01 06:00:00.035999 2021-06-03 06:29:59.964000  0.000000
5     abc  abc3  1_2 2021-06-01 06:00:00.035999 2021-06-03 06:29:59.964000 -2.020833
6     abc  abc3  2_1 2021-06-04 06:30:00.000001 2021-06-04 07:44:59.927999  1.000000
7     abc  abc3  2_2 2021-06-04 06:30:00.000001 2021-06-04 07:44:59.927999 -0.052082
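
To see where the individual numbers come from, here is a minimal sketch that repeats the calculation for the first three rows of order abc3 only, with the intermediate steps split out. The timestamps are copied from the data above; the names prev_finish and gap are introduced just for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'order': ['abc3', 'abc3', 'abc3'],
    'work':  ['1_1', '1_2', '2_1'],
    'Start': pd.to_datetime(['2021-06-01 06:00:00.035999',
                             '2021-06-01 06:00:00.035999',
                             '2021-06-04 06:30:00.000001']),
    'Finish': pd.to_datetime(['2021-06-03 06:29:59.964000',
                              '2021-06-03 06:29:59.964000',
                              '2021-06-04 07:44:59.927999']),
})

# Finish of the previous row in the same order (NaT on the first row).
prev_finish = df.groupby('order')['Finish'].shift()

# Timedelta between this row's Start and the previous row's Finish.
gap = df['Start'] - prev_finish

# Express the gap in days; the first row has no predecessor, so use 0.
df['flag'] = (gap / pd.Timedelta(days=1)).fillna(0)
print(df[['work', 'flag']])
```

For work 2_1 the gap is one day plus about 0.036 seconds, which is why it displays as 1.000000; for work 1_2 the start lies roughly 2.02 days before the previous finish, hence the negative value.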

solution1: 1 ACCEPTED 2021-06-27 20:49:38
