How to compare and then concatenate information from two different rows using python pandas data frames

Question

I have written this code:

import pandas as pd
import numpy as np

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
     'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
     'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
     'Z' : pd.Series(['First',' ','Last','First',' ','Last'], ['1','2','3','4','5','6'])}

output_table = pd.DataFrame(input_table)

output_table['Previous_Y'] = output_table['Y']

output_table.Previous_Y = output_table.Previous_Y.shift(1)

def Calc_flowpath(x):
    if x['Z'] == 'First':
        return x['Y']
    else:
        return x['Previous_Y'] + x['Y']           

output_table['Flowpath'] = output_table.apply(Calc_flowpath, axis=1)

print output_table

And my output is (as expected):

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B       BC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       EE

However, what I'm trying to do with the Flowpath function is:

If Column Z is "First", Flowpath = Column Y

If Column Z is anything else, Flowpath = Previous Flowpath value + Column Y

Unless Column Y repeats the same value, in which case skip that row.

The output I am targeting is:

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

To give context, these lines are steps in a manufacturing process, and I'm trying to describe the path materials take through a job shop. My data is a large number of customer orders and every step they took in the manufacturing process. Y is the manufacturing step, and column Z indicates the first and last step for each order. I'm using Knime to do the analysis but I can't find a node that will do this, so I'm trying to write a python script myself even though I'm quite the programming novice (as you can probably see). In my previous job, I would have done this in Alteryx using the Multi-Row node but I no longer have access to that software. I've spent a lot of time reading the Pandas documentation and I feel the solution is some combination of DataFrame.loc, DataFrame.shift, or DataFrame.cumsum, but I can't figure it out.

Any help would be greatly appreciated.

Answer 1

Iterate over the rows of your DataFrame and set the value of the Flowpath column following the logic you outline in the OP.

import pandas as pd

output_table = pd.DataFrame({'W' :[1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
                             'X': [7., 8., 9., 10., 11., 12.],
                             'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                             'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                            index=range(1, 7))

output_table['Flowpath'] = ''

for idx in output_table.index:
    this_Z = output_table.loc[idx, 'Z']
    this_Y = output_table.loc[idx, 'Y']
    last_Y = output_table.loc[idx-1, 'Y'] if idx > 1 else ''
    last_Flowpath = output_table.loc[idx-1, 'Flowpath'] if idx > 1 else ''

    if this_Z == 'First':
        output_table.loc[idx, 'Flowpath'] = this_Y
    elif this_Y != last_Y:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath + this_Y
    else:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath

Answer 2

You can calculate a group variable by cumsum on the condition vector where Z is first to satisfy the first and second conditions and replace the same value as previous one with empty string so that you can do cumsum on the Y column which should give the expected output:

import pandas as pd
# calculate the group varaible
grp = (output_table.Z == "First").cumsum()

# calculate a condition vector where the current Y column is the same as the previous one
dup = output_table.Y.groupby(grp).apply(lambda g: g.shift() != g)

# replace the duplicated process in Y as empty string, group the column by the group variable
# calculated above and then do a cumulative sum
output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).cumsum()

output_table

#     W X   Y       Z   flowPath
# 1 1.1 7   A   First          A
# 2 2.1 8   B                 AB
# 3 3.1 9   C   Last         ABC
# 4 4.1 10  D   First          D
# 5 5.1 11  E                 DE
# 6 6.1 12  E   Last          DE

Update : The above code works under 0.15.2 but not 0.18.1 , but a little bit tweaking with the last line as following can save it:

output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).apply(pd.Series.cumsum)

Answer 3

for index, row in output_table.iterrows():
   prev_index = str(int(index) - 1)
   if row['Z'] == 'First':
       output_table.set_value(index, 'Flowpath', row['Y'])
   elif output_table['Y'][prev_index] == row['Y']:
       output_table.set_value(index, 'Flowpath', output_table['Flowpath'][prev_index])
   else:
       output_table.set_value(index, 'Flowpath', output_table['Flowpath'][prev_index] + row['Y'])

print output_table

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

Answer 4

So bad things will happen if Z['1']!='First' but for your case this works. I understand you want something more Pandas-ish so I'm sorry that this answer is pretty plain python...

import pandas as pd
import numpy as np

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
     'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
     'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
     'Z' : pd.Series(['First',' ','Last','First',' ','Last'], index =['1','2','3','4','5','6'])}

ret = pd.Series([None,None,None,None,None,None], index = ['1','2','3','4','5','6'])
for k in [str(n) for n in range(1,7)]:
    if(input_table['Z'][k]=='First'):
        op = input_table['Y'][k]
    else:
        if(input_table['Y'][k]==input_table['Y'][str(int(k)-1)]):
            op = ret[str(int(k)-1)]
        else:
            op = ret[str(int(k)-1)]+input_table['Y'][k]
    ret[k]=op

input_table['Flowpath'] = ret
output_table = pd.DataFrame(input_table)
print output_table

Prints::

  Flowpath    W   X  Y      Z
1        A  1.1   7  A  First
2       AB  2.1   8  B       
3      ABC  3.1   9  C   Last
4        D  4.1  10  D  First
5       DE  5.1  11  E       
6       DE  6.1  12  E   Last

How to compare and then concatenate information from two different rows using python pandas data frames

Question

4 answers

solution1
1 2016-08-13 16:56:53

solution2
1 2016-08-13 17:07:33

solution3
1 2016-08-13 18:25:47

solution4
0 2016-08-13 17:04:54

How to compare and then concatenate information from two different rows using python pandas data frames

Question

4 answers

solution1 1 2016-08-13 16:56:53

solution2 1 2016-08-13 17:07:33

solution3 1 2016-08-13 18:25:47

solution4 0 2016-08-13 17:04:54

solution1
1 2016-08-13 16:56:53

solution2
1 2016-08-13 17:07:33

solution3
1 2016-08-13 18:25:47

solution4
0 2016-08-13 17:04:54