How to compare and then concatenate information from two different rows using Python pandas DataFrames

I have written this code:

import pandas as pd
import numpy as np

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
     'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
     'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
     'Z' : pd.Series(['First',' ','Last','First',' ','Last'], ['1','2','3','4','5','6'])}

output_table = pd.DataFrame(input_table)

output_table['Previous_Y'] = output_table['Y']

output_table.Previous_Y = output_table.Previous_Y.shift(1)

def Calc_flowpath(x):
    if x['Z'] == 'First':
        return x['Y']
    else:
        return x['Previous_Y'] + x['Y']           

output_table['Flowpath'] = output_table.apply(Calc_flowpath, axis=1)

print(output_table)

And my output is (as expected):

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B       BC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       EE

However, what I'm trying to do with the Flowpath function is:

If Column Z is "First", Flowpath = Column Y

If Column Z is anything else, Flowpath = Previous Flowpath value + Column Y

Unless Column Y repeats the previous row's value, in which case Flowpath keeps the previous Flowpath value unchanged.

The output I am targeting is:

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

To give context, these lines are steps in a manufacturing process, and I'm trying to describe the path materials take through a job shop. My data is a large number of customer orders and every step they took in the manufacturing process. Y is the manufacturing step, and column Z indicates the first and last step for each order. I'm using KNIME to do the analysis but I can't find a node that will do this, so I'm trying to write a Python script myself even though I'm quite the programming novice (as you can probably see). In my previous job, I would have done this in Alteryx using the Multi-Row node, but I no longer have access to that software. I've spent a lot of time reading the pandas documentation and I feel the solution is some combination of DataFrame.loc, DataFrame.shift, or DataFrame.cumsum, but I can't figure it out.

Any help would be greatly appreciated.

Iterate over the rows of your DataFrame and set the value of the Flowpath column following the logic you outline in the OP.

import pandas as pd

output_table = pd.DataFrame({'W' :[1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
                             'X': [7., 8., 9., 10., 11., 12.],
                             'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                             'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                            index=range(1, 7))

output_table['Flowpath'] = ''

for idx in output_table.index:
    this_Z = output_table.loc[idx, 'Z']
    this_Y = output_table.loc[idx, 'Y']
    last_Y = output_table.loc[idx-1, 'Y'] if idx > 1 else ''
    last_Flowpath = output_table.loc[idx-1, 'Flowpath'] if idx > 1 else ''

    if this_Z == 'First':
        output_table.loc[idx, 'Flowpath'] = this_Y
    elif this_Y != last_Y:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath + this_Y
    else:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath
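
The loop above relies on the index being the consecutive integers 1 to 6, because it looks up the previous row with idx-1. As a minimal sketch of the same rules without that assumption, the columns can be walked positionally and the result assigned in one step. The helper name build_flowpath is hypothetical (it is not part of pandas), and the string index is only there to show that the labels no longer matter.

```python
import pandas as pd


def build_flowpath(steps, markers):
    """Return the cumulative path for each row (hypothetical helper).

    steps   -- values of column Y, in row order
    markers -- values of column Z, in row order
    """
    paths = []
    last_step = ''
    last_path = ''
    for step, marker in zip(steps, markers):
        if marker == 'First':
            path = step                 # a new order starts here
        elif step != last_step:
            path = last_path + step     # extend the running path
        else:
            path = last_path            # repeated step: carry the path forward
        paths.append(path)
        last_step, last_path = step, path
    return paths


# Same Y and Z sample data as above, with an arbitrary string index.
output_table = pd.DataFrame({'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                             'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                            index=list('abcdef'))
output_table['Flowpath'] = build_flowpath(output_table['Y'], output_table['Z'])
print(output_table)
```

Collecting the values in a list and assigning the column once also avoids writing to the DataFrame cell by cell, which tends to be slow on large tables.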

You can build a group variable by taking the cumulative sum of the condition Z == "First", which covers the first and second conditions. Then replace every Y value that repeats the previous one with an empty string, so that a cumulative sum over the Y column within each group gives the expected output:

import pandas as pd
# calculate the group variable: a new group starts at every row where Z is "First"
grp = (output_table.Z == "First").cumsum()

# calculate a condition vector that is True where the current Y differs from the
# previous Y within the same group (groupby().shift() keeps the original index)
dup = output_table.Y != output_table.Y.groupby(grp).shift()

# replace the repeated steps in Y with an empty string, group the column by the group variable
# calculated above and then do a cumulative sum
output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).cumsum()

output_table

#      W   X  Y      Z flowPath
# 1  1.1   7  A  First        A
# 2  2.1   8  B              AB
# 3  3.1   9  C   Last      ABC
# 4  4.1  10  D  First        D
# 5  5.1  11  E              DE
# 6  6.1  12  E   Last       DE
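
To make the intermediate steps easier to follow, here is a small sketch that prints the group variable and the masked Y column for the sample data. It only uses columns Y and Z, and the names df, changed and masked are illustrative choices rather than anything from the answer.

```python
import pandas as pd

df = pd.DataFrame({'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                   'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                  index=range(1, 7))

# Each 'First' row bumps the counter, so every order gets its own group number.
grp = (df['Z'] == 'First').cumsum()

# True where Y differs from the previous Y in the same group.
# The first row of a group is compared against a missing value, so it is True.
changed = df['Y'] != df['Y'].groupby(grp).shift()

# Repeated steps become empty strings, so concatenating them adds nothing.
masked = df['Y'].where(changed, '')

print(pd.DataFrame({'grp': grp, 'changed': changed, 'masked': masked}))
```

For the sample data the groups are 1, 1, 1, 2, 2, 2 and only the last row is blanked out, which is why the final path for row 6 stays at DE.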

Update: The code above works under pandas 0.15.2 but not 0.18.1. A small tweak to the last line fixes it:

# group_keys=False keeps the original row index on recent pandas versions
output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp, group_keys=False).apply(pd.Series.cumsum)
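
Since the behaviour of a cumulative sum over string columns has differed between pandas versions, as the update notes, one version-independent sketch is to do the running concatenation in plain Python with itertools.accumulate and use pandas only for the grouping. This is a substitute technique, not the answer's own code, and it assumes the rows are already sorted so that each order's steps follow its 'First' row.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                   'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                  index=range(1, 7))

grp = (df['Z'] == 'First').cumsum()

# Blank out steps that repeat the previous step within the same group.
masked = df['Y'].where(df['Y'] != df['Y'].groupby(grp).shift(), '')

paths = []
# grp never decreases, so visiting the groups in key order
# also visits the rows in their original order.
for _, steps in masked.groupby(grp):
    # accumulate() uses addition by default, which concatenates strings.
    paths.extend(accumulate(steps))

df['flowPath'] = paths
print(df)
```
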

# Continues from the question's output_table (string index '1' to '6').
# DataFrame.set_value was removed in pandas 1.0, so .at is used instead.
for index, row in output_table.iterrows():
    prev_index = str(int(index) - 1)
    if row['Z'] == 'First':
        output_table.at[index, 'Flowpath'] = row['Y']
    elif output_table.at[prev_index, 'Y'] == row['Y']:
        output_table.at[index, 'Flowpath'] = output_table.at[prev_index, 'Flowpath']
    else:
        output_table.at[index, 'Flowpath'] = output_table.at[prev_index, 'Flowpath'] + row['Y']

print(output_table)

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

Bad things will happen if Z['1'] != 'First', but for your case this works. I understand you want something more pandas-ish, so I'm sorry that this answer is pretty plain Python...

import pandas as pd
import numpy as np

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
     'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
     'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
     'Z' : pd.Series(['First',' ','Last','First',' ','Last'], index =['1','2','3','4','5','6'])}

ret = pd.Series([None,None,None,None,None,None], index = ['1','2','3','4','5','6'])
for k in [str(n) for n in range(1,7)]:
    if(input_table['Z'][k]=='First'):
        op = input_table['Y'][k]
    else:
        if(input_table['Y'][k]==input_table['Y'][str(int(k)-1)]):
            op = ret[str(int(k)-1)]
        else:
            op = ret[str(int(k)-1)]+input_table['Y'][k]
    ret[k]=op

input_table['Flowpath'] = ret
output_table = pd.DataFrame(input_table)
print(output_table)

Prints:

  Flowpath    W   X  Y      Z
1        A  1.1   7  A  First
2       AB  2.1   8  B       
3      ABC  3.1   9  C   Last
4        D  4.1  10  D  First
5       DE  5.1  11  E       
6       DE  6.1  12  E   Last
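
The caveat about the first row can be turned into an explicit check that runs before the loop. The sketch below is one possible guard, assuming that data whose first row is not marked 'First' should be rejected rather than processed; the function name check_starts_with_first is hypothetical.

```python
import pandas as pd


def check_starts_with_first(z):
    """Raise ValueError if a non-empty marker column does not start with 'First'.

    Hypothetical helper; z is the Z column as a pandas Series.
    """
    if len(z) > 0 and z.iloc[0] != 'First':
        raise ValueError("the first row must have Z == 'First'")


good = pd.Series(['First', ' ', 'Last'], index=['1', '2', '3'])
bad = pd.Series([' ', 'Last', 'First'], index=['1', '2', '3'])

check_starts_with_first(good)      # passes silently

try:
    check_starts_with_first(bad)
except ValueError as err:
    print('rejected:', err)
```

Using .iloc[0] checks the first row by position, so the guard does not depend on the index labels being '1', '2', and so on.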
