How to compare and then concatenate information from two different rows using a Python pandas DataFrame

Question

I wrote this code:

import pandas as pd
import numpy as np

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
     'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
     'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
     'Z' : pd.Series(['First',' ','Last','First',' ','Last'], ['1','2','3','4','5','6'])}

output_table = pd.DataFrame(input_table)

output_table['Previous_Y'] = output_table['Y']

output_table.Previous_Y = output_table.Previous_Y.shift(1)

def Calc_flowpath(x):
    if x['Z'] == 'First':
        return x['Y']
    else:
        return x['Previous_Y'] + x['Y']           

output_table['Flowpath'] = output_table.apply(Calc_flowpath, axis=1)

print(output_table)

My output is (as expected):

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B       BC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       EE

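However, what I am actually trying to do with the Flowpath function is:

If column Z is 'First', Flowpath = column Y.

If column Z is anything else, Flowpath = previous Flowpath value + column Y.

Unless column Y repeats the same value as the previous row, in which case that row is skipped (its Flowpath stays the same as the previous row's).

My target output is: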

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

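For context, these rows are steps in a manufacturing process, and I am trying to describe the path that material takes through the shop floor. My data is a large number of customer orders together with every step each order went through during manufacturing. Column Y is the manufacturing step, and column Z marks the first and last step of each order. I am using KNIME for the analysis but could not find a node that does this, so I am trying to write the Python script myself even though I am new to programming (as you can probably tell). In my previous job I would have done this in Alteryx with the Multi-Row node, but I no longer have access to that software. I have spent a lot of time reading the pandas documentation, and I think the solution is some combination of DataFrame.loc, DataFrame.shift, or DataFrame.cumsum, but I cannot figure it out.

Any help would be greatly appreciated.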

Answer 1

Loop over the rows of the DataFrame and set the value of the Flowpath column according to the logic outlined in the OP.

import pandas as pd

output_table = pd.DataFrame({'W' :[1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
                             'X': [7., 8., 9., 10., 11., 12.],
                             'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                             'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                            index=range(1, 7))

output_table['Flowpath'] = ''

for idx in output_table.index:
    this_Z = output_table.loc[idx, 'Z']
    this_Y = output_table.loc[idx, 'Y']
    last_Y = output_table.loc[idx-1, 'Y'] if idx > 1 else ''
    last_Flowpath = output_table.loc[idx-1, 'Flowpath'] if idx > 1 else ''

    if this_Z == 'First':
        output_table.loc[idx, 'Flowpath'] = this_Y
    elif this_Y != last_Y:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath + this_Y
    else:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath
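
The same row-by-row logic can also be sketched without index arithmetic, which may help when the index is not a run of consecutive integers. This is only an illustration under assumptions: `build_flowpath` is a hypothetical helper name, not part of pandas or of the answer above, and it builds the values in a plain Python list and assigns the column once.

```python
import pandas as pd

def build_flowpath(steps, markers):
    """Hypothetical helper: return the cumulative path for each row."""
    paths = []
    prev_step = None   # step seen on the previous row
    current = ''       # path accumulated so far for the current order
    for step, marker in zip(steps, markers):
        if marker == 'First':
            current = step            # a new order starts: reset the path
        elif step != prev_step:
            current = current + step  # a new step: extend the path
        # otherwise the step repeats and the path is left unchanged
        paths.append(current)
        prev_step = step
    return paths

output_table = pd.DataFrame({'W': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
                             'X': [7., 8., 9., 10., 11., 12.],
                             'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                             'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                            index=range(1, 7))

output_table['Flowpath'] = build_flowpath(output_table['Y'], output_table['Z'])
print(output_table)
```

Because the helper only iterates over the two columns, it does not care whether the index labels are integers or strings.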

Answer 2

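You can compute a group variable by taking the cumsum of the condition vector Z == 'First', which covers the first and second conditions, and then replace every Y value that is the same as the previous one with an empty string, so that a cumsum over column Y gives the expected output: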

import pandas as pd
# calculate the group variable
grp = (output_table.Z == "First").cumsum()

# calculate a condition vector that is True where the current Y differs from the previous one
# (group_keys=False keeps the original index so the mask aligns with Y on newer pandas)
dup = output_table.Y.groupby(grp, group_keys=False).apply(lambda g: g.shift() != g)

# replace the duplicated process in Y as empty string, group the column by the group variable
# calculated above and then do a cumulative sum
output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).cumsum()

output_table

#     W X   Y       Z   flowPath
# 1 1.1 7   A   First          A
# 2 2.1 8   B                 AB
# 3 3.1 9   C   Last         ABC
# 4 4.1 10  D   First          D
# 5 5.1 11  E                 DE
# 6 6.1 12  E   Last          DE

Update: the code above works in pandas 0.15.2 but not in 0.18.1; a small tweak to the last line, as follows, rescues it:

output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp, group_keys=False).apply(pd.Series.cumsum)
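
Since the behaviour of a grouped cumsum on strings has varied between pandas versions, the accumulation step can also be done with the standard library. The following is a sketch under assumptions rather than the answer's own method: it swaps the string cumsum for `itertools.accumulate`, the names `keep`, `pieces`, and `parts` are made up for illustration, and it assumes the index labels are unique so that the final assignment aligns correctly.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                   'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                  index=range(1, 7))

# One group per order: the counter increases at every 'First' row.
is_first = df['Z'].eq('First')
grp = is_first.cumsum()

# Keep a step if it starts an order or differs from the previous row's step.
keep = is_first | df['Y'].ne(df['Y'].shift())

# Repeated steps contribute an empty string to the path.
pieces = df['Y'].where(keep, '')

# Concatenate the pieces cumulatively inside each group using plain Python strings.
parts = [pd.Series(list(accumulate(g)), index=g.index)
         for _, g in pieces.groupby(grp)]
df['flowPath'] = pd.concat(parts)

print(df)
```

The masks are still computed with vectorised pandas operations; only the string concatenation runs in Python, once per group.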

Answer 3

# DataFrame.set_value was removed in pandas 1.0; .at is the scalar accessor that replaces it.
for index, row in output_table.iterrows():
    prev_index = str(int(index) - 1)  # index labels are the strings '1'..'6'
    if row['Z'] == 'First':
        output_table.at[index, 'Flowpath'] = row['Y']
    elif output_table.at[prev_index, 'Y'] == row['Y']:
        output_table.at[index, 'Flowpath'] = output_table.at[prev_index, 'Flowpath']
    else:
        output_table.at[index, 'Flowpath'] = output_table.at[prev_index, 'Flowpath'] + row['Y']

print(output_table)

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

Answer 4

Bad things will happen if Z['1'] != 'First', but for your case this works. I understand you wanted something more pandas-like, so I apologize that this answer is so plain.

import pandas as pd
import numpy as np

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
     'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
     'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
     'Z' : pd.Series(['First',' ','Last','First',' ','Last'], index =['1','2','3','4','5','6'])}

ret = pd.Series([None,None,None,None,None,None], index = ['1','2','3','4','5','6'])
for k in [str(n) for n in range(1,7)]:
    if(input_table['Z'][k]=='First'):
        op = input_table['Y'][k]
    else:
        if(input_table['Y'][k]==input_table['Y'][str(int(k)-1)]):
            op = ret[str(int(k)-1)]
        else:
            op = ret[str(int(k)-1)]+input_table['Y'][k]
    ret[k]=op

input_table['Flowpath'] = ret
output_table = pd.DataFrame(input_table)
print(output_table)

This prints:

  Flowpath    W   X  Y      Z
1        A  1.1   7  A  First
2       AB  2.1   8  B       
3      ABC  3.1   9  C   Last
4        D  4.1  10  D  First
5       DE  5.1  11  E       
6       DE  6.1  12  E   Last
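
The caveat at the top of this answer (a first row whose Z is not 'First' makes the loop look up a row that does not exist) could be checked up front. A minimal sketch, assuming you would rather fail early with a clear message; `require_first_marker` is a hypothetical name, not a pandas function.

```python
import pandas as pd

def require_first_marker(z):
    """Hypothetical guard: raise unless the first row starts an order."""
    if len(z) == 0 or z.iloc[0] != 'First':
        raise ValueError("the first row must have Z == 'First'")

z_ok = pd.Series(['First', ' ', 'Last'], index=['1', '2', '3'])
z_bad = pd.Series([' ', 'Last', 'First'], index=['1', '2', '3'])

require_first_marker(z_ok)  # passes silently

try:
    require_first_marker(z_bad)
    guard_tripped = False
except ValueError:
    guard_tripped = True    # the malformed input is rejected before the loop runs
```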
