How to use a Python pandas DataFrame to compare and then concatenate information from two different rows

Question

I have written this code:

import pandas as pd
import numpy as np

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
     'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
     'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
     'Z' : pd.Series(['First',' ','Last','First',' ','Last'], ['1','2','3','4','5','6'])}

output_table = pd.DataFrame(input_table)

output_table['Previous_Y'] = output_table['Y']

output_table.Previous_Y = output_table.Previous_Y.shift(1)

def Calc_flowpath(x):
    if x['Z'] == 'First':
        return x['Y']
    else:
        return x['Previous_Y'] + x['Y']           

output_table['Flowpath'] = output_table.apply(Calc_flowpath, axis=1)

print(output_table)

And my output is (as expected):

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B       BC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       EE

However, what I'm trying to do with the Flowpath function is:

If column Z is "First", Flowpath = column Y.

If column Z is anything else, Flowpath = previous Flowpath value + column Y.

Unless column Y repeats the same value as the previous row, in which case skip that row (carry the previous Flowpath over unchanged).

The output I am targeting is:

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

To give context, these rows are steps in a manufacturing process, and I'm trying to describe the path materials take through a job shop. My data is a large number of customer orders and every step they took in the manufacturing process. Y is the manufacturing step, and column Z indicates the first and last step for each order. I'm using Knime to do the analysis but I can't find a node that will do this, so I'm trying to write a Python script myself even though I'm quite the programming novice (as you can probably see). In my previous job, I would have done this in Alteryx using the Multi-Row node, but I no longer have access to that software. I've spent a lot of time reading the pandas documentation and I feel the solution is some combination of DataFrame.loc, DataFrame.shift, or DataFrame.cumsum, but I can't figure it out.

Any help would be greatly appreciated.

Answer 1

Iterate over the rows of your DataFrame and set the value of the Flowpath column following the logic you outline in the question.

import pandas as pd

output_table = pd.DataFrame({'W' :[1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
                             'X': [7., 8., 9., 10., 11., 12.],
                             'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                             'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                            index=range(1, 7))

output_table['Flowpath'] = ''

for idx in output_table.index:
    this_Z = output_table.loc[idx, 'Z']
    this_Y = output_table.loc[idx, 'Y']
    last_Y = output_table.loc[idx-1, 'Y'] if idx > 1 else ''
    last_Flowpath = output_table.loc[idx-1, 'Flowpath'] if idx > 1 else ''

    if this_Z == 'First':
        output_table.loc[idx, 'Flowpath'] = this_Y
    elif this_Y != last_Y:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath + this_Y
    else:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath
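
The loop above looks up the previous row with idx-1, so it relies on the index being consecutive integers, whereas the question's table uses string labels. As a minimal sketch of the same logic that does not depend on the index at all, the rows can be walked in order while keeping the running path in ordinary variables. The function name build_flowpath is hypothetical and not part of pandas.

```python
def build_flowpath(steps, markers):
    """Return the cumulative flow path for each row, in row order."""
    paths = []
    path = ''          # flow path accumulated so far for the current order
    prev_step = None   # step seen on the previous row
    for step, marker in zip(steps, markers):
        if marker == 'First':
            path = step        # a new order starts: reset the path
        elif step != prev_step:
            path += step       # a new step: append it to the path
        # otherwise the step repeats and the path is carried over unchanged
        paths.append(path)
        prev_step = step
    return paths


# The Y and Z columns from the question, written as plain lists
steps = ['A', 'B', 'C', 'D', 'E', 'E']
markers = ['First', ' ', 'Last', 'First', ' ', 'Last']
flowpaths = build_flowpath(steps, markers)
print(flowpaths)
```

Because the function only iterates over its two arguments, it should also accept the DataFrame columns directly, for example output_table['Flowpath'] = build_flowpath(output_table['Y'], output_table['Z']).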

Answer 2

You can compute a group variable by taking the cumulative sum of the condition Z == "First", which takes care of the first and second rules. Then replace every Y value that equals the previous one with an empty string, so that a cumulative sum over the Y column within each group gives the expected output:

import pandas as pd
# calculate the group variable: it increases by one at every "First" row
grp = (output_table.Z == "First").cumsum()

# calculate a condition vector that is True where the current Y differs from
# the previous Y within the same group (False marks a repeated step)
dup = output_table.Y.groupby(grp).shift() != output_table.Y

# replace the duplicated process in Y as empty string, group the column by the group variable
# calculated above and then do a cumulative sum
output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).cumsum()

output_table

#     W X   Y       Z   flowPath
# 1 1.1 7   A   First          A
# 2 2.1 8   B                 AB
# 3 3.1 9   C   Last         ABC
# 4 4.1 10  D   First          D
# 5 5.1 11  E                 DE
# 6 6.1 12  E   Last          DE

Update: The code above works under pandas 0.15.2 but not 0.18.1; a small tweak to the last line, as follows, fixes it:

output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).apply(pd.Series.cumsum)
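
Since this answer already had to change between pandas versions, one option is to keep the grouping idea but build the string inside each group with plain Python instead of a cumulative sum over strings. The following is a sketch under that assumption, not the answer author's method. The helper name path_within_order is hypothetical, and the table is rebuilt here with only the Y and Z columns so the example is self-contained.

```python
import pandas as pd

# Only the two columns the logic needs, with the question's string index
output_table = pd.DataFrame(
    {'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
     'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
    index=['1', '2', '3', '4', '5', '6'])


def path_within_order(steps):
    """Build the running path for one order, skipping repeated steps."""
    paths = []
    path = ''
    prev_step = None
    for step in steps:
        if step != prev_step:
            path += step
        paths.append(path)
        prev_step = step
    # Return a Series aligned with the group's own index
    return pd.Series(paths, index=steps.index)


# Group variable: increases by one at every "First" row
grp = (output_table['Z'] == 'First').cumsum()

# transform returns one value per input row, aligned with the original index
output_table['Flowpath'] = output_table.groupby(grp)['Y'].transform(path_within_order)
print(output_table)
```

Rows that appear before the first "First" marker fall into group 0 and are treated as one order of their own.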

Answer 3

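# Assumes output_table from the question: string index '1'..'6' and an
# existing 'Flowpath' column. .at reads or writes a single cell by label.
for index, row in output_table.iterrows():
    prev_index = str(int(index) - 1)
    if row['Z'] == 'First':
        # first step of an order: start a new path
        output_table.at[index, 'Flowpath'] = row['Y']
    elif output_table.at[prev_index, 'Y'] == row['Y']:
        # repeated step: carry the previous path over unchanged
        output_table.at[index, 'Flowpath'] = output_table.at[prev_index, 'Flowpath']
    else:
        # new step: append it to the previous path
        output_table.at[index, 'Flowpath'] = output_table.at[prev_index, 'Flowpath'] + row['Y']

print(output_table)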

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

Answer 4

Bad things will happen if Z['1'] != 'First', but for your case this works. I understand you want something more pandas-ish, so I'm sorry that this answer is pretty plain Python...

import pandas as pd
import numpy as np

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
     'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
     'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
     'Z' : pd.Series(['First',' ','Last','First',' ','Last'], index =['1','2','3','4','5','6'])}

ret = pd.Series([None,None,None,None,None,None], index = ['1','2','3','4','5','6'])
for k in [str(n) for n in range(1,7)]:
    if(input_table['Z'][k]=='First'):
        op = input_table['Y'][k]
    else:
        if(input_table['Y'][k]==input_table['Y'][str(int(k)-1)]):
            op = ret[str(int(k)-1)]
        else:
            op = ret[str(int(k)-1)]+input_table['Y'][k]
    ret[k]=op

input_table['Flowpath'] = ret
output_table = pd.DataFrame(input_table)
print(output_table)

Prints:

  Flowpath    W   X  Y      Z
1        A  1.1   7  A  First
2       AB  2.1   8  B       
3      ABC  3.1   9  C   Last
4        D  4.1  10  D  First
5       DE  5.1  11  E       
6       DE  6.1  12  E   Last
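
The caveat at the top of this answer (the loop breaks when the first row is not marked 'First') could be turned into an explicit check that runs before the loop. This is only a sketch; the function name check_first_marker is hypothetical.

```python
def check_first_marker(markers):
    """Raise ValueError unless the first row is marked 'First'."""
    markers = list(markers)
    if not markers:
        raise ValueError('no rows to process')
    if markers[0] != 'First':
        raise ValueError(
            "first row is marked %r, expected 'First'" % (markers[0],))


check_first_marker(['First', ' ', 'Last'])   # valid data: passes silently

try:
    check_first_marker([' ', 'Last'])        # data that would break the loop
except ValueError as err:
    message = str(err)
    print(message)
```

With the answer's data, the call would be check_first_marker(input_table['Z']).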


Answer 1
1 2016-08-13 16:56:53

Answer 2
1 2016-08-13 17:07:33

Answer 3
1 2016-08-13 18:25:47

Answer 4
0 2016-08-13 17:04:54
