How to compare and then concatenate information from two different rows using python pandas data frames

I have written this code:

import pandas as pd
import numpy as np

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
     'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
     'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
     'Z' : pd.Series(['First',' ','Last','First',' ','Last'], ['1','2','3','4','5','6'])}

output_table = pd.DataFrame(input_table)

output_table['Previous_Y'] = output_table['Y']

output_table.Previous_Y = output_table.Previous_Y.shift(1)

def Calc_flowpath(x):
    if x['Z'] == 'First':
        return x['Y']
    else:
        return x['Previous_Y'] + x['Y']           

output_table['Flowpath'] = output_table.apply(Calc_flowpath, axis=1)

print(output_table)

And my output is (as expected):

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B       BC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       EE

However, what I'm trying to do with the Flowpath function is:

If Column Z is "First", Flowpath = Column Y

If Column Z is anything else, Flowpath = Previous Flowpath value + Column Y

Unless Column Y repeats the previous row's value, in which case that row adds nothing and the previous Flowpath value is carried over unchanged.

The output I am targeting is:

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

To give context, these rows are steps in a manufacturing process, and I'm trying to describe the path materials take through a job shop. My data is a large number of customer orders and every step they took in the manufacturing process. Y is the manufacturing step, and column Z indicates the first and last step for each order. I'm using Knime to do the analysis but I can't find a node that will do this, so I'm trying to write a python script myself even though I'm quite the programming novice (as you can probably see). In my previous job, I would have done this in Alteryx using the Multi-Row node but I no longer have access to that software. I've spent a lot of time reading the Pandas documentation and I feel the solution is some combination of DataFrame.loc, DataFrame.shift, or DataFrame.cumsum, but I can't figure it out.

Any help would be greatly appreciated.

Iterate over the rows of your DataFrame and set the value of the Flowpath column following the logic you outline in the OP.

import pandas as pd

output_table = pd.DataFrame({'W' :[1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
                             'X': [7., 8., 9., 10., 11., 12.],
                             'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                             'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                            index=range(1, 7))

output_table['Flowpath'] = ''

for idx in output_table.index:
    this_Z = output_table.loc[idx, 'Z']
    this_Y = output_table.loc[idx, 'Y']
    last_Y = output_table.loc[idx-1, 'Y'] if idx > 1 else ''
    last_Flowpath = output_table.loc[idx-1, 'Flowpath'] if idx > 1 else ''

    if this_Z == 'First':
        output_table.loc[idx, 'Flowpath'] = this_Y
    elif this_Y != last_Y:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath + this_Y
    else:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath

You can build a group variable by taking the cumulative sum of the condition vector Z == "First", which takes care of the first and second conditions. Then replace every Y value that equals the previous one with an empty string, so that a grouped cumsum on the Y column gives the expected output:

import pandas as pd
# calculate the group variable
grp = (output_table.Z == "First").cumsum()

# calculate a condition vector that is True where the current Y differs from the previous Y
# within the same group (groupby(...).shift() keeps the original index, unlike apply in newer pandas)
dup = output_table.Y != output_table.Y.groupby(grp).shift()

# replace the duplicated process in Y as empty string, group the column by the group variable
# calculated above and then do a cumulative sum
output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).cumsum()

output_table

#     W X   Y       Z   flowPath
# 1 1.1 7   A   First          A
# 2 2.1 8   B                 AB
# 3 3.1 9   C   Last         ABC
# 4 4.1 10  D   First          D
# 5 5.1 11  E                 DE
# 6 6.1 12  E   Last          DE

Update: The above code works under 0.15.2 but not under 0.18.1. A small tweak to the last line, as follows, fixes it:

output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).apply(pd.Series.cumsum)
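
Since the grouped string cumsum above behaves differently across pandas versions, here is a minimal sketch of the same idea that does the running concatenation with itertools.accumulate from the standard library instead. The names df, prev_step, steps and parts are illustrative only and not from the answer, and the W and X columns are left out for brevity.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                   'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                  index=range(1, 7))

# One group per order: the counter increases at every 'First' row.
grp = (df['Z'] == 'First').cumsum()

# Previous step within the same order; '' stands in for "no previous step".
prev_step = df['Y'].groupby(grp).shift().fillna('')

# Blank out steps that repeat the previous step so they add nothing to the path.
steps = df['Y'].where(df['Y'] != prev_step, '')

# Running string concatenation inside each group, done in plain Python.
parts = [pd.Series(list(accumulate(s)), index=s.index)
         for _, s in steps.groupby(grp)]
df['Flowpath'] = pd.concat(parts)

print(df)
```

The assignment aligns on the index, so the order of the concatenated pieces does not matter. This sketch assumes an empty string never occurs as a real step name in Y.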

# Assumes output_table from the question: string index '1'..'6' and an existing Flowpath column.
# DataFrame.set_value was removed in pandas 1.0; .at is its replacement.
for index, row in output_table.iterrows():
    prev_index = str(int(index) - 1)
    if row['Z'] == 'First':
        output_table.at[index, 'Flowpath'] = row['Y']
    elif output_table.at[prev_index, 'Y'] == row['Y']:
        output_table.at[index, 'Flowpath'] = output_table.at[prev_index, 'Flowpath']
    else:
        output_table.at[index, 'Flowpath'] = output_table.at[prev_index, 'Flowpath'] + row['Y']

print(output_table)

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

So bad things will happen if Z['1'] != 'First', but for your case this works. I understand you want something more Pandas-ish, so I'm sorry that this answer is pretty plain Python...

import pandas as pd
import numpy as np

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
     'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
     'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
     'Z' : pd.Series(['First',' ','Last','First',' ','Last'], index =['1','2','3','4','5','6'])}

ret = pd.Series([None,None,None,None,None,None], index = ['1','2','3','4','5','6'])
for k in [str(n) for n in range(1,7)]:
    if(input_table['Z'][k]=='First'):
        op = input_table['Y'][k]
    else:
        if(input_table['Y'][k]==input_table['Y'][str(int(k)-1)]):
            op = ret[str(int(k)-1)]
        else:
            op = ret[str(int(k)-1)]+input_table['Y'][k]
    ret[k]=op

input_table['Flowpath'] = ret
output_table = pd.DataFrame(input_table)
print(output_table)

Prints:

  Flowpath    W   X  Y      Z
1        A  1.1   7  A  First
2       AB  2.1   8  B       
3      ABC  3.1   9  C   Last
4        D  4.1  10  D  First
5       DE  5.1  11  E       
6       DE  6.1  12  E   Last
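
One of the answers above warns that the loop misbehaves when the first row is not marked 'First'. A rough sketch of one way to guard against that, assuming it is acceptable to treat the very first row as the start of a path regardless of its Z value: the helper flowpaths below is hypothetical (not from any answer) and works on positions rather than index labels, so it does not care whether the index holds strings or integers.

```python
import pandas as pd


def flowpaths(steps, markers):
    """Return the running path for every row.

    A new path starts at a 'First' marker, or at the very first row.
    A step equal to the previous row's step leaves the path unchanged.
    """
    paths = []
    prev_step = None
    for step, marker in zip(steps, markers):
        if marker == 'First' or not paths:
            path = step
        elif step == prev_step:
            path = paths[-1]
        else:
            path = paths[-1] + step
        paths.append(path)
        prev_step = step
    return paths


# The first row is deliberately NOT marked 'First' to exercise the guard.
df = pd.DataFrame({'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                   'Z': [' ', ' ', 'Last', 'First', ' ', 'Last']},
                  index=['1', '2', '3', '4', '5', '6'])
df['Flowpath'] = flowpaths(df['Y'], df['Z'])

print(df)
```

Because the helper only takes two sequences, it can also be tried on plain lists before wiring it into a DataFrame.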
