How to compare and then concatenate information from two different rows using Python pandas data frames
I have written this code:
import pandas as pd
import numpy as np
input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
'Z' : pd.Series(['First',' ','Last','First',' ','Last'], ['1','2','3','4','5','6'])}
output_table = pd.DataFrame(input_table)
output_table['Previous_Y'] = output_table['Y']
output_table.Previous_Y = output_table.Previous_Y.shift(1)
def Calc_flowpath(x):
    if x['Z'] == 'First':
        return x['Y']
    else:
        return x['Previous_Y'] + x['Y']
output_table['Flowpath'] = output_table.apply(Calc_flowpath, axis=1)
print(output_table)
And my output is (as expected):
     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B       BC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       EE
However, what I'm trying to do with the Flowpath function is:
If Column Z is "First", Flowpath = Column Y
If Column Z is anything else, Flowpath = Previous Flowpath value + Column Y
Unless Column Y repeats the same value, in which case skip that row.
The output I am targeting is:
     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE
To give context, these lines are steps in a manufacturing process, and I'm trying to describe the path materials take through a job shop. My data is a large number of customer orders and every step they took in the manufacturing process. Y is the manufacturing step, and column Z indicates the first and last step for each order. I'm using Knime to do the analysis but I can't find a node that will do this, so I'm trying to write a Python script myself even though I'm quite the programming novice (as you can probably see). In my previous job, I would have done this in Alteryx using the Multi-Row node but I no longer have access to that software.
I've spent a lot of time reading the Pandas documentation and I feel the solution is some combination of DataFrame.loc, DataFrame.shift, or DataFrame.cumsum, but I can't figure it out.
Any help would be greatly appreciated.
Iterate over the rows of your DataFrame and set the value of the Flowpath column following the logic you outline in the OP.
import pandas as pd
output_table = pd.DataFrame({'W' :[1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
'X': [7., 8., 9., 10., 11., 12.],
'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
index=range(1, 7))
output_table['Flowpath'] = ''
for idx in output_table.index:
    this_Z = output_table.loc[idx, 'Z']
    this_Y = output_table.loc[idx, 'Y']
    # look back one row; the first row has no predecessor
    last_Y = output_table.loc[idx-1, 'Y'] if idx > 1 else ''
    last_Flowpath = output_table.loc[idx-1, 'Flowpath'] if idx > 1 else ''
    if this_Z == 'First':
        output_table.loc[idx, 'Flowpath'] = this_Y
    elif this_Y != last_Y:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath + this_Y
    else:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath
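The loop above looks up the previous row with idx-1, so it relies on the index being the consecutive integers 1 to 6. As a minimal sketch, assuming you want the same three rules without that dependency, the previous step and path can be carried in local variables instead. The helper name add_flowpath and its parameters are hypothetical, chosen only for illustration.

```python
import pandas as pd


def add_flowpath(df, step_col="Y", marker_col="Z", out_col="Flowpath"):
    """Return a copy of df with a running flow path per order.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    paths = []
    last_step = None   # step seen on the previous row
    last_path = ""     # flow path built up so far
    for step, marker in zip(df[step_col], df[marker_col]):
        if marker == "First":
            path = step                # a new order starts here
        elif step == last_step:
            path = last_path           # repeated step: carry the path forward
        else:
            path = last_path + step    # new step: append it to the path
        paths.append(path)
        last_step, last_path = step, path
    result = df.copy()
    result[out_col] = paths
    return result


example = pd.DataFrame(
    {"Y": ["A", "B", "C", "D", "E", "E"],
     "Z": ["First", " ", "Last", "First", " ", "Last"]},
    index=list("abcdef"),  # deliberately not 1..6
)
flow = add_flowpath(example)
print(flow["Flowpath"].tolist())
```

Because the state lives in plain variables, the index labels never matter, and a table whose first row is not marked "First" simply starts from an empty path.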
You can calculate a group variable with cumsum on the condition vector where Z is First, which takes care of the first and second conditions. Then replace any value in Y that is the same as the previous one with an empty string, so that a cumsum on the Y column within each group gives the expected output:
import pandas as pd
# calculate the group variable
grp = (output_table.Z == "First").cumsum()
# calculate a condition vector where the current Y column is the same as the previous one
dup = output_table.Y.groupby(grp).apply(lambda g: g.shift() != g)
# replace the duplicated process in Y as empty string, group the column by the group variable
# calculated above and then do a cumulative sum
output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).cumsum()
output_table
#      W   X  Y      Z flowPath
# 1  1.1   7  A  First        A
# 2  2.1   8  B              AB
# 3  3.1   9  C   Last      ABC
# 4  4.1  10  D  First        D
# 5  5.1  11  E              DE
# 6  6.1  12  E   Last       DE
Update: The above code works under pandas 0.15.2 but not 0.18.1; a small tweak to the last line, as follows, fixes it:
output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).apply(pd.Series.cumsum)
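Since the behaviour of a grouped cumsum on strings has changed between pandas versions, here is a self-contained sketch of the same idea that does the running concatenation with itertools.accumulate instead. It assumes the sample data from the question; the names df, keep, steps and running_concat are illustrative, not part of the original answer.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame(
    {"Y": ["A", "B", "C", "D", "E", "E"],
     "Z": ["First", " ", "Last", "First", " ", "Last"]},
    index=range(1, 7),
)

is_first = df["Z"].eq("First")
grp = is_first.cumsum()  # order number for each row

# Keep a step if it starts an order or differs from the previous row's step.
# Comparing against the ungrouped shift is enough here because the first row
# of every group is kept anyway.
keep = is_first | df["Y"].ne(df["Y"].shift())
steps = df["Y"].where(keep, "")  # repeated steps become ""


def running_concat(s):
    # Cumulative string concatenation within one group.
    return pd.Series(list(accumulate(s)), index=s.index)


df["Flowpath"] = steps.groupby(grp).transform(running_concat)
print(df["Flowpath"].tolist())
```

The grouping and the empty-string replacement are the same as in the answer; only the cumulative step is swapped for plain Python string addition.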
for index, row in output_table.iterrows():
    # index labels are strings ('1'..'6'), so step back through int
    prev_index = str(int(index) - 1)
    if row['Z'] == 'First':
        output_table.loc[index, 'Flowpath'] = row['Y']
    elif output_table['Y'][prev_index] == row['Y']:
        output_table.loc[index, 'Flowpath'] = output_table['Flowpath'][prev_index]
    else:
        output_table.loc[index, 'Flowpath'] = output_table['Flowpath'][prev_index] + row['Y']
print(output_table)
     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE
So bad things will happen if Z['1'] != 'First', but for your case this works.
I understand you want something more Pandas-ish, so I'm sorry that this answer is pretty plain Python...
import pandas as pd
import numpy as np
input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']),
'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']),
'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']),
'Z' : pd.Series(['First',' ','Last','First',' ','Last'], index =['1','2','3','4','5','6'])}
ret = pd.Series([None,None,None,None,None,None], index = ['1','2','3','4','5','6'])
for k in [str(n) for n in range(1,7)]:
    if(input_table['Z'][k]=='First'):
        op = input_table['Y'][k]
    else:
        if(input_table['Y'][k]==input_table['Y'][str(int(k)-1)]):
            # repeated step: carry the previous path forward
            op = ret[str(int(k)-1)]
        else:
            op = ret[str(int(k)-1)]+input_table['Y'][k]
    ret[k]=op
input_table['Flowpath'] = ret
output_table = pd.DataFrame(input_table)
print(output_table)
Prints:
  Flowpath    W   X  Y      Z
1        A  1.1   7  A  First
2       AB  2.1   8  B
3      ABC  3.1   9  C   Last
4        D  4.1  10  D  First
5       DE  5.1  11  E
6       DE  6.1  12  E   Last
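Both string-indexed loops above look up the row before the current one, so they fail on the very first row unless it is marked 'First'. A small guard run before the loop can turn that into a clear error message. This is only a sketch: check_starts_with_first is a hypothetical helper, not a pandas function.

```python
import pandas as pd


def check_starts_with_first(z):
    """Raise ValueError unless the first entry of z is 'First'.

    Hypothetical helper, not part of pandas.
    """
    if len(z) == 0 or z.iloc[0] != "First":
        raise ValueError("first row must be marked 'First'")


good = pd.Series(["First", " ", "Last"], index=["1", "2", "3"])
check_starts_with_first(good)  # passes silently

bad = pd.Series([" ", "Last"], index=["1", "2"])
try:
    check_starts_with_first(bad)
    raised = False
except ValueError:
    raised = True
print(raised)
```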