[英]Pandas Dataframe: Creating column based on multiple conditions
Edited 编辑
I'm sorry I didn't post it right the first time. 抱歉,我第一次没有正确发布。 The solutions suggested worked if there were only two entries of each
Lead ID
with Lead Status
"A". 如果每个
Lead ID
只有两个条目的Lead Status
“ A”,则建议的解决方案有效。 I am changing my data. 我正在更改数据。 I again apologize.
我再次道歉。
Data: 数据:
Lead ID Lead Status Duration Target
1-1H9C0XL Too Small -0.466177 1
1-1H9G33C A -0.620709 0
1-1H9G33C A -0.500709 0
1-1H9G33C A 0.337401 0
4-1HFORF8 No Fit -0.343840 1
4-1HFSXOG No Fit -0.124920 1
4-1HLQ2IJ A -0.330962 0
4-1HLQ2IJ A 0.130818 0
4-1HLQ2IJ A -0.400817 0
4-1HLQ2IJ A 0.240818 0
I want to accomplish following: 我要完成以下工作:
If there is a duplicate in the Lead ID
and Lead Status
, make all the Target
values "1" for that LeadID
with shorter Duration
. 如果销售
Lead ID
和销售Lead Status
重复, LeadID
Duration
较短的该LeadID
所有Target
都设置为“ 1”。
Desired Output 期望的输出
Lead ID Lead Status Duration Target
1-1H9C0XL Too Small -0.466177 1
1-1H9G33C A -0.620709 1
1-1H9G33C A -0.500709 1
1-1H9G33C A 0.337401 0
4-1HFORF8 No Fit -0.343840 1
4-1HFSXOG No Fit -0.124920 1
4-1HLQ2IJ A -0.330962 1
4-1HLQ2IJ A 0.130818 1
4-1HLQ2IJ A -0.400817 1
4-1HLQ2IJ A 0.240818 0
I am not able to implement a condition of checking for duplicates and the value in duration to update the last column. 我无法实现检查重复项和持续时间值以更新最后一列的条件。 I appreciate any assistance a lot.
非常感谢您的协助。
Try this(assuming your df is sorted) 试试这个(假设您的df已排序)
df.loc[df[df.duplicated(['LeadID','LeadStatus'],keep=False)].drop_duplicates(['LeadID','LeadStatus'],keep='first').index,'Target']=1
df
Out[895]:
LeadID LeadStatus Duration Target
0 1-1H9C0XL TooSmall -0.466 1
1 1-1H9G33C A -0.621 1
2 1-1H9G33C A 0.337 0
3 4-1HFORF8 NoFit -0.344 1
4 4-1HFSXOG NoFit -0.125 1
5 4-1HLQ2IJ A -0.331 1
6 4-1HLQ2IJ A 0.241 0
Update 更新
df=df.sort_values(['LeadID','LeadStatus','Duration'])
df.loc[df[df.duplicated(['LeadID','LeadStatus'],keep='last')].index,'Target']=1
Out[911]:
LeadID LeadStatus Duration Target
0 1-1H9C0XL TooSmall -0.466 1
1 1-1H9G33C A -0.621 1
2 1-1H9G33C A -0.501 1
3 1-1H9G33C A 0.337 0
4 4-1HFORF8 NoFit -0.344 1
5 4-1HFSXOG NoFit -0.125 1
8 4-1HLQ2IJ A -0.401 1
6 4-1HLQ2IJ A -0.331 1
7 4-1HLQ2IJ A 0.131 1
9 4-1HLQ2IJ A 0.241 0
Here is an idiomatic and performant answer. 这是一个惯用且高效的答案。
df['Target'] += df.sort_values('Duration')\
.duplicated(subset=['Lead ID', 'Lead Status'], keep='last')
If you don't assume unique rows have a 1 then you can do the following. 如果您不假设唯一行的值为1,则可以执行以下操作。
df1 = df.sort_values('Duration')
unique = ~df1.duplicated(subset=['Lead ID', 'Lead Status'], keep=False) * 1
first = df1.duplicated(subset=['Lead ID', 'Lead Status'], keep='last') * 1
df['Target'] = unique + first
And a less performant way: 和一种性能较低的方式:
df.groupby(['Lead ID', 'Lead Status'])['Duration']\
.transform(lambda x: 1 if len(x) == 1 else x < x.max())
Lead ID Lead Status Duration Target
0 1-1H9C0XL Too Small -0.466177 1
1 1-1H9G33C A -0.620709 1
2 1-1H9G33C A -0.500709 0
3 1-1H9G33C A 0.337401 1
4 4-1HFORF8 No Fit -0.343840 1
5 4-1HFSXOG No Fit -0.124920 1
6 4-1HLQ2IJ A -0.330962 1
7 4-1HLQ2IJ A 0.130818 1
8 4-1HLQ2IJ A -0.400817 1
9 4-1HLQ2IJ A 0.240818 0
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.