简体   繁体   中英

Pandas Dataframe: Creating column based on multiple conditions

Edited

I'm sorry I didn't post it right the first time. The solutions suggested worked if there were only two entries of each Lead ID with Lead Status "A". I am changing my data. I again apologize.

Data:

Lead ID     Lead Status      Duration     Target
1-1H9C0XL   Too Small       -0.466177     1
1-1H9G33C   A               -0.620709     0 
1-1H9G33C   A               -0.500709     0
1-1H9G33C   A                0.337401     0
4-1HFORF8   No Fit          -0.343840     1
4-1HFSXOG   No Fit          -0.124920     1
4-1HLQ2IJ   A               -0.330962     0 
4-1HLQ2IJ   A                0.130818     0
4-1HLQ2IJ   A               -0.400817     0
4-1HLQ2IJ   A                0.240818     0

I want to accomplish following:

If there is a duplicate in the Lead ID and Lead Status , make all the Target values "1" for that LeadID with shorter Duration .

Desired Output

Lead ID     Lead Status      Duration     Target
1-1H9C0XL   Too Small       -0.466177     1
1-1H9G33C   A               -0.620709     1 
1-1H9G33C   A               -0.500709     1
1-1H9G33C   A                0.337401     0
4-1HFORF8   No Fit          -0.343840     1
4-1HFSXOG   No Fit          -0.124920     1
4-1HLQ2IJ   A               -0.330962     1 
4-1HLQ2IJ   A                0.130818     1
4-1HLQ2IJ   A               -0.400817     1
4-1HLQ2IJ   A                0.240818     0

I am not able to implement a condition of checking for duplicates and the value in duration to update the last column. I appreciate any assistance a lot.

Try this(assuming your df is sorted)

df.loc[df[df.duplicated(['LeadID','LeadStatus'],keep=False)].drop_duplicates(['LeadID','LeadStatus'],keep='first').index,'Target']=1
df
Out[895]: 
      LeadID LeadStatus  Duration  Target
0  1-1H9C0XL   TooSmall    -0.466       1
1  1-1H9G33C          A    -0.621       1
2  1-1H9G33C          A     0.337       0
3  4-1HFORF8      NoFit    -0.344       1
4  4-1HFSXOG      NoFit    -0.125       1
5  4-1HLQ2IJ          A    -0.331       1
6  4-1HLQ2IJ          A     0.241       0

Update


df=df.sort_values(['LeadID','LeadStatus','Duration'])

df.loc[df[df.duplicated(['LeadID','LeadStatus'],keep='last')].index,'Target']=1

Out[911]: 
      LeadID LeadStatus  Duration  Target
0  1-1H9C0XL   TooSmall    -0.466       1
1  1-1H9G33C          A    -0.621       1
2  1-1H9G33C          A    -0.501       1
3  1-1H9G33C          A     0.337       0
4  4-1HFORF8      NoFit    -0.344       1
5  4-1HFSXOG      NoFit    -0.125       1
8  4-1HLQ2IJ          A    -0.401       1
6  4-1HLQ2IJ          A    -0.331       1
7  4-1HLQ2IJ          A     0.131       1
9  4-1HLQ2IJ          A     0.241       0

Here is an idiomatic and performant answer.

df['Target'] += df.sort_values('Duration')\
                  .duplicated(subset=['Lead ID', 'Lead Status'], keep='last')

If you don't assume unique rows have a 1 then you can do the following.

df1 = df.sort_values('Duration')
unique = ~df1.duplicated(subset=['Lead ID', 'Lead Status'], keep=False) * 1
first = df1.duplicated(subset=['Lead ID', 'Lead Status'], keep='last') * 1
df['Target'] = unique + first

And a less performant way:

df.groupby(['Lead ID', 'Lead Status'])['Duration']\
  .transform(lambda x: 1 if len(x) == 1 else x < x.max())

     Lead ID Lead Status  Duration  Target
0  1-1H9C0XL   Too Small -0.466177       1
1  1-1H9G33C           A -0.620709       1
2  1-1H9G33C           A -0.500709       0
3  1-1H9G33C           A  0.337401       1
4  4-1HFORF8      No Fit -0.343840       1
5  4-1HFSXOG      No Fit -0.124920       1
6  4-1HLQ2IJ           A -0.330962       1
7  4-1HLQ2IJ           A  0.130818       1
8  4-1HLQ2IJ           A -0.400817       1
9  4-1HLQ2IJ           A  0.240818       0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM