Complex Pandas Dataframe Manipulation

Question

I have a data frame that looks something like this:

import pandas as pd
df = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E'],
                   'ID2': ['B', 'A', 'D', 'C', 'E'],
                   'Account': ['94000', '94500', '94000', '18300', '94500'],
                   'Amount': [100, -100, 50, -50, 100],
                   'Match': ['-', '-', '-', '-', '-']})
df

I am struggling to find the most efficient way to identify when an item in 'ID1' is also present in 'ID2' on a row with a particular value of 'Account'. For example, the condition Account == '94500' should yield:

df = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E'],
                   'ID2': ['B', 'A', 'D', 'C', 'E'],
                   'Account': ['94000', '94500', '94000', '18300', '94500'],
                   'Amount': [100, -100, 50, -50, 100],
                   'Match': ['True', '-', '-', '-', '-']})
df

i.e. only the first row should be tagged, because its ID1 value 'A' appears in ID2 on a row where Account is 94500.

Answer 1

You can use pandas apply:

df['Match'] = df['ID1'].apply(lambda x: any((df['ID2']==x) & (df['Account']=='94500')))

Which gives:

  Account  Amount ID1 ID2  Match
0   94000     100   A   B   True
1   94500    -100   B   A  False
2   94000      50   C   D  False
3   18300     -50   D   C  False
4   94500     100   E   E   True

In words, the logic is: "For each element in ID1 (apply), check whether there is at least one row (any) of the dataframe where ID2 equals that element and Account is 94500."
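To make that sentence concrete, here is the same logic written as an explicit loop. This is only an illustrative sketch on the question's data; the names `matches` and `hit` are not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E'],
                   'ID2': ['B', 'A', 'D', 'C', 'E'],
                   'Account': ['94000', '94500', '94000', '18300', '94500'],
                   'Amount': [100, -100, 50, -50, 100],
                   'Match': ['-', '-', '-', '-', '-']})

matches = []  # illustrative name, one boolean per row
for id1 in df['ID1']:
    # Rows where ID2 equals this ID1 value AND Account is '94500'
    hit = (df['ID2'] == id1) & (df['Account'] == '94500')
    # True if at least one such row exists
    matches.append(bool(hit.any()))

df['Match'] = matches
print(df['Match'].tolist())  # [True, False, False, False, True]
```

Note that row 4 is flagged as well: 'E' is its own ID2 value on a row with Account 94500. Also, every row is rescanned for each ID1 value, so the work grows roughly with the square of the number of rows.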

Answer 2

Your explanation is a bit unclear, but I think you want this:

mask = df[df.Account == '94500'].ID2
df.loc[df.ID1.isin(mask),"Match"] = True

  Account  Amount ID1 ID2 Match
0   94000     100   A   B  True
1   94500    -100   B   A     -
2   94000      50   C   D     -
3   18300     -50   D   C     -
4   94500     100   E   E  True
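The two lines above can be read as three steps. The sketch below prints the intermediate values on the question's data; the names `ids_on_account` and `in_ids` are illustrative (in the answer, `mask` actually holds ID values rather than booleans).

```python
import pandas as pd

df = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E'],
                   'ID2': ['B', 'A', 'D', 'C', 'E'],
                   'Account': ['94000', '94500', '94000', '18300', '94500'],
                   'Amount': [100, -100, 50, -50, 100],
                   'Match': ['-', '-', '-', '-', '-']})

# Step 1: ID2 values taken from the rows whose Account is '94500'
ids_on_account = df.loc[df['Account'] == '94500', 'ID2']
print(ids_on_account.tolist())  # ['A', 'E']

# Step 2: one boolean per row, True where ID1 is among those values
in_ids = df['ID1'].isin(ids_on_account)
print(in_ids.tolist())          # [True, False, False, False, True]

# Step 3: overwrite Match only on the flagged rows; the rest keep '-'
df.loc[in_ids, 'Match'] = True
print(df['Match'].tolist())     # [True, '-', '-', '-', True]
```

Unlike the apply version, this leaves the '-' placeholder in place on non-matching rows, so Match ends up as an object column mixing True and '-'.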

Also, comparing the timing of both correct answers, just for fun:

%timeit -r 10 df['Match'] = df['ID1'].apply(lambda x: any((df['ID2']==x) & (df['Account']=='94500')))
100 loops, best of 10: 4.21 ms per loop

%timeit -r 10 df.loc[df.ID1.isin(df[df.Account == '94500'].ID2),"Match"] = True
1000 loops, best of 10: 1.48 ms per loop
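%timeit only works in IPython/Jupyter, and a five-row frame says little about scaling. The sketch below repeats the comparison with the standard-library timeit module on a synthetic frame. The frame size, the random data, and the function names `with_apply` and `with_isin` are assumptions made for illustration, and timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500  # assumed size, chosen only for illustration
ids = np.array([f'id{i}' for i in range(n)])
big = pd.DataFrame({'ID1': ids,
                    'ID2': rng.permutation(ids),
                    'Account': rng.choice(['94000', '94500', '18300'], size=n)})

def with_apply(frame):
    # Answer 1: rescans the whole frame once per ID1 value
    return frame['ID1'].apply(
        lambda x: ((frame['ID2'] == x) & (frame['Account'] == '94500')).any())

def with_isin(frame):
    # Answer 2: one filter plus one membership test
    return frame['ID1'].isin(frame.loc[frame['Account'] == '94500', 'ID2'])

# Both approaches should flag exactly the same rows
assert with_apply(big).equals(with_isin(big))

for fn in (with_apply, with_isin):
    best = min(timeit.repeat(lambda: fn(big), number=1, repeat=3))
    print(f'{fn.__name__}: {best * 1e3:.2f} ms')
```

Because the apply version does a full scan per row, the gap between the two should widen as the frame grows.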

Update to address a new use case

You mentioned that you run into problems when there are two columns you want to filter on. Again, I am not sure I understood correctly, but here is my take on it. Suppose you have another column, Prod, and you want to select on both Account == 94500 and Prod == 6901.

In this case:

df = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E'],
                   'ID2': ['B', 'A', 'D', 'C', 'E'],
                   'Account': ['94000', '94500', '94000', '18300', '94500'],
                   'Amount': [100, -100, 50, -50, 100],
                   'Match': ['-', '-', '-', '-', '-'],
                   'Prod': [0, 6901, 0, 0, 0]})

mask = df[(df.Account == '94500') & (df.Prod == 6901)].ID2
df.loc[df.ID1.isin(mask),"Match"] = True

Result:

  Account  Amount ID1 ID2 Match  Prod
0   94000     100   A   B  True     0
1   94500    -100   B   A     -  6901
2   94000      50   C   D     -     0
3   18300     -50   D   C     -     0
4   94500     100   E   E     -     0

Now only 'A' in ID1 matches, since 'A' is the ID2 value in the second row (the only row that meets both conditions), so only the first row is flagged.
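If the number of filter columns keeps changing, the pattern can be wrapped in a small function. The helper below, `flag_matches`, is hypothetical (not part of pandas or of the original answer) and assumes the ID1, ID2 and Match column names from the question.

```python
import pandas as pd

def flag_matches(frame, conditions, flag=True):
    """Hypothetical helper: return a copy of `frame` with Match set to `flag`
    where ID1 appears in ID2 on a row meeting every column == value condition."""
    selected = pd.Series(True, index=frame.index)
    for column, value in conditions.items():
        # Narrow the selection one equality condition at a time
        selected &= frame[column] == value
    ids = frame.loc[selected, 'ID2']
    out = frame.copy()  # leave the caller's frame untouched
    out.loc[out['ID1'].isin(ids), 'Match'] = flag
    return out

df = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E'],
                   'ID2': ['B', 'A', 'D', 'C', 'E'],
                   'Account': ['94000', '94500', '94000', '18300', '94500'],
                   'Amount': [100, -100, 50, -50, 100],
                   'Match': ['-', '-', '-', '-', '-'],
                   'Prod': [0, 6901, 0, 0, 0]})

result = flag_matches(df, {'Account': '94500', 'Prod': 6901})
print(result['Match'].tolist())  # [True, '-', '-', '-', '-']
```

Passing only {'Account': '94500'} reproduces the single-condition case, which flags rows 0 and 4.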

solution1
2 2017-07-26 17:31:19

solution2
2 ACCEPTED 2017-07-26 17:33:24