Complex Pandas Dataframe Manipulation

I have a data frame that looks something like this:

import pandas as pd
df = pd.DataFrame({'ID1':['A','B','C','D','E'],
                   'ID2':['B','A','D','C','E'],
                   'Account':['94000','94500','94000','18300','94500'],
                   'Amount':[100,-100,50,-50,100],
                   'Match':['-','-','-','-','-']})
df

I am struggling to find the most efficient way to identify when an item in 'ID1' is also present in 'ID2' on a row with a particular value of Account. For example, the condition Account == '94500' should yield:

df = pd.DataFrame({'ID1':['A','B','C','D','E'],
                   'ID2':['B','A','D','C','E'],
                   'Account':['94000','94500','94000','18300','94500'],
                   'Amount':[100,-100,50,-50,100],
                   'Match':['True','-','-','-','-']})
df

That is, only the first row should be tagged, because its ID1 value 'A' appears in ID2 on a row where Account is 94500.

You can use pandas apply:

df['Match'] = df['ID1'].apply(lambda x: any((df['ID2']==x) & (df['Account']=='94500')))

Which gives:

  Account  Amount ID1 ID2  Match
0   94000     100   A   B   True
1   94500    -100   B   A  False
2   94000      50   C   D  False
3   18300     -50   D   C  False
4   94500     100   E   E   True

In words, the logic is: "For each element in ID1 (apply), check whether there is at least one row (any) of the dataframe where ID2 equals that element and Account is '94500'."
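The same logic can be spelled out without apply. This is only a minimal sketch: it collects the ID2 values from rows where Account is '94500', then tests each ID1 value for membership. The name id2_hits is made up for illustration and does not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E'],
                   'ID2': ['B', 'A', 'D', 'C', 'E'],
                   'Account': ['94000', '94500', '94000', '18300', '94500'],
                   'Amount': [100, -100, 50, -50, 100]})

# ID2 values found on rows where Account is '94500' (hypothetical name)
id2_hits = set(df.loc[df['Account'] == '94500', 'ID2'])

# For each ID1 value, check whether it is among those ID2 values
df['Match'] = [id1 in id2_hits for id1 in df['ID1']]
print(df['Match'].tolist())  # [True, False, False, False, True]
```

Building the set once avoids rescanning the whole frame for every element of ID1, which is what the lambda inside apply does.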

Your explanation is a bit unclear, but I think you want this:

mask = df[df.Account == '94500'].ID2
df.loc[df.ID1.isin(mask),"Match"] = True

  Account  Amount ID1 ID2 Match
0   94000     100   A   B  True
1   94500    -100   B   A     -
2   94000      50   C   D     -
3   18300     -50   D   C     -
4   94500     100   E   E  True
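If a plain boolean column is acceptable instead of the '-' placeholder, the isin result could be assigned directly. The sketch below shows that variant, which avoids mixing strings and booleans in one column. The name wanted_ids is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E'],
                   'ID2': ['B', 'A', 'D', 'C', 'E'],
                   'Account': ['94000', '94500', '94000', '18300', '94500'],
                   'Amount': [100, -100, 50, -50, 100]})

# ID2 values on rows that satisfy the Account condition (hypothetical name)
wanted_ids = df.loc[df['Account'] == '94500', 'ID2']

# isin returns one boolean per row, so it can be stored as the Match column
df['Match'] = df['ID1'].isin(wanted_ids)
print(df)
```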

For comparison, here are timings of both answers on the five-row example:

%timeit -r 10 df['Match'] = df['ID1'].apply(lambda x: any((df['ID2']==x) & (df['Account']=='94500')))
100 loops, best of 10: 4.21 ms per loop


%timeit -r 10 df.loc[df.ID1.isin(df[df.Account == '94500'].ID2),"Match"] = True
1000 loops, best of 10: 1.48 ms per loop
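A five-row frame says little about how the two approaches scale. One way to rerun the comparison on a larger frame, outside IPython, is sketched below with the standard timeit module. The frame size, the ID range, and the helper names with_apply and with_isin are arbitrary assumptions, and no timings are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000  # arbitrary size chosen for illustration
big = pd.DataFrame({
    'ID1': rng.integers(0, 500, n).astype(str),
    'ID2': rng.integers(0, 500, n).astype(str),
    'Account': rng.choice(['94000', '94500', '18300'], n),
})

def with_apply(df):
    # Row-by-row check, as in the first answer
    return df['ID1'].apply(
        lambda x: any((df['ID2'] == x) & (df['Account'] == '94500')))

def with_isin(df):
    # Vectorized membership test, as in the second answer
    return df['ID1'].isin(df.loc[df['Account'] == '94500', 'ID2'])

# Both approaches should tag exactly the same rows
assert with_apply(big).equals(with_isin(big))

print('apply:', timeit.timeit(lambda: with_apply(big), number=3))
print('isin: ', timeit.timeit(lambda: with_isin(big), number=3))
```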

Update to address a new use case

You mentioned that you run into problems when you want to filter on two columns. Again, I am not sure I understood correctly, but here is my take on it. Suppose you have another variable Prod and you want to select on both Account == '94500' and Prod == 6901.

In this case:

df = pd.DataFrame({'ID1':['A','B','C','D','E'],
                   'ID2':['B','A','D','C','E'],
                   'Account':['94000','94500','94000','18300','94500'],
                   'Amount':[100,-100,50,-50,100],
                   'Match':['-','-','-','-','-'],
                   'Prod':[0,6901,0,0,0]})

mask = df[(df.Account == '94500') & (df.Prod == 6901)].ID2
df.loc[df.ID1.isin(mask),"Match"] = True

Result:

  Account  Amount ID1 ID2 Match  Prod
0   94000     100   A   B  True     0
1   94500    -100   B   A     -  6901
2   94000      50   C   D     -     0
3   18300     -50   D   C     -     0
4   94500     100   E   E     -     0

Now only 'A' in ID1 matches the condition: 'A' appears in ID2 on the second row, which is the only row where both Account == '94500' and Prod == 6901 hold, so only the first row is tagged.
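The two-condition case extends to any number of column filters. The helper below is a hypothetical sketch and not part of the original answer: tag_matches and its conditions, left, and right parameters are names invented here. It returns a boolean Series rather than writing True into a '-' column.

```python
import pandas as pd

def tag_matches(df, conditions, left='ID1', right='ID2'):
    """Return a boolean Series that is True where df[left] appears in
    df[right] on some row satisfying every column == value pair."""
    keep = pd.Series(True, index=df.index)
    for column, value in conditions.items():
        keep &= df[column] == value  # combine all conditions with AND
    return df[left].isin(df.loc[keep, right])

df = pd.DataFrame({'ID1': ['A', 'B', 'C', 'D', 'E'],
                   'ID2': ['B', 'A', 'D', 'C', 'E'],
                   'Account': ['94000', '94500', '94000', '18300', '94500'],
                   'Amount': [100, -100, 50, -50, 100],
                   'Prod': [0, 6901, 0, 0, 0]})

df['Match'] = tag_matches(df, {'Account': '94500', 'Prod': 6901})
print(df['Match'].tolist())  # [True, False, False, False, False]
```

Passing only {'Account': '94500'} reproduces the single-condition result from earlier in the answer.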
