How to create a new boolean column based on conditions between two or three columns from two dataframes?

I have two dataframes of different sizes, df1 and df2. I'm trying to check whether values from df1 exist in a column of df2, and to record True or False in new columns of df1.

The first dataframe is my reference. It's extracted from an xls file.

df1.head(10)
Out[29]: 
    PO Number  Sales Document           SO           DO  Document Number
0  3620556930    9001724124.0 4001458660.0 8001721322.0       1500017748
1  3620556930    9001723883.0 4001458865.0 8001721037.0       1500017540
2  3620556930    9001723884.0 4001459374.0 8001721038.0       1500017541
3  3620556930    9001723885.0 4001458101.0 8001721043.0       1500017542
4  3620547728    9001721907.0 4001457180.0 8001719172.0       1500015786
5  3620556930    9001721908.0 4001457724.0 8001719173.0       1500015787
6    TT030720             nan          nan          nan        700001897
7  3620518726    9600008914.0 5600008655.0 5600008655.0       1500008725
8  3620518726    9600008912.0 5600008653.0 5600008653.0       1500008723
9  3620518726    9600008913.0 5600008654.0 5600008654.0       1500008724

The second dataframe is from a table I scraped from a website.

df2.head(10)
Out[32]: 
        PO No         Doc Type  SUS Doc No                    GR_GA   Inv_SO_DO  Doc Date
0  3620556930   Purchase Order  8001294233                      CSL              27.08.2020
1  3620556930    Goods Receipt  7903307400           Goods Received  4001457724  04.09.2020
2  3620556930    Goods Receipt  7903307457           Goods Accepted  4001457724  04.09.2020
3  3620556930  Payment Request  3102053949              CCM Invoice  9001721908  23.09.2020
4  3620556930    Goods Receipt  7903333326           Goods Received  4001458660  29.09.2020
5  3620556930    Goods Receipt  7903333325           Goods Received  4001458101  29.09.2020
6  3620556930    Goods Receipt  7903333322           Goods Received  4001458865  29.09.2020
7  3620556930    Goods Receipt  7903333327           Goods Accepted  4001458660  29.09.2020
8  3620556930    Goods Receipt  7903333324           Goods Received  4001458660  29.09.2020
9  3620556930    Goods Receipt  7903333329           Goods Accepted  4001458865  29.09.2020

My thought process in getting the output is as below:

  1. I'll create three additional columns in df1: df1['GR'], df1['GA'] and df1['Inv'].
  2. I'll check whether the values in df1['SO'] and df1['DO'] exist in df2['Inv_SO_DO'].
  3. Where a value exists, I'll check whether the matching df2['GR_GA'] entry is a goods receipt (Goods Received), a goods acceptance (Goods Accepted) or an invoice (CCM Invoice), and set df1['GR'], df1['GA'] and df1['Inv'] to True or False accordingly.

I've tried a for loop, as below, to build the list of values for ['GA'], but it just gave me a list of Falses.

ga = []
t1 = x.iloc[:,2].values
t2 = y.iloc[:,4].values
t3 = y.iloc[:,3].values
for i in t1:
    for j in t2:
        for k in t3:
            if i == j and k == 'Goods Receipt':
                ga.append('True') 
                
            else:
                ga.append('False')

The closest I got to a solution is from another question here. I tried that code and modified it, but it didn't turn out right either. Either that, or I'm applying the code from the link incorrectly.

Any advice would be most welcome!

Output desired:

df1.head(4)
Out[43]: 
    PO Number  Sales Document           SO           DO  Document Number     GR     GA    Inv
0  3620556930    9001724124.0 4001458660.0 8001721322.0       1500017748   True   True   True
1  3620556930    9001723883.0 4001458865.0 8001721037.0       1500017540   True  False  False
2  3620556930    9001723884.0 4001459374.0 8001721038.0       1500017541  False  False  False
3  3620556930    9001723885.0 4001458101.0 8001721043.0       1500017542   True   True  False

One way you could do this is the following:

  1. Merge df1 with df2, matching SO or DO (from the left) against Inv_SO_DO (from the right). Note that in your data each SO value can correspond to several rows in df2, so you may need to adjust the merge logic (e.g. keep only the latest matching row in df2).
  2. "Dummify" the GR_GA column with pd.get_dummies(), cast the dummies to boolean, and concatenate them with the columns you need from the merged dataframe.

For example:

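import pandas as pd

# Match df1 rows to df2 on SO, then on DO, and stack both sets of matches.
m = pd.concat([df1.merge(df2, left_on='SO', right_on='Inv_SO_DO', how='inner'),
               df1.merge(df2, left_on='DO', right_on='Inv_SO_DO', how='inner')
              ])

# These names assume spaces were already removed from the column names and GR_GA values, as in the result shown.
desired_cols = ["PO_Number", "Sales_Document", "SO", "DO", "Document_Number", "CSL", "GoodsAccepted", "GoodsReceived", "CCMInvoice"]
pd.concat([m, pd.get_dummies(m['GR_GA']).astype(bool)], axis=1)[desired_cols]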

This gives a result as follows:

    PO_Number  Sales_Document          SO          DO  Document_Number    CSL  GoodsAccepted  GoodsReceived  CCMInvoice
0  3620556930      9001724124  4001458660  8001721322       1500017748  False          False           True       False
1  3620556930      9001724124  4001458660  8001721322       1500017748  False           True          False       False
2  3620556930      9001724124  4001458660  8001721322       1500017748  False          False           True       False
3  3620556930      9001723883  4001458865  8001721037       1500017540  False          False           True       False
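If the goal is one row per df1 row, as in the desired output, one option is to take the logical OR of the dummy columns over all matches that belong to the same df1 row. The following is only a sketch on toy data. It assumes the original column names (with spaces), that Inv_SO_DO may arrive as text from the scrape, and that "Goods Received", "Goods Accepted" and "CCM Invoice" map to GR, GA and Inv. The helper name flag_matches and the LABELS mapping are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (a few values taken from the question).
df1 = pd.DataFrame({
    "SO": [4001458660.0, 4001458865.0, 4001459374.0, float("nan")],
    "DO": [8001721322.0, 8001721037.0, 8001721038.0, float("nan")],
})
df2 = pd.DataFrame({
    "GR_GA": ["CSL", "Goods Received", "Goods Accepted", "Goods Received"],
    "Inv_SO_DO": ["", "4001458660", "4001458660", "4001458865"],
})

# Assumed mapping from GR_GA labels to the new df1 column names.
LABELS = {"Goods Received": "GR", "Goods Accepted": "GA", "CCM Invoice": "Inv"}


def flag_matches(df1, df2):
    """Hypothetical helper: one boolean column per label, one row per df1 row."""
    right = df2[["Inv_SO_DO", "GR_GA"]].copy()
    # Scraped values may be strings; coerce to numbers so they compare with df1's floats.
    right["Inv_SO_DO"] = pd.to_numeric(right["Inv_SO_DO"], errors="coerce")
    right = right.dropna(subset=["Inv_SO_DO"])

    # Long table of (df1 row id, key), with one entry each for SO and DO.
    keys = (df1[["SO", "DO"]]
            .rename_axis("row")
            .reset_index()
            .melt(id_vars="row", value_name="key")
            .dropna(subset=["key"]))

    merged = keys.merge(right, left_on="key", right_on="Inv_SO_DO", how="inner")
    dummies = pd.get_dummies(merged["GR_GA"]).astype(bool)
    # Logical OR over all matches that belong to the same df1 row.
    flags = dummies.groupby(merged["row"]).any()
    # Keep only the mapped labels; missing labels and unmatched rows become False.
    flags = (flags.reindex(columns=list(LABELS), fill_value=False)
                  .rename(columns=LABELS)
                  .reindex(df1.index, fill_value=False))
    return df1.join(flags)


result = flag_matches(df1, df2)
print(result)
```

Here the first toy row matches both a "Goods Received" and a "Goods Accepted" entry, so GR and GA are both True for it, while rows with no match (or with NaN keys) end up False in every new column.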

Again, note that because each SO and DO in the example df1 you provided can match more than one row in df2, you may need to add some custom logic for how to merge.
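The three steps described in the question can also be sketched without a merge, using isin() once per label, which avoids the duplicate-row issue entirely. This is a sketch under assumptions rather than a definitive solution: the LABELS mapping and the KEY_COLS list are hypothetical names, Inv_SO_DO is assumed to be text that needs converting to numbers, and only SO and DO are checked, as the question states. In the sample data the "CCM Invoice" row carries a value (9001721908) that appears under Sales Document in df1, so that column may need adding to KEY_COLS for the Inv flag.

```python
import pandas as pd

# Toy data shaped like the question's frames (values taken from the samples).
df1 = pd.DataFrame({
    "Sales Document": [9001724124.0, 9001721908.0, float("nan")],
    "SO": [4001458660.0, 4001457724.0, float("nan")],
    "DO": [8001721322.0, 8001719173.0, float("nan")],
})
df2 = pd.DataFrame({
    "GR_GA": ["CSL", "Goods Received", "Goods Accepted", "CCM Invoice", "Goods Received"],
    "Inv_SO_DO": ["", "4001457724", "4001457724", "9001721908", "4001458660"],
})

# Assumption: the df1 columns whose values are looked up in df2["Inv_SO_DO"].
KEY_COLS = ["SO", "DO"]
# Assumption: how GR_GA labels map to the new boolean columns.
LABELS = {"Goods Received": "GR", "Goods Accepted": "GA", "CCM Invoice": "Inv"}

# Convert the scraped text to numbers; blanks become NaN.
ref = pd.to_numeric(df2["Inv_SO_DO"], errors="coerce")

for label, new_col in LABELS.items():
    # All document numbers that df2 lists under this label.
    matched = ref[df2["GR_GA"] == label].dropna().unique().tolist()
    # True if any of the key columns holds one of those numbers.
    df1[new_col] = df1[KEY_COLS].isin(matched).any(axis=1)

print(df1)
```

Because each new column is computed directly on df1, the result always has exactly one row per df1 row, and rows with NaN keys simply come out False.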
