How to create a new boolean column based on conditions between two or three columns from two dataframes?

I have two dataframes of different sizes, df1 and df2. I'm trying to check whether the values from df1 exist in a column of df2, and return True or False in new columns in df1.

The first dataframe is my reference. It's extracted from an xls file.

df1.head(10)
Out[29]: 
    PO Number  Sales Document           SO           DO  Document Number
0  3620556930    9001724124.0 4001458660.0 8001721322.0       1500017748
1  3620556930    9001723883.0 4001458865.0 8001721037.0       1500017540
2  3620556930    9001723884.0 4001459374.0 8001721038.0       1500017541
3  3620556930    9001723885.0 4001458101.0 8001721043.0       1500017542
4  3620547728    9001721907.0 4001457180.0 8001719172.0       1500015786
5  3620556930    9001721908.0 4001457724.0 8001719173.0       1500015787
6    TT030720             nan          nan          nan        700001897
7  3620518726    9600008914.0 5600008655.0 5600008655.0       1500008725
8  3620518726    9600008912.0 5600008653.0 5600008653.0       1500008723
9  3620518726    9600008913.0 5600008654.0 5600008654.0       1500008724

The second dataframe is from a table I scraped from a website.

df2.head(10)
Out[32]: 
        PO No         Doc Type  SUS Doc No                    GR_GA   Inv_SO_DO  Doc Date
0  3620556930   Purchase Order  8001294233                      CSL              27.08.2020
1  3620556930    Goods Receipt  7903307400           Goods Received  4001457724  04.09.2020
2  3620556930    Goods Receipt  7903307457           Goods Accepted  4001457724  04.09.2020
3  3620556930  Payment Request  3102053949              CCM Invoice  9001721908  23.09.2020
4  3620556930    Goods Receipt  7903333326           Goods Received  4001458660  29.09.2020
5  3620556930    Goods Receipt  7903333325           Goods Received  4001458101  29.09.2020
6  3620556930    Goods Receipt  7903333322           Goods Received  4001458865  29.09.2020
7  3620556930    Goods Receipt  7903333327           Goods Accepted  4001458660  29.09.2020
8  3620556930    Goods Receipt  7903333324           Goods Received  4001458660  29.09.2020
9  3620556930    Goods Receipt  7903333329           Goods Accepted  4001458865  29.09.2020

My thought process for getting the output is as follows:

  1. I'll create three additional columns in df1, named df1['GR', 'GA', 'Inv'].
  2. I'll use the values from df1['SO'] and df1['DO'] to check if they exist in df2['Inv_SO_DO'].
  3. If the values exist, I'll then check df2['GR_GA'] to see whether it's a Goods Receipt, Goods Acceptance or Invoice. I'll then return True or False in columns df1['GR', 'GA', 'Inv'] depending on this check.

I've tried a for loop, shown below, to create a list of values to be added for ['GA'], but it just gave me a list of Falses.

ga = []
t1 = x.iloc[:,2].values
t2 = y.iloc[:,4].values
t3 = y.iloc[:,3].values
for i in t1:
    for j in t2:
        for k in t3:
            if i == j and k == 'Goods Receipt':
                ga.append('True') 
                
            else:
                ga.append('False')

The closest I got to a solution is from another question here. I tried the code and modified it, but it didn't turn out right either. Either that, or I'm applying the code from the link incorrectly.

Any advice would be most welcome!

Desired output:

df1.head(4)
Out[43]: 
    PO Number  Sales Document           SO           DO  Document Number     GR     GA    Inv
0  3620556930    9001724124.0 4001458660.0 8001721322.0       1500017748   True   True   True
1  3620556930    9001723883.0 4001458865.0 8001721037.0       1500017540   True  False  False
2  3620556930    9001723884.0 4001459374.0 8001721038.0       1500017541  False  False  False
3  3620556930    9001723885.0 4001458101.0 8001721043.0       1500017542   True   True  False

One way you could do this is the following:

  1. Merge df1 and df2 on either DO or SO (from the left) to Inv_SO_DO (from the right). Note that in your case each SO value corresponds to multiple rows in df2, so perhaps you'll need to amend the merging logic a bit (e.g. keep only the latest matching row in df2?).
  2. "Dummify" the GR_GA column with pd.get_dummies(), cast the dummies to boolean type, and then concatenate them with the columns you need from the merged df.

For example:

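import pandas as pd

# The names below assume spaces were removed beforehand from the column names
# and GR_GA values (e.g. "PO Number" -> "PO_Number", "Goods Accepted" -> "GoodsAccepted").
m = pd.concat([df1.merge(df2, left_on='SO', right_on='Inv_SO_DO', how='inner'),
               df1.merge(df2, left_on='DO', right_on='Inv_SO_DO', how='inner')
              ], ignore_index=True)  # fresh index so the axis=1 concat below aligns row by row

desired_cols = ["PO_Number", "Sales_Document", "SO", "DO", "Document_Number", "CSL", "GoodsAccepted", "GoodsReceived", "CCMInvoice"]
pd.concat([m, pd.get_dummies(m['GR_GA']).astype(bool)], axis=1)[desired_cols]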

This gives a result as follows:

    PO_Number  Sales_Document          SO          DO  Document_Number    CSL  GoodsAccepted  GoodsReceived  CCMInvoice
0  3620556930      9001724124  4001458660  8001721322       1500017748  False          False           True       False
1  3620556930      9001724124  4001458660  8001721322       1500017748  False           True          False       False
2  3620556930      9001724124  4001458660  8001721322       1500017748  False          False           True       False
3  3620556930      9001723883  4001458865  8001721037       1500017540  False          False           True       False
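
A merge only matches keys that compare equal. In the question's printout, SO and DO look like floats (4001458660.0), while Inv_SO_DO has a blank entry and may well be text. The following is a minimal sketch of aligning the keys first, assuming that is the case. The helper name normalise_key and the two tiny frames are made up for illustration.

```python
import pandas as pd

def normalise_key(series: pd.Series) -> pd.Series:
    """Hypothetical helper: coerce IDs to numbers so 4001458660.0 and '4001458660' compare equal."""
    # Blanks and non-numeric text become NaN instead of raising.
    return pd.to_numeric(series, errors="coerce")

# Illustrative stand-ins for df1 and df2 (not the real data).
left = pd.DataFrame({"SO": [4001458660.0, float("nan")]})
right = pd.DataFrame({"Inv_SO_DO": ["", "4001458660"],
                      "GR_GA": ["CSL", "Goods Received"]})

left["SO"] = normalise_key(left["SO"])
right["Inv_SO_DO"] = normalise_key(right["Inv_SO_DO"])

# merge() pairs missing keys with each other, so drop them on one side first.
merged = left.merge(right.dropna(subset=["Inv_SO_DO"]),
                    left_on="SO", right_on="Inv_SO_DO", how="inner")
print(merged)
```

Dropping the missing keys matters here: without it, a df1 row with no SO would be paired with every df2 row whose Inv_SO_DO is blank.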

Again, note that because each SO and DO in the example df1 that you provided can match more than one row in df2, perhaps you will need to add some custom logic on how to merge.
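
One possible form of that custom logic is to skip the merge and build each flag with isin(), which keeps exactly one row per df1 row. This is only a sketch. The sample frames are cut down from the rows shown in the question, and the mapping from GR, GA and Inv to the GR_GA labels is an assumption that may need adjusting.

```python
import pandas as pd

# Small sample built from a few rows shown in the question.
df1 = pd.DataFrame({
    "SO": [4001458660.0, 4001458865.0, 4001459374.0, float("nan")],
    "DO": [8001721322.0, 8001721037.0, 8001721038.0, float("nan")],
})
df2 = pd.DataFrame({
    "GR_GA": ["CSL", "Goods Received", "Goods Accepted", "Goods Received"],
    "Inv_SO_DO": ["", "4001458660", "4001458660", "4001458865"],
})

# Assumed mapping from the new column names to the GR_GA labels.
labels = {"GR": "Goods Received", "GA": "Goods Accepted", "Inv": "CCM Invoice"}

# Numeric keys so that 4001458660.0 and "4001458660" compare equal.
keys = pd.to_numeric(df2["Inv_SO_DO"], errors="coerce")

for col, label in labels.items():
    # IDs in df2 that carry this label (blank IDs dropped).
    ids = keys[df2["GR_GA"] == label].dropna().unique()
    # True when either the SO or the DO of the df1 row appears among those IDs.
    df1[col] = df1["SO"].isin(ids) | df1["DO"].isin(ids)

print(df1)
```

In the sample shown in the question, the CCM Invoice row carries a number that matches Sales Document rather than SO or DO, so the Inv flag may need to be checked against that column instead.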
