How to create a new boolean column based on conditions between two or three columns from two dataframes?
I have two dataframes with different sizes, df1 and df2. I'm trying to check whether the values from df1 exist in a column in df2, and return True or False in new columns in df1.
The first dataframe is my reference. It's extracted from an xls file.
df1.head(10)
Out[29]:
PO Number Sales Document SO DO Document Number
0 3620556930 9001724124.0 4001458660.0 8001721322.0 1500017748
1 3620556930 9001723883.0 4001458865.0 8001721037.0 1500017540
2 3620556930 9001723884.0 4001459374.0 8001721038.0 1500017541
3 3620556930 9001723885.0 4001458101.0 8001721043.0 1500017542
4 3620547728 9001721907.0 4001457180.0 8001719172.0 1500015786
5 3620556930 9001721908.0 4001457724.0 8001719173.0 1500015787
6 TT030720 nan nan nan 700001897
7 3620518726 9600008914.0 5600008655.0 5600008655.0 1500008725
8 3620518726 9600008912.0 5600008653.0 5600008653.0 1500008723
9 3620518726 9600008913.0 5600008654.0 5600008654.0 1500008724
The second dataframe is from a table I scraped from a website.
df2.head(10)
Out[32]:
PO No Doc Type SUS Doc No GR_GA Inv_SO_DO Doc Date
0 3620556930 Purchase Order 8001294233 CSL 27.08.2020
1 3620556930 Goods Receipt 7903307400 Goods Received 4001457724 04.09.2020
2 3620556930 Goods Receipt 7903307457 Goods Accepted 4001457724 04.09.2020
3 3620556930 Payment Request 3102053949 CCM Invoice 9001721908 23.09.2020
4 3620556930 Goods Receipt 7903333326 Goods Received 4001458660 29.09.2020
5 3620556930 Goods Receipt 7903333325 Goods Received 4001458101 29.09.2020
6 3620556930 Goods Receipt 7903333322 Goods Received 4001458865 29.09.2020
7 3620556930 Goods Receipt 7903333327 Goods Accepted 4001458660 29.09.2020
8 3620556930 Goods Receipt 7903333324 Goods Received 4001458660 29.09.2020
9 3620556930 Goods Receipt 7903333329 Goods Accepted 4001458865 29.09.2020
My thought process for getting the output is as below:
1. I'll create three more columns in df1, named df1['GR', 'GA', 'Inv'].
2. I'll use the values in df1['SO'] and df1['DO'] to check if they exist in df2['Inv_SO_DO'].
3. If the values exist, I'll check df2['GR_GA'] to see if it's a Goods Receipt, Goods Acceptance or Invoice. I'll then return True or False in columns df1['GR', 'GA', 'Inv'] depending on this check.
I've tried a for loop as below to create a list of values to be added for ['GA'], but it just gave me a list of Falses.
ga = []
t1 = x.iloc[:,2].values
t2 = y.iloc[:,4].values
t3 = y.iloc[:,3].values
for i in t1:
    for j in t2:
        for k in t3:
            if i == j and k == 'Goods Receipt':
                ga.append('True')
            else:
                ga.append('False')
The closest I got to a solution is from another question here. I tried the code and modified it, but it didn't turn out right either. Either that, or I'm using the code from the link incorrectly.
Any advice would be most welcome!
Desired output:
df1.head(4)
Out[43]:
PO Number Sales Document SO DO Document Number GR GA Inv
0 3620556930 9001724124.0 4001458660.0 8001721322.0 1500017748 True True True
1 3620556930 9001723883.0 4001458865.0 8001721037.0 1500017540 True False False
2 3620556930 9001723884.0 4001459374.0 8001721038.0 1500017541 False False False
3 3620556930 9001723885.0 4001458101.0 8001721043.0 1500017542 True True False
One way you could do this is the following:
1. Merge df1 and df2 on either DO or SO (from the left) to Inv_SO_DO (from the right). Note that in your case each SO value corresponds to multiple rows in df2, so perhaps you'll need to amend the merging logic a bit (e.g. keep only the latest appearing row in df2?).
2. "Dummify" the GR_GA column with pd.get_dummies() and then concatenate it with the columns you need from the merged df, after casting the dummies to boolean type.
For example:
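import pandas as pd

# Inner-merge df1 with df2 twice (once on SO, once on DO) and stack the matches.
m = pd.concat([df1.merge(df2, left_on='SO', right_on='Inv_SO_DO', how='inner'),
               df1.merge(df2, left_on='DO', right_on='Inv_SO_DO', how='inner')
              ], ignore_index=True)
# Column names are written without spaces here, as in the result shown below.
desired_cols = ["PO_Number", "Sales_Document", "SO", "DO", "Document_Number", "CSL", "GoodsAccepted", "GoodsReceived", "CCMInvoice"]
# One boolean column per distinct GR_GA value.
pd.concat([m, pd.get_dummies(m['GR_GA']).astype(bool)], axis=1)[desired_cols]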
This gives a result as follows:
PO_Number Sales_Document SO DO Document_Number CSL GoodsAccepted GoodsReceived CCMInvoice
0 3620556930 9001724124 4001458660 8001721322 1500017748 False False True False
1 3620556930 9001724124 4001458660 8001721322 1500017748 False True False False
2 3620556930 9001724124 4001458660 8001721322 1500017748 False False True False
3 3620556930 9001723883 4001458865 8001721037 1500017540 False False True False
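One thing worth checking before merging: in the previews in the question, df1['SO'] prints as 4001458660.0 (with nan for missing values) while df2['Inv_SO_DO'] prints as 4001458660 (with blanks), which suggests the key columns may not share a dtype. The following is a minimal sketch of one way to align them, assuming both columns hold whole numbers stored as text or floats; normalize_key is a hypothetical helper, not part of pandas, and the sample values are copied from the previews.

```python
import pandas as pd

def normalize_key(s: pd.Series) -> pd.Series:
    """Hypothetical helper: coerce a key column to nullable integers.

    Anything that cannot be parsed as a number (blank, 'nan') becomes <NA>.
    """
    return pd.to_numeric(s, errors="coerce").astype("Int64")

# Small stand-ins for the real frames, using text keys to mimic the mismatch.
df1 = pd.DataFrame({"SO": ["4001458660.0", "nan", "4001458865.0"]})
df2 = pd.DataFrame({
    "Inv_SO_DO": ["4001458660", "", "4001458865"],
    "GR_GA": ["Goods Received", "CSL", "Goods Accepted"],
})

df1["SO"] = normalize_key(df1["SO"])
df2["Inv_SO_DO"] = normalize_key(df2["Inv_SO_DO"])

# Drop missing keys first: pandas matches null keys to each other in a merge.
merged = df1.dropna(subset=["SO"]).merge(
    df2.dropna(subset=["Inv_SO_DO"]),
    left_on="SO", right_on="Inv_SO_DO", how="inner",
)
print(merged)
```

If your key columns already have matching integer dtypes, this step is unnecessary.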
Again, note that because each SO and DO in the example df1 that you provided can match more than one row in df2, perhaps you will need to add some custom logic on how to merge.
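As one possible form of that custom logic, here is a rough sketch that keeps exactly one row per df1 row and sets a flag to True if any matching df2 row carries the corresponding label. It uses isin() rather than a merge. The mapping from GR/GA/Inv to the GR_GA labels and the choice of df1 columns to check (including Sales Document for invoices) are assumptions based on the previews in the question, so adjust them to your data.

```python
import pandas as pd

# Tiny stand-ins for df1 and df2 (values copied from the previews above).
df1 = pd.DataFrame({
    "Sales Document": [9001724124, 9001723883, 9001723884, 9001721908],
    "SO": [4001458660, 4001458865, 4001459374, 4001457724],
    "DO": [8001721322, 8001721037, 8001721038, 8001719173],
})
df2 = pd.DataFrame({
    "GR_GA": ["Goods Received", "Goods Accepted", "CCM Invoice",
              "Goods Received", "Goods Received"],
    "Inv_SO_DO": [4001457724, 4001457724, 9001721908,
                  4001458660, 4001458865],
})

# Assumed mapping from each new df1 column to a df2["GR_GA"] label.
labels = {"GR": "Goods Received", "GA": "Goods Accepted", "Inv": "CCM Invoice"}
# Assumed df1 columns whose values may appear in df2["Inv_SO_DO"].
key_cols = ["Sales Document", "SO", "DO"]

for new_col, label in labels.items():
    # All document numbers that df2 lists under this label.
    seen = df2.loc[df2["GR_GA"] == label, "Inv_SO_DO"].tolist()
    # True if any key column of the row appears among those numbers.
    df1[new_col] = df1[key_cols].isin(seen).any(axis=1)

print(df1[["SO", "GR", "GA", "Inv"]])
```

Because each flag is computed with any(), multiple matching rows in df2 collapse into a single True, so df1 keeps its original length.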