
How to compare two data frames in PySpark

c = df[df['CUSTOMER_EMAIL_ID'].isin(d.CUSTOMER_EMAIL_ID)]

How can I write the same expression in PySpark?

If you're asking for all the rows from df whose CUSTOMER_EMAIL_ID value also appears in the CUSTOMER_EMAIL_ID column of d, then I think your question can be answered with a semi join, specifically:

c = df.join(d, 'CUSTOMER_EMAIL_ID', 'leftsemi')

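A left (right) semi join can be thought of conceptually as an inner join followed by dropping the right (left) columns, with one difference: each row on the kept side appears at most once in the result, no matter how many rows on the other side match it.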
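To make the semantics concrete, here is a minimal plain-Python sketch (not Spark code, and not how Spark implements the join) that contrasts a left semi join with an inner join. The helper functions, the sample rows, and the extra `name` and `order` columns are all made up for illustration; only CUSTOMER_EMAIL_ID, df and d come from the question.

```python
# Plain-Python illustration of left semi join semantics.
# The helpers and the sample data below are hypothetical.

def left_semi_join(left, right, key):
    """Keep each left row whose key value appears at least once in right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def inner_join(left, right, key):
    """Emit one merged row per matching (left, right) pair."""
    return [{**r, **l} for l in left for r in right if l[key] == r[key]]


df = [
    {"CUSTOMER_EMAIL_ID": "a@example.com", "name": "Ann"},
    {"CUSTOMER_EMAIL_ID": "b@example.com", "name": "Bob"},
]
d = [
    # Two rows share the same email, so an inner join duplicates Ann.
    {"CUSTOMER_EMAIL_ID": "a@example.com", "order": 1},
    {"CUSTOMER_EMAIL_ID": "a@example.com", "order": 2},
]

semi = left_semi_join(df, d, "CUSTOMER_EMAIL_ID")
inner = inner_join(df, d, "CUSTOMER_EMAIL_ID")

print(semi)        # only Ann's row, once, with df's columns only
print(len(inner))  # 2: one row per matching pair
```

This mirrors what the pandas `isin` filter does: rows of df are kept or dropped, never duplicated, and no columns from d are added.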


 