How to compare two DataFrames in PySpark

Question

c = df[df['CUSTOMER_EMAIL_ID'].isin(d.CUSTOMER_EMAIL_ID)]

How do I write the same expression in PySpark?

Answer 1

If you're asking "give me all the rows from df where the CUSTOMER_EMAIL_ID field has a matching value in the CUSTOMER_EMAIL_ID field of d", then I think your question can be answered using a semi join, specifically:

c = df.join(d, 'CUSTOMER_EMAIL_ID', 'leftsemi')

A left (right) semi join can be thought of conceptually as an inner join followed by dropping the right (left) columns, except that each row from the kept side appears at most once no matter how many matches it has on the other side.

Solution 1
0 2017-05-15 16:18:48
