
How to implement `except` in Apache Spark based on subset of columns?

I am working with two tables in Spark, table1 and table2:

scala> table1.printSchema
root
 |-- user_id: long (nullable = true)
 |-- item_id: long (nullable = true)
 |-- value: double (nullable = true)

scala> table2.printSchema
root
 |-- item_id: long (nullable = true)
 |-- user_id: long (nullable = true)
 |-- value: double (nullable = true)

However, I have created these two from different sources. Basically, each of them holds a value for a ( user_id , item_id ) pair; value is a floating-point data type and, as such, prone to floating-point errors. For example, (1, 3, 4) in one table can be stored as (1, 3, 3.9998..) in the other due to other calculations.
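For a concrete picture, here is a minimal, hypothetical reproduction of the two tables in the spark-shell (the data values are made up; they only illustrate the slight value mismatch for the same pair):

// Hypothetical reproduction of the two tables (values are made up).
// The spark-shell provides the implicits automatically; a standalone
// application needs this import.
import spark.implicits._

val table1 = Seq(
  (1L, 3L, 4.0),
  (2L, 5L, 1.5)
).toDF("user_id", "item_id", "value")

// Note the different column order and the slightly different value
// for the (user_id = 1, item_id = 3) pair.
val table2 = Seq(
  (3L, 1L, 3.9998),
  (9L, 8L, 2.0)
).toDF("item_id", "user_id", "value")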

I need to remove the rows from table1 whose ( user_id , item_id ) pair (guaranteed to be pair-wise unique) is also present in table2. Something like this:

scala> table1.except(table2)

However, there is no way to tell except which columns it should use to decide that two rows are the same, which in this case is just ( user_id , item_id ). I need it to disregard value for this comparison.

How to do this using spark-sql?

Using a leftanti join would be a possible solution. This will remove the rows from the left table that are present in the right table for the given key.

table1.join(table2, Seq("user_id", "item_id"), "leftanti")
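Since the question mentions spark-sql: the same anti-join can also be written in SQL. A sketch, assuming the two DataFrames are registered as temporary views (the view names t1 and t2 are made up):

// Register the DataFrames as temporary views (hypothetical names).
table1.createOrReplaceTempView("t1")
table2.createOrReplaceTempView("t2")

// LEFT ANTI JOIN keeps only the rows of t1 whose (user_id, item_id)
// pair has no match in t2; the value column is not compared.
val result = spark.sql("""
  SELECT t1.*
  FROM t1
  LEFT ANTI JOIN t2
    ON t1.user_id = t2.user_id AND t1.item_id = t2.item_id
""")
result.show()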

