How to implement `except` in Apache Spark based on subset of columns?
I am working with two schemas in Spark, `table1` and `table2`:
scala> table1.printSchema
root
|-- user_id: long (nullable = true)
|-- item_id: long (nullable = true)
|-- value: double (nullable = true)
scala> table2.printSchema
root
|-- item_id: long (nullable = true)
|-- user_id: long (nullable = true)
|-- value: double (nullable = true)
However, I have created these two from different sources. Basically, each of them holds a `value` for a (`user_id`, `item_id`) pair; the value is a floating-point data type and, as such, is prone to floating-point errors. For example, (1, 3, 4) in one table may be stored as (1, 3, 3.9998..) in the other due to other calculations.
I need to remove rows with a (`user_id`, `item_id`) pair (guaranteed to be pair-wise unique) from `table1` that are also present in `table2`. Something like this:
scala> table1.except(table2)
However, there is no way to tell `except` when it should consider two rows to be the same, which in this case is just (`user_id`, `item_id`); I need it to disregard `value`.
How to do this using spark-sql?
Using a `leftanti` join would be a possible solution. This removes the rows from the left table that are present in the right table for the given key.
table1.join(table2, Seq("user_id", "item_id"), "leftanti")
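For completeness, here is a minimal, self-contained sketch of the anti join, assuming a local SparkSession and small hand-built DataFrames standing in for `table1` and `table2` (the data values are illustrative, not taken from the question):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("leftanti-demo").getOrCreate()
import spark.implicits._

// table1 holds (user_id, item_id, value); the second row has no key match in table2.
val table1 = Seq(
  (1L, 3L, 4.0),
  (2L, 5L, 1.5)
).toDF("user_id", "item_id", "value")

// table2 lists its columns in a different order, mirroring the schemas above;
// its only row shares the key (user_id = 1, item_id = 3) with table1 but
// carries a slightly different value due to floating-point drift.
val table2 = Seq(
  (3L, 1L, 3.9998)
).toDF("item_id", "user_id", "value")

// The anti join matches on the named key columns only, so the value
// difference is irrelevant: the (1, 3) row is dropped, the (2, 5) row survives.
table1.join(table2, Seq("user_id", "item_id"), "leftanti").show()
// +-------+-------+-----+
// |user_id|item_id|value|
// +-------+-------+-----+
// |      2|      5|  1.5|
// +-------+-------+-----+
Because the join condition is given as a sequence of column names, only `user_id` and `item_id` participate in the match, which is exactly the "except on a subset of columns" behavior the question asks for.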