
How to implement `except` in Apache Spark based on subset of columns?

I am working with two tables in Spark, table1 and table2:

scala> table1.printSchema
root
 |-- user_id: long (nullable = true)
 |-- item_id: long (nullable = true)
 |-- value: double (nullable = true)

scala> table2.printSchema
root
 |-- item_id: long (nullable = true)
 |-- user_id: long (nullable = true)
 |-- value: double (nullable = true)

However, I have created these two from different sources. Basically, each of them holds a value for a ( user_id , item_id ) pair; value is a floating-point data type and, as such, prone to floating-point errors. For example, (1, 3, 4) in one table can be stored as (1, 3, 3.9998..) in the other due to other calculations.
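For a concrete picture, here is a minimal, hypothetical reproduction of the two tables in the spark-shell (the data values are made up; they only illustrate the slight value mismatch for the same pair):

// Hypothetical reproduction of the two tables (values are made up).
// The spark-shell provides the implicits automatically; a standalone
// application needs this import.
import spark.implicits._

val table1 = Seq(
  (1L, 3L, 4.0),
  (2L, 5L, 1.5)
).toDF("user_id", "item_id", "value")

// Note the different column order and the slightly different value
// for the (user_id = 1, item_id = 3) pair.
val table2 = Seq(
  (3L, 1L, 3.9998),
  (9L, 8L, 2.0)
).toDF("item_id", "user_id", "value")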

I need to remove the rows from table1 whose ( user_id , item_id ) pair (guaranteed to be pair-wise unique) is also present in table2. Something like this:

scala> table1.except(table2)

However, there is no way to tell except which columns it should use to decide that two rows are the same, which in this case is just ( user_id , item_id ). I need it to disregard value for this comparison.

How to do this using spark-sql?

Using a leftanti join would be a possible solution. This will remove the rows from the left table that are present in the right table for the given key.

table1.join(table2, Seq("user_id", "item_id"), "leftanti")
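Since the question mentions spark-sql: the same anti-join can also be written in SQL. A sketch, assuming the two DataFrames are registered as temporary views (the view names t1 and t2 are made up):

// Register the DataFrames as temporary views (hypothetical names).
table1.createOrReplaceTempView("t1")
table2.createOrReplaceTempView("t2")

// LEFT ANTI JOIN keeps only the rows of t1 whose (user_id, item_id)
// pair has no match in t2; the value column is not compared.
val result = spark.sql("""
  SELECT t1.*
  FROM t1
  LEFT ANTI JOIN t2
    ON t1.user_id = t2.user_id AND t1.item_id = t2.item_id
""")
result.show()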

