How to subtract DataFrames using a subset of columns in Apache Spark

How can I filter Dataframe1 using Dataframe2? I want to remove the rows from Dataframe1 that satisfy the matching condition below:

Dataframe1.col1 = Dataframe2.col1
Dataframe1.col2 = Dataframe2.col2

My question is different from subtracting two DataFrames: subtract compares all columns, whereas I want to match on only a limited number of columns.

Join with "left_anti":

scala> df1.show
+----+-----+-----+
|col1| col2| col3|
+----+-----+-----+
|   1|  one|   ek|
|   2|  two|  dho|
|   3|three|theen|
|   4| four|chaar|
+----+-----+-----+


scala> df2.show
+----+----+-----+
|col1|col2| col3|
+----+----+-----+
|   2| two|  dho|
|   4|four|chaar|
+----+----+-----+


scala> df1.join(df2, Seq("col1", "col2"), "left_anti").show
+----+-----+-----+
|col1| col2| col3|
+----+-----+-----+
|   1|  one|   ek|
|   3|three|theen|
+----+-----+-----+
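
To make the semantics concrete, here is a minimal plain-Python sketch (not Spark code) of what a left anti join on a subset of columns computes: each left row is kept only if its key columns match no row on the right. The helper name `left_anti` and the dict-based rows are assumptions made for illustration. The sketch also ignores Spark's null handling, where null keys do not compare as equal.

```python
def left_anti(left_rows, right_rows, keys):
    """Keep rows of left_rows whose values for `keys` match no row in right_rows."""
    # Collect the key tuples present on the right side once, for fast lookup.
    right_keys = {tuple(row[k] for k in keys) for row in right_rows}
    return [row for row in left_rows
            if tuple(row[k] for k in keys) not in right_keys]


# Same data as df1 and df2 above, modeled as lists of dicts.
df1_rows = [
    {"col1": 1, "col2": "one", "col3": "ek"},
    {"col1": 2, "col2": "two", "col3": "dho"},
    {"col1": 3, "col2": "three", "col3": "theen"},
    {"col1": 4, "col2": "four", "col3": "chaar"},
]
df2_rows = [
    {"col1": 2, "col2": "two", "col3": "dho"},
    {"col1": 4, "col2": "four", "col3": "chaar"},
]

# Only col1 and col2 take part in the match; col3 is carried along untouched.
result = left_anti(df1_rows, df2_rows, ["col1", "col2"])
for row in result:
    print(row)
```

The rows with col1 equal to 1 and 3 remain, which agrees with the Scala output above.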

Possible duplicate of: Spark: subtract two DataFrames if both datasets have exact same columns

If you want a custom join condition, you can use an "anti" join. Here is the PySpark version.

Creating two data frames:

Dataframe1:

l1 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row3', 30)]
df1 = spark.createDataFrame(l1).toDF('col1','col2')

df1.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row1|  10|
|col1_row2|  20|
|col1_row3|  30|
+---------+----+

Dataframe2:

l2 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row4', 40)]
df2 = spark.createDataFrame(l2).toDF('col1','col2')
df2.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row1|  10|
|col1_row2|  20|
|col1_row4|  40|
+---------+----+

Using the subtract API:

df_final = df1.subtract(df2)
df_final.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row3|  30|
+---------+----+

Using left_anti:

Join condition:

join_condition = [df1["col1"] == df2["col1"], df1["col2"] == df2["col2"]]

Finally, join:

df_final = df1.join(df2, join_condition, 'left_anti')
df_final.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row3|  30|
+---------+----+
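
Both approaches return the same row here only because the matching rows agree in every column. The plain-Python sketch below (not Spark code) models the two behaviours on a variant of the data to show where they diverge. The function names `subtract_all_columns` and `anti_join_on` are hypothetical, and the changed value 99 in the second list is an assumption introduced purely for illustration.

```python
def subtract_all_columns(left_rows, right_rows):
    """Model of a whole-row set difference: distinct left rows absent from right."""
    right_set = set(right_rows)
    seen = set()
    out = []
    for row in left_rows:
        if row not in right_set and row not in seen:
            seen.add(row)
            out.append(row)
    return out


def anti_join_on(left_rows, right_rows, key_idx):
    """Model of a left anti join that compares only the columns at key_idx."""
    right_keys = {tuple(r[i] for i in key_idx) for r in right_rows}
    return [r for r in left_rows
            if tuple(r[i] for i in key_idx) not in right_keys]


l1 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row3', 30)]
# Illustrative variant of l2: col2 of the second row differs from l1 (99, not 20).
l2_variant = [('col1_row1', 10), ('col1_row2', 99), ('col1_row4', 40)]

by_all_columns = subtract_all_columns(l1, l2_variant)
by_col1_only = anti_join_on(l1, l2_variant, [0])

print(by_all_columns)  # [('col1_row2', 20), ('col1_row3', 30)]
print(by_col1_only)    # [('col1_row3', 30)]
```

Comparing whole rows keeps ('col1_row2', 20) because its col2 value differs, while matching on col1 alone removes it. That is the behaviour the question asks for, and it is why the anti join is the better fit when only some columns should decide the match.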
