Spark Scala Delete rows in one RDD based on columns of another RDD
I'm very new to Scala and Spark and not sure how to start.
I have one RDD that looks like this:
1,2,3,11
2,1,4,12
1,4,5,13
3,5,6,12
Another that looks like this:
2,1
1,2
I want to filter the first RDD so that it deletes any rows whose first two columns match a row of the second RDD. The output should look like:
1,4,5,13
3,5,6,12
// input rdds
val rdd1 = spark.sparkContext.makeRDD(Seq((1,2,3,11), (2,1,4,12), (1,4,5,13), (3,5,6,12)))
val rdd2 = spark.sparkContext.makeRDD(Seq((1,2), (2,1)))
// manipulate the two RDDs as (key, value) pairs
// the key of the first RDD is a tuple of its first two fields; the value is the whole record
// the key of the second RDD is a tuple of its two fields; the value is just null
// then we can perform joins on their keys
val rdd1_key = rdd1.map(record => ((record._1, record._2), record))
val rdd2_key = rdd2.map(record => (record, null))
// 1. perform a left outer join; each record becomes (key, (val1, val2))
// 2. filter, keeping the records that found no join partner:
//    if there is no match, val2 is None; otherwise it is Some(null),
//    wrapping the null we hardcoded in the previous step
// 3. keep val1, the original record
rdd1_key.leftOuterJoin(rdd2_key)
  .filter(record => record._2._2.isEmpty)
  .map(record => record._2._1)
  .collect().foreach(println)
// result
(1,4,5,13)
(3,5,6,12)
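As a side note, a shorter alternative for this keep-only-unmatched pattern is subtractByKey, which removes from the first pair RDD every record whose key appears in the second. A minimal sketch, reusing rdd1_key and rdd2_key from above:
// subtractByKey keeps the pairs of rdd1_key whose key is absent from rdd2_key,
// then we unwrap back to the original 4-field records
rdd1_key.subtractByKey(rdd2_key)
  .map(record => record._2)
  .collect().foreach(println)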
Thanks
I personally prefer the dataframe/dataset way, as they are optimized forms of rdd with more inbuilt functions, and they are similar to traditional databases. The following is the dataframe way.
The first step is to convert both of the rdds to dataframes:
import spark.implicits._
val df1 = rdd1.toDF("col1", "col2", "col3", "col4")
val df2 = rdd2.toDF("col1", "col2")
The second step is to add a new column to dataframe2 for the filtering condition check:
import org.apache.spark.sql.functions._
val tempdf2 = df2.withColumn("check", lit("check"))
And the final step is to join the two dataframes, then filter and drop the unnecessary rows and columns. After the left join, the check column is null exactly for those rows of df1 that had no match in tempdf2, so filtering on isNull keeps the rows we want:
val finalDF = df1.join(tempdf2, Seq("col1", "col2"), "left")
.filter($"check".isNull)
.drop($"check")
You should have the final dataframe as:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|3 |5 |6 |12 |
|1 |4 |5 |13 |
+----+----+----+----+
Now you can either convert back to an rdd using finalDF.rdd, or continue your further processing with the dataframe itself.
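Side note: if your Spark version supports it (2.0 and later), a left anti join expresses the same logic in a single step, without the helper check column. A minimal sketch under that assumption (the name antiDF is just for illustration):
// left_anti keeps only the rows of df1 whose (col1, col2) pair has no match in df2
val antiDF = df1.join(df2, Seq("col1", "col2"), "left_anti")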
I hope the answer is helpful.