
Check if value from one dataframe column exists in another dataframe column using Spark Scala

I have 2 dataframes, df1 and df2.

df1 has a column Name with values like a, b, c, etc.; df2 has a column Id with values like a, b.

If a Name value in df1 has a match in the Id column of df2, the match status should be 0. If there is no match, the match status should be 1. I know that I can put the df2 Id column into a collection using collect and then check whether the Name column in df1 has a matching entry.

val df1 = Seq("Rey", "John").toDF("Name")
val df2 = Seq("Rey").toDF("Id")

val collect = df2.select("Id").map(r => r.getString(0)).collect.toList 

something like:

    val df3 =
      df1.withColumn("match_sts", when(df1("Name").isin(collect: _*), 0).otherwise(1))

Expected output:

+----+---------+
|Name|match_sts|
+----+---------+
| Rey|        0|
|John|        1|
+----+---------+

But I don't want to use collect here. Is there any alternate approach available?

collect is not what you want, but the DF column --> list conversion is a well-known pattern. If the list is not huge, the following works; you can also broadcast the inlist:

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .map on a DataFrame

val df1 = Seq("Rey", "John", "Donald", "Trump").toDF("Name")
val df2 = Seq("Rey", "Donald").toDF("Id")

val inlist = df2.select("Id").map(r => r.getString(0)).collect.toList

val df3 = df1.withColumn("match_sts", when(df1("Name").isin(inlist: _*), 0).otherwise(1))
df3.show(false)
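To make the "you can also broadcast the inlist" remark concrete, here is a minimal sketch using an explicit broadcast variable plus a UDF. The names (`bcIds`, `matchSts`) are illustrative, not part of the original answer; with a small list, plain `isin` as above is usually simpler.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.appName("BroadcastIsin").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq("Rey", "John", "Donald", "Trump").toDF("Name")
val df2 = Seq("Rey", "Donald").toDF("Id")

// Collect the small side once on the driver, then broadcast it so each
// executor holds a single read-only copy instead of one per task closure.
val bcIds = spark.sparkContext.broadcast(df2.as[String].collect().toSet)

// 0 when the Name exists in the broadcast set, 1 otherwise
val matchSts = udf((name: String) => if (bcIds.value.contains(name)) 0 else 1)

val df3 = df1.withColumn("match_sts", matchSts(col("Name")))
df3.show(false)
```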

Even the classical examples that use stopwords from a file to filter output do this:

val stopWords = stopWordsInput.flatMap(x => x.split(" ")).map(_.trim).collect.toSet

and broadcast it to the workers if it is too big. For scale, 1 lakh is 100,000 entries.

Another approach is Spark SQL, relying on Catalyst to optimize the query when EXISTS is used:

import spark.implicits._ 
import org.apache.spark.sql.functions._

val df1 = Seq("Rey", "John", "Donald", "Trump").toDF("Name")
val df2 = Seq("Rey", "Donald").toDF("Id") // This can be read from file and split etc.

// Optimizer converts to better physical plan for performance in general
df1.createOrReplaceTempView("searchlist") 
df2.createOrReplaceTempView("inlist")    
val df3 = spark.sql("""SELECT Name, 0 AS match_sts
                         FROM searchlist A
                        WHERE EXISTS (SELECT B.Id FROM inlist B WHERE B.Id = A.Name)
                        UNION ALL
                       SELECT Name, 1 AS match_sts
                         FROM searchlist A
                        WHERE NOT EXISTS (SELECT B.Id FROM inlist B WHERE B.Id = A.Name)
                    """)
df3.show(false)
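The same EXISTS semantics can also be expressed with the DataFrame API as a left outer join, which avoids collect entirely; Catalyst will typically plan the small side as a broadcast hash join on its own. A sketch under the same df1/df2 as above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder.appName("LeftJoinMatch").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq("Rey", "John", "Donald", "Trump").toDF("Name")
val df2 = Seq("Rey", "Donald").toDF("Id")

// Left outer join keeps every Name; Id is null where no match was found,
// so null => match_sts 1, non-null => match_sts 0.
val df3 = df1
  .join(df2, df1("Name") === df2("Id"), "left_outer")
  .withColumn("match_sts", when(col("Id").isNull, 1).otherwise(0))
  .drop("Id")

df3.show(false)
```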
