Check if a value from one dataframe column exists in another dataframe column using Spark Scala
I have 2 dataframes, df1 and df2.
df1 has a column Name with values like a, b, c, etc.
df2 has a column Id with values like a, b.
If the Name column in df1 has a match in the Id column of df2, the match status should be 0; if there is no match, the match status should be 1. I know that I can pull the df2 Id column into a collection using collect and then check whether the Name column in df1 has a matching entry.
val df1 = Seq("Rey", "John").toDF("Name")
val df2 = Seq("Rey").toDF("Id")
val collect = df2.select("Id").map(r => r.getString(0)).collect.toList
something like,
val df3 =
df1.withColumn("match_sts", when(df1("Name").isin(collect: _*), 0).otherwise(1))
Expected output:
+----+---------+
|Name|match_sts|
+----+---------+
| Rey|        0|
|John|        1|
+----+---------+
But I don't want to use collect here. Is there any alternative approach available?
collect may not be what you want, but it is the well-known way to do a DataFrame column to list conversion. If the list is not huge, you can do the following (this actually works), and you can also broadcast the in-list:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .map over a Dataset

val df1 = Seq("Rey", "John", "Donald", "Trump").toDF("Name")
val df2 = Seq("Rey", "Donald").toDF("Id")
val inlist = df2.select("Id").map(r => r.getString(0)).collect.toList
// 0 = match, 1 = no match, per the expected output in the question
val df3 = df1.withColumn("match_status", when(df1("Name").isin(inlist: _*), 0).otherwise(1))
df3.show(false)
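If you want to avoid collect entirely, a left outer join keeps everything distributed. This is a sketch using the same example data and column names as above; depending on table statistics, Catalyst will pick a broadcast or sort-merge join on its own:

```scala
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

val df1 = Seq("Rey", "John", "Donald", "Trump").toDF("Name")
val df2 = Seq("Rey", "Donald").toDF("Id")

// Left-outer join: rows with no match in df2 get a null Id,
// which we map to match status 1 (0 = match, as in the question).
val df3 = df1
  .join(df2, df1("Name") === df2("Id"), "left_outer")
  .withColumn("match_status", when(col("Id").isNull, 1).otherwise(0))
  .drop("Id")
df3.show(false)
```

Nothing is pulled back to the driver here, so this scales to an arbitrarily large df2, unlike the collect-based approach.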
Even the classical examples that use stopwords from a file for filtering output do this:
val stopWords = stopWordsInput.flatMap(x => x.split(" ")).map(_.trim).collect.toSet
and broadcast it to the workers if it is too big. But I am not sure whether 1 lakh (100,000) entries counts as too big!
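If the in-list is large enough that you worry about shipping it with every task, you can broadcast it explicitly and check membership in a UDF. This is a sketch under the same example data; the UDF name and the 0/1 convention follow the question, not any built-in API:

```scala
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

val df1 = Seq("Rey", "John", "Donald", "Trump").toDF("Name")
val df2 = Seq("Rey", "Donald").toDF("Id")

// Collect once on the driver, then broadcast the Set to every executor.
val inlist = df2.select("Id").as[String].collect.toSet
val bcInlist = spark.sparkContext.broadcast(inlist)

// 0 = match, 1 = no match, per the expected output in the question.
val matchStatus = udf((name: String) => if (bcInlist.value.contains(name)) 0 else 1)
val df3 = df1.withColumn("match_status", matchStatus(col("Name")))
df3.show(false)
```

A Set gives O(1) lookups per row, and the broadcast ensures the data is shipped to each executor only once rather than with every task.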
Another approach is Spark SQL, relying on Catalyst to optimize the query when EXISTS is used:
import spark.implicits._
import org.apache.spark.sql.functions._
val df1 = Seq("Rey", "John", "Donald", "Trump").toDF("Name")
val df2 = Seq("Rey", "Donald").toDF("Id") // This can be read from file and split etc.
// Optimizer converts to better physical plan for performance in general
df1.createOrReplaceTempView("searchlist")
df2.createOrReplaceTempView("inlist")
val df3 = spark.sql("""SELECT Name, 0 AS match_status
                       FROM searchlist A
                       WHERE EXISTS (SELECT B.Id FROM inlist B WHERE B.Id = A.Name)
                       UNION
                       SELECT Name, 1 AS match_status
                       FROM searchlist A
                       WHERE NOT EXISTS (SELECT B.Id FROM inlist B WHERE B.Id = A.Name)
                    """)
df3.show(false)