
How to compare Spark dataframe columns with another dataframe column values

I have two dataframes, as below:

df1: This one has a few values, as shown below, and is dynamic.

+--------------+
|tags          |
+--------------+
|first_name    |
|last_name     |
|primary_email |
|other_email   |
+--------------+

df2: The second dataframe has a few pre-defined combinations, as below:

+---------------------------------------------------------------------------------------------+
|combinations                                                                                 |
+---------------------------------------------------------------------------------------------+
|last_name, first_name, primary_email                                                         |
|last_name, first_name, other_email                                                           |
|last_name, primary_email, primary_phone                                                      |
|last_name, primary_email, secondary_phone                                                    |
|last_name, address_line1, address_line2,city_name, state_name,postal_code, country_code, guid|
+---------------------------------------------------------------------------------------------+

Expected result DF: Now, I want to find out which valid combinations I can make from my dataframe. The result should contain every combination from df2 whose tags are all present in df1.

resultDF:

+---------------------------------------------------------------------------------------------+
|combinations                                                                                 |
+---------------------------------------------------------------------------------------------+
|last_name, first_name, primary_email                                                         |
|last_name, first_name, other_email                                                           |
+---------------------------------------------------------------------------------------------+

I tried converting both dataframes into lists and comparing them, but I always get 0 combinations.

The Scala code I tried:

val combinationList = combinations.map(r => r.getString(0)).collect.toList

var combList: Seq[Seq[String]] = Seq.empty

for (comb <- combinationList) {
  var tmp: Seq[String] = Seq.empty
  tmp = tmp :+ comb
  combList = combList :+ tmp
}

val result = combList.filter(
  list => df1.filter(df1.col("tags").isin(list: _*)).count == list.size
)

println(result.size)

This always returns 0, but the answer should be 2.

Can someone guide me on the best approach?
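
For what it's worth, the collect-based attempt most likely returns 0 because each combination string is appended to combList whole, so isin ends up comparing the entire comma-joined string (e.g. "last_name, first_name, primary_email") against individual tag values and never matches. A minimal fix for that sketch, assuming comma-separated combinations and distinct tags in df1:

// Split each combination string into individual tags before comparing.
val combList: Seq[Seq[String]] =
  combinations.map(r => r.getString(0)).collect.toSeq
    .map(_.split(",").map(_.trim).toSeq)

val result = combList.filter { list =>
  // A combination is valid when every one of its tags exists in df1.
  df1.filter(df1.col("tags").isin(list: _*)).count == list.size
}

println(result.size) // 2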

Try this. Collect your df1 and add a new array column to df2 containing df1's values. If you are on Spark 2.4, compare the two arrays with array_except, which returns the elements of the first array that are missing from the second. Then filter the rows where the size of that difference is 0.
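
One note before the demo: the session below builds combinations directly as an array<string> column. If your real df2 stores comma-separated strings, as shown in the question, split them into arrays first (a small sketch, assuming a comma separator with optional whitespace):

import org.apache.spark.sql.functions.{col, split}

// "last_name, first_name, primary_email" -> [last_name, first_name, primary_email]
val df2Arrays = df2.withColumn("combinations", split(col("combinations"), ",\\s*"))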

scala> val df1 = Seq(
     |   "first_name",
     |   "last_name",
     |   "primary_email",
     |   "other_email" 
     | ).toDF("tags")
df1: org.apache.spark.sql.DataFrame = [tags: string]

scala> 

scala> val df2 = Seq(
     | Seq("last_name", "first_name", "primary_email"),                                                         
     | Seq("last_name", "first_name", "other_email"),
     | Seq("last_name", "primary_email", "primary_phone"),                                                      
     | Seq("last_name", "primary_email", "secondary_phone"),
     | Seq("last_name", "address_line1", "address_line2", "city_name", "state_name", "postal_code", "country_code", "guid")
     | ).toDF("combinations")
df2: org.apache.spark.sql.DataFrame = [combinations: array<string>]

scala> 

scala> df1.show(false)
+-------------+
|tags         |
+-------------+
|first_name   |
|last_name    |
|primary_email|
|other_email  |
+-------------+


scala> 

scala> df2.show(false)
+-------------------------------------------------------------------------------------------------+
|combinations                                                                                     |
+-------------------------------------------------------------------------------------------------+
|[last_name, first_name, primary_email]                                                           |
|[last_name, first_name, other_email]                                                             |
|[last_name, primary_email, primary_phone]                                                        |
|[last_name, primary_email, secondary_phone]                                                      |
|[last_name, address_line1, address_line2, city_name, state_name, postal_code, country_code, guid]|
+-------------------------------------------------------------------------------------------------+


scala> 

scala> val df1tags = df1.collect.map(r => r.getString(0))
df1tags: Array[String] = Array(first_name, last_name, primary_email, other_email)

scala> 

scala> val df3 = df2.withColumn("tags", lit(df1tags))
df3: org.apache.spark.sql.DataFrame = [combinations: array<string>, tags: array<string>]

scala> df3.show(false)
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+
|combinations                                                                                     |tags                                               |
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+
|[last_name, first_name, primary_email]                                                           |[first_name, last_name, primary_email, other_email]|
|[last_name, first_name, other_email]                                                             |[first_name, last_name, primary_email, other_email]|
|[last_name, primary_email, primary_phone]                                                        |[first_name, last_name, primary_email, other_email]|
|[last_name, primary_email, secondary_phone]                                                      |[first_name, last_name, primary_email, other_email]|
|[last_name, address_line1, address_line2, city_name, state_name, postal_code, country_code, guid]|[first_name, last_name, primary_email, other_email]|
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+


scala> 

scala> val df4 = df3.withColumn("combMinusTags", array_except($"combinations", $"tags"))
df4: org.apache.spark.sql.DataFrame = [combinations: array<string>, tags: array<string> ... 1 more field]

scala> df4.show(false)
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------------------+
|combinations                                                                                     |tags                                               |combMinusTags                                                                         |
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------------------+
|[last_name, first_name, primary_email]                                                           |[first_name, last_name, primary_email, other_email]|[]                                                                                    |
|[last_name, first_name, other_email]                                                             |[first_name, last_name, primary_email, other_email]|[]                                                                                    |
|[last_name, primary_email, primary_phone]                                                        |[first_name, last_name, primary_email, other_email]|[primary_phone]                                                                       |
|[last_name, primary_email, secondary_phone]                                                      |[first_name, last_name, primary_email, other_email]|[secondary_phone]                                                                     |
|[last_name, address_line1, address_line2, city_name, state_name, postal_code, country_code, guid]|[first_name, last_name, primary_email, other_email]|[address_line1, address_line2, city_name, state_name, postal_code, country_code, guid]|
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------------------+


scala> 

scala> 

scala> df4.filter(size($"combMinusTags") === 0).show(false)
+--------------------------------------+---------------------------------------------------+-------------+
|combinations                          |tags                                               |combMinusTags|
+--------------------------------------+---------------------------------------------------+-------------+
|[last_name, first_name, primary_email]|[first_name, last_name, primary_email, other_email]|[]           |
|[last_name, first_name, other_email]  |[first_name, last_name, primary_email, other_email]|[]           |
+--------------------------------------+---------------------------------------------------+-------------+
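
For reference, here is the Spark 2.4 session above condensed into a single pipeline (the same calls, with a final select to drop the helper columns; assumes import spark.implicits._ for the $ syntax):

import org.apache.spark.sql.functions.{array_except, lit, size}

val df1tags = df1.collect.map(_.getString(0))

val resultDF = df2
  .withColumn("tags", lit(df1tags))                                     // attach df1's tags to every row
  .withColumn("combMinusTags", array_except($"combinations", $"tags"))  // tags a combination needs but df1 lacks
  .filter(size($"combMinusTags") === 0)                                 // keep combinations fully covered by df1
  .select("combinations")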


Spark 2.3

Spark 2.3 does not have array_except, so write your own array_except function as a UDF:

scala> def array_expt[T](a: Seq[T], b:Seq[T]):Seq[T] = {
     |   a.diff(b)
     | } 
array_expt: [T](a: Seq[T], b: Seq[T])Seq[T]

scala> 

scala> val myUdf = udf { array_expt[String] _ }
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true), ArrayType(StringType,true))))

scala> 

scala> val df4 = df3.withColumn("combMinusTags", myUdf($"combinations", $"tags"))
df4: org.apache.spark.sql.DataFrame = [combinations: array<string>, tags: array<string> ... 1 more field]

scala> df4.show(false)
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------------------+
|combinations                                                                                     |tags                                               |combMinusTags                                                                         |
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------------------+
|[last_name, first_name, primary_email]                                                           |[first_name, last_name, primary_email, other_email]|[]                                                                                    |
|[last_name, first_name, other_email]                                                             |[first_name, last_name, primary_email, other_email]|[]                                                                                    |
|[last_name, primary_email, primary_phone]                                                        |[first_name, last_name, primary_email, other_email]|[primary_phone]                                                                       |
|[last_name, primary_email, secondary_phone]                                                      |[first_name, last_name, primary_email, other_email]|[secondary_phone]                                                                     |
|[last_name, address_line1, address_line2, city_name, state_name, postal_code, country_code, guid]|[first_name, last_name, primary_email, other_email]|[address_line1, address_line2, city_name, state_name, postal_code, country_code, guid]|
+-------------------------------------------------------------------------------------------------+---------------------------------------------------+--------------------------------------------------------------------------------------+


scala> 

scala> df4.filter(size($"combMinusTags") === 0).show(false)
+--------------------------------------+---------------------------------------------------+-------------+
|combinations                          |tags                                               |combMinusTags|
+--------------------------------------+---------------------------------------------------+-------------+
|[last_name, first_name, primary_email]|[first_name, last_name, primary_email, other_email]|[]           |
|[last_name, first_name, other_email]  |[first_name, last_name, primary_email, other_email]|[]           |
+--------------------------------------+---------------------------------------------------+-------------+
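
If the final output should be the comma-separated strings shown in the question rather than arrays, the filtered rows can be joined back into a single string column (a sketch using concat_ws, which concatenates the elements of an array<string> column with the given separator):

import org.apache.spark.sql.functions.{concat_ws, size}

val resultDF = df4
  .filter(size($"combMinusTags") === 0)
  .select(concat_ws(", ", $"combinations").as("combinations"))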


