Spark SQL query for selecting values from two columns if the second column's value is present in the first column

Input:

col_a   col_b
 A        B
 D        B
 B        E
 C        A

I am trying to get the following output using Spark SQL, but I have been unable to get it with NOT EXISTS or a left outer join. Please help me get the following output.

col_a    col_b
 A         B
 D         B
 C         A

I want to select the values of both columns for every row whose col_b value is present somewhere in col_a.

Supposing that your columns aren't too large, I would do something like this:

scala> val df = Seq(("A", "B"), ("D", "B"), ("B", "E"), ("C", "A")).toDF("col_a", "col_b")
df: org.apache.spark.sql.DataFrame = [col_a: string, col_b: string]

scala> df.show
+-----+-----+
|col_a|col_b|
+-----+-----+
|    A|    B|
|    D|    B|
|    B|    E|
|    C|    A|
+-----+-----+

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import scala.collection.mutable.HashSet
import scala.collection.mutable.HashSet

scala> val col_a_vals = df.rdd.map{case Row(a: String, b: String) => a}.collect.toSeq
col_a_vals: Seq[String] = WrappedArray(A, D, B, C)

scala> val col_a_set = HashSet(col_a_vals :_*)
col_a_set: scala.collection.mutable.HashSet[String] = Set(B, C, D, A)

scala> val broad_set = sc.broadcast(col_a_set)
broad_set: org.apache.spark.broadcast.Broadcast[scala.collection.mutable.HashSet[String]] = Broadcast(56)

scala> val contains_col_a = udf((value: String) => broad_set.value.contains(value))
contains_col_a: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,List(StringType))

scala> df.filter(contains_col_a($"col_b")).show
+-----+-----+
|col_a|col_b|
+-----+-----+
|    A|    B|
|    D|    B|
|    C|    A|
+-----+-----+
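
If col_a is too large to collect to the driver and broadcast, a self semi join may be worth trying instead. The following is only a sketch under assumptions: the view name `t`, the aliases `l` and `r`, and the local `SparkSession` setup are illustrative and not part of the original question.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists
// and getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("semi-join-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(("A", "B"), ("D", "B"), ("B", "E"), ("C", "A")).toDF("col_a", "col_b")

// "t" is a hypothetical view name chosen for this example.
df.createOrReplaceTempView("t")

// Keep each row of the left side whose col_b matches at least one col_a value.
// A semi join returns only left-side columns and never duplicates left rows.
val result = spark.sql(
  """SELECT l.col_a, l.col_b
    |FROM t l
    |LEFT SEMI JOIN t r
    |  ON l.col_b = r.col_a""".stripMargin)

result.show()
```

For the sample data this should return the same three rows as the UDF version, though the row order is not guaranteed. Spark 2.0 and later should also accept the subquery form `SELECT * FROM t WHERE col_b IN (SELECT col_a FROM t)`, which expresses the same filter.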


