
Compare rows of an array column with the headers of another data frame using Scala and Spark

I am using Scala and Spark. I have two data frames.

The first one looks like the following:

+------+------+-----------+
| num1 | num2 |    arr    |
+------+------+-----------+
|   25 |   10 | [a,c]     |
|   35 |   15 | [a,b,d]   |
+------+------+-----------+

In the second data frame, the headers are:

num1, num2, a, b, c, d

I have created a case class containing all the possible header columns.

Now, by matching the columns num1 and num2, I want to check whether the array in the arr column contains each header of the second data frame. If it does, the value should be 1, otherwise 0.

So the required output is:

+------+------+---+---+---+---+
| num1 | num2 | a | b | c | d |
+------+------+---+---+---+---+
|   25 |   10 | 1 | 0 | 1 | 0 |
|   35 |   15 | 1 | 1 | 0 | 1 |
+------+------+---+---+---+---+
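The per-row logic described above can be sketched in plain Scala, independent of Spark (the header list is taken from the question; `flags` is a hypothetical helper name):

```scala
// Plain-Scala sketch of the per-row logic: for each possible header,
// emit 1 if the row's array contains it, else 0.
val headers = Seq("a", "b", "c", "d")

def flags(arr: Seq[String]): Seq[Int] =
  headers.map(h => if (arr.contains(h)) 1 else 0)

flags(Seq("a", "c"))      // Seq(1, 0, 1, 0)
flags(Seq("a", "b", "d")) // Seq(1, 1, 0, 1)
```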

If I understand correctly, you want to transform the array column arr into one column per possible value, indicating whether or not the array contains that value.

If so, you can use the array_contains function, casting the resulting Boolean to an integer so the output shows 1/0:

import org.apache.spark.sql.functions.{array_contains, col}
import spark.implicits._ // assumes a SparkSession named `spark`

val df = Seq((25, 10, Seq("a", "c")), (35, 15, Seq("a", "b", "d")))
  .toDF("num1", "num2", "arr")

val values = Seq("a", "b", "c", "d")
df.select(Seq("num1", "num2").map(col) ++
          values.map(x => array_contains(col("arr"), x).cast("int").as(x)): _*)
  .show
+----+----+---+---+---+---+
|num1|num2|  a|  b|  c|  d|
+----+----+---+---+---+---+
|  25|  10|  1|  0|  1|  0|
|  35|  15|  1|  1|  0|  1|
+----+----+---+---+---+---+
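A small variation, in case the possible values should not be hardcoded: they can be derived from the second data frame's header by dropping the key columns. Here `df2Columns` is a stand-in for `df2.columns`, which would come from the actual second data frame:

```scala
// Sketch: derive the value columns from the second data frame's header
// instead of hardcoding Seq("a", "b", "c", "d").
val df2Columns = Seq("num1", "num2", "a", "b", "c", "d") // stand-in for df2.columns
val keyCols    = Seq("num1", "num2")
val values     = df2Columns.filterNot(keyCols.contains)  // Seq("a", "b", "c", "d")

// With Spark in scope, the select then becomes:
// df.select(keyCols.map(col) ++
//           values.map(v => array_contains(col("arr"), v).cast("int").as(v)): _*)
```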

