Compare rows of an array column with the headers of another data frame using Scala and Spark
I am using Scala and Spark. I have two data frames.
The first one looks like this:
+------+------+-----------+
| num1 | num2 | arr |
+------+------+-----------+
| 25 | 10 | [a,c] |
| 35 | 15 | [a,b,d] |
+------+------+-----------+
The headers of the second data frame are:
num1, num2, a, b, c, d
I have created a case class containing all the possible header columns.
Now, matching on the columns num1 and num2, I need to check whether the array in the arr column contains each header of the second data frame. If it does, the value should be 1, else 0.
So the required output is:
+------+------+---+---+---+---+
| num1 | num2 | a | b | c | d |
+------+------+---+---+---+---+
| 25 | 10 | 1 | 0 | 1 | 0 |
| 35 | 15 | 1 | 1 | 0 | 1 |
+------+------+---+---+---+---+
If I understand correctly, you want to transform the array column arr into one column per possible value, containing whether or not the array contains that value. If so, you can use the array_contains function like this:
import org.apache.spark.sql.functions.{array_contains, col}
import spark.implicits._ // for .toDF and the 'arr column syntax

val df = Seq((25, 10, Seq("a", "c")), (35, 15, Seq("a", "b", "d")))
  .toDF("num1", "num2", "arr")
val values = Seq("a", "b", "c", "d")

df
  .select(Seq("num1", "num2").map(col) ++
    // array_contains returns a Boolean; cast it to get 1/0
    values.map(x => array_contains('arr, x).cast("int") as x): _*)
  .show
+----+----+---+---+---+---+
|num1|num2| a| b| c| d|
+----+----+---+---+---+---+
| 25| 10| 1| 0| 1| 0|
| 35| 15| 1| 1| 0| 1|
+----+----+---+---+---+---+
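If the set of possible values is not known up front, an alternative sketch (assuming a local SparkSession named spark) derives the columns from the data itself by exploding the array and pivoting the elements into columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((25, 10, Seq("a", "c")), (35, 15, Seq("a", "b", "d")))
  .toDF("num1", "num2", "arr")

// One row per array element, then pivot the elements into columns.
// count() yields 1 where an element was present for that (num1, num2)
// pair, and na.fill(0) replaces the missing combinations with 0.
val result = df
  .select($"num1", $"num2", explode($"arr") as "value")
  .groupBy("num1", "num2")
  .pivot("value")
  .count()
  .na.fill(0)

result.show()
```

Note that pivot scans the data to discover the distinct values, which costs an extra pass; when the values are known in advance, passing them explicitly with pivot("value", values) (or using the array_contains approach above) avoids it.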