Create new columns from the values of other columns in Scala Spark

Question

I have an input dataframe:

inputDF =

+--------------------------+-----------------------------+
| info (String)            |   chars (Seq[String])       |
+--------------------------+-----------------------------+
|weight=100,height=70      | [weight,height]             |
+--------------------------+-----------------------------+
|weight=92,skinCol=white   | [weight,skinCol]            |
+--------------------------+-----------------------------+
|hairCol=gray,skinCol=white| [hairCol,skinCol]           |
+--------------------------+-----------------------------+

How do I get this dataframe as output? I do not know in advance which strings the chars column contains.

outputDF =

+--------------------------+-----------------------------+-------+-------+-------+-------+
| info (String)            |   chars (Seq[String])       | weight|height |skinCol|hairCol|
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=100,height=70      | [weight,height]             |  100  | 70    | null  |null   |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=92,skinCol=white   | [weight,skinCol]            |  92   |null   |white  |null   |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|hairCol=gray,skinCol=white| [hairCol,skinCol]           |null   |null   |white  |gray   |
+--------------------------+-----------------------------+-------+-------+-------+-------+

I would also like to save the following Seq[String] as a variable, but without calling .collect() on the dataframes.

val aVariable: Seq[String] = Seq("weight", "height", "skinCol", "hairCol")

Answer 1

You can create another dataframe by pivoting on the keys of the info column, then join it back using an id column:

import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  ("weight=100,height=70", Seq("weight", "height")),
  ("weight=92,skinCol=white", Seq("weight", "skinCol")),
  ("hairCol=gray,skinCol=white", Seq("hairCol", "skinCol"))
)

// Tag every row with an id so the pivoted columns can be joined back later
val df = spark.sparkContext.parallelize(data).toDF("info", "chars")
  .withColumn("id", monotonically_increasing_id() + 1)

val pivotDf = df
  // "weight=100,height=70" -> one row per "key=value" pair
  .withColumn("tmp", split(col("info"), ","))
  .withColumn("tmp", explode(col("tmp")))
  // val1 holds the key, val2 holds the value
  .withColumn("val1", split(col("tmp"), "=")(0))
  .withColumn("val2", split(col("tmp"), "=")(1))
  .select("id", "val1", "val2")
  // one column per distinct key; keys missing from a row become null
  .groupBy("id").pivot("val1").agg(first(col("val2")))

df.join(pivotDf, Seq("id"), "left").drop("id").show(false)


+--------------------------+------------------+-------+------+-------+------+
|info                      |chars             |hairCol|height|skinCol|weight|
+--------------------------+------------------+-------+------+-------+------+
|weight=100,height=70      |[weight, height]  |null   |70    |null   |100   |
|hairCol=gray,skinCol=white|[hairCol, skinCol]|gray   |null  |white  |null  |
|weight=92,skinCol=white   |[weight, skinCol] |null   |null  |white  |92    |
+--------------------------+------------------+-------+------+-------+------+
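To make it easier to see what the split / explode / pivot chain computes, here is a rough sketch of the same logic on plain Scala collections, with no Spark involved. The names infos, parse, keys and rows are made up for this illustration and are not part of the answer's code; None stands in for the null cells in the table above.

```scala
// Hypothetical local copy of the info column, for illustration only
val infos = Seq(
  "weight=100,height=70",
  "weight=92,skinCol=white",
  "hairCol=gray,skinCol=white"
)

// Split on "," and then on "=", like the two split(...) calls in the Spark code.
// Pairs that do not have exactly one key and one value are skipped.
def parse(info: String): Map[String, String] =
  info.split(",")
    .map(_.split("=", 2))
    .collect { case Array(k, v) => k -> v }
    .toMap

val parsed: Seq[Map[String, String]] = infos.map(parse)

// The pivot columns are the distinct keys seen across all rows
val keys: Seq[String] = parsed.flatMap(_.keys).distinct.sorted

// One output row per input row, one cell per key; None where the key is absent
val rows: Seq[Seq[Option[String]]] = parsed.map(m => keys.map(m.get))
```

The Spark version does the same work in a distributed way: explode produces the key/value pairs, and pivot turns the distinct keys into columns.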

For your second question, you can get those values in a dataframe like this:

df.withColumn("tmp", explode(split(col("info"), ",")))
  .withColumn("values", split(col("tmp"), "=")(0)).select("values").distinct().show()

+-------+
| values|
+-------+
| height|
|hairCol|
|skinCol|
| weight|
+-------+

But you cannot get them into a Seq variable without using collect; that is simply not possible. A Seq lives on the driver, so some action has to bring the values back from the executors.
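If a single collect() on the (usually small) set of distinct keys is acceptable, the Seq from the question could be built roughly as follows. This is only a sketch: the local SparkSession and the one-column df are assumptions made to keep the snippet self-contained (in spark-shell, spark already exists and the df from above can be used), and the order of the collected keys is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

// Assumption: a local session for illustration; skip this in spark-shell
val spark = SparkSession.builder().master("local[*]").appName("keys").getOrCreate()
import spark.implicits._

// Assumption: a reduced copy of the input with only the info column
val df = Seq(
  "weight=100,height=70",
  "weight=92,skinCol=white",
  "hairCol=gray,skinCol=white"
).toDF("info")

// Distinct keys as a typed Dataset[String], then one collect() to the driver
val aVariable: Seq[String] = df
  .select(explode(split(col("info"), ",")).as("tmp"))
  .select(split(col("tmp"), "=")(0).as("key"))
  .distinct()
  .as[String]
  .collect()
  .toSeq
```

Once the keys are known, they can also be passed to the pivot explicitly, as in pivot("val1", aVariable), which fixes the set of output columns and spares Spark the extra pass it otherwise needs to find the distinct values itself.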
