
Create new columns from values of other columns in Scala Spark

I have an input dataframe:

inputDF =

+--------------------------+-----------------------------+
| info (String)            |   chars (Seq[String])       |
+--------------------------+-----------------------------+
|weight=100,height=70      | [weight,height]             |
+--------------------------+-----------------------------+
|weight=92,skinCol=white   | [weight,skinCol]            |
+--------------------------+-----------------------------+
|hairCol=gray,skinCol=white| [hairCol,skinCol]           |
+--------------------------+-----------------------------+

How can I transform this dataframe into the output below? I don't know in advance which strings the chars column will contain.

outputDF =

+--------------------------+-----------------------------+-------+-------+-------+-------+
| info (String)            |   chars (Seq[String])       | weight|height |skinCol|hairCol|
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=100,height=70      | [weight,height]             |  100  | 70    | null  |null   |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=92,skinCol=white   | [weight,skinCol]            |  92   |null   |white  |null   |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|hairCol=gray,skinCol=white| [hairCol,skinCol]           |null   |null   |white  |gray   |
+--------------------------+-----------------------------+-------+-------+-------+-------+

I would also like to save the following Seq[String] as a variable, without calling .collect() on the dataframe:

val aVariable: Seq[String] = Seq("weight", "height", "skinCol", "hairCol")

You can create another dataframe by pivoting on the keys of the info column, then join it back using an id column:

import spark.implicits._
import org.apache.spark.sql.functions._

val data = Seq(
  ("weight=100,height=70", Seq("weight", "height")),
  ("weight=92,skinCol=white", Seq("weight", "skinCol")),
  ("hairCol=gray,skinCol=white", Seq("hairCol", "skinCol"))
)

// Add a synthetic id so the pivoted rows can be joined back to the originals.
val df = spark.sparkContext.parallelize(data).toDF("info", "chars")
  .withColumn("id", monotonically_increasing_id() + 1)

// Split each "key=value" pair into its own row, then pivot the keys into columns.
val pivotDf = df
  .withColumn("tmp", split(col("info"), ","))
  .withColumn("tmp", explode(col("tmp")))
  .withColumn("val1", split(col("tmp"), "=")(0))
  .withColumn("val2", split(col("tmp"), "=")(1))
  .select("id", "val1", "val2")
  .groupBy("id").pivot("val1").agg(first(col("val2")))

df.join(pivotDf, Seq("id"), "left").drop("id").show(false)


+--------------------------+------------------+-------+------+-------+------+
|info                      |chars             |hairCol|height|skinCol|weight|
+--------------------------+------------------+-------+------+-------+------+
|weight=100,height=70      |[weight, height]  |null   |70    |null   |100   |
|hairCol=gray,skinCol=white|[hairCol, skinCol]|gray   |null  |white  |null  |
|weight=92,skinCol=white   |[weight, skinCol] |null   |null  |white  |92    |
+--------------------------+------------------+-------+------+-------+------+
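Conceptually, the split/explode/pivot pipeline turns each info string into a key-to-value map and then widens the union of keys into columns. As a minimal plain-Scala sketch of that per-row parsing (sample data taken from the question; `parseInfo` is a hypothetical helper, not part of the Spark code above):

```scala
// Parse one "info" string into a key -> value map, mirroring what
// split(info, ",") followed by split(pair, "=") does per exploded row.
def parseInfo(info: String): Map[String, String] =
  info.split(",").map { pair =>
    val Array(k, v) = pair.split("=", 2)
    k -> v
  }.toMap

val parsed = parseInfo("weight=100,height=70")
println(parsed("weight")) // 100
```

The pivot then fills `null` for every key a given row's map does not contain, which is why the unmatched cells in the output are null.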

For the second question, you can get those values inside a dataframe like this:

df.withColumn("tmp", explode(split(col("info"), ",")))
  .withColumn("values", split(col("tmp"), "=")(0)).select("values").distinct().show()

+-------+
| values|
+-------+
| height|
|hairCol|
|skinCol|
| weight|
+-------+

But you cannot get them into a Seq variable without using collect; that is not possible, because the values live in a distributed dataframe and a local Seq requires bringing them to the driver.
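For intuition only: once the rows are local on the driver (which is exactly what collect does), extracting the distinct keys is the same split logic applied to an ordinary collection. A sketch using the question's sample rows as a plain Seq:

```scala
// The three "info" strings from the question, as a local collection.
val infos = Seq(
  "weight=100,height=70",
  "weight=92,skinCol=white",
  "hairCol=gray,skinCol=white"
)

// Key of every "key=value" pair, deduplicated in first-seen order --
// the Seq[String] the question asks for, once the data is local.
val keys: Seq[String] =
  infos.flatMap(_.split(",")).map(_.split("=")(0)).distinct

println(keys) // List(weight, height, skinCol, hairCol)
```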
