
Spark/scala - can we create new columns from an existing column value in a dataframe

I am trying to see if we can create new columns from the value in one of the columns of a DataFrame using Spark/Scala. I have a DataFrame with the following data in it:

df.show()

+---+-----------------------+
|id |allvals                |
+---+-----------------------+
|1  |col1,val11|col3,val31  |
|3  |col3,val33|col1,val13  |
|2  |col2,val22             |
+---+-----------------------+

In the above data, col1/col2/col3 are column names, each followed by its value. A column name and its value are separated by a comma (,), and the name/value pairs are separated by a pipe (|).
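To make the encoding concrete, here is a minimal plain-Scala sketch (my illustration, not part of the question) of how a single allvals string decomposes, assuming every pair is well formed:

val allvals = "col1,val11|col3,val31"

// split takes a regex, so the pipe must be escaped; each pair then
// splits on the comma into a (name, value) tuple
val pairs: Map[String, String] =
  allvals.split("\\|").map(_.split(",")).collect { case Array(k, v) => (k, v) }.toMap
// pairs == Map("col1" -> "val11", "col3" -> "val31")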

Now, I want to achieve this:

+---+----------------------+------+------+------+
|id |allvals               |col1  |col2  |col3  |
+---+----------------------+------+------+------+
|1  |col1,val11|col3,val31 |val11 |null  |val31 |
|3  |col3,val33|col1,val13 |val13 |null  |val33 |
|2  |col2,val22            |null  |val22 |null  |
+---+----------------------+------+------+------+

Appreciate any help.

You can convert the column to a Map with a udf:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "col1,val11|col3,val31"), (3, "col3,val33|col1,val13"), (2, "col2,val22")
).toDF("id", "allvals")

val to_map = udf((s: String) =>
  // split on the literal pipe, then on the comma; collect takes a partial
  // function, so only well-formed (name, value) pairs are kept
  s.split('|').map(_.split(",")).collect { case Array(k, v) => (k, v) }.toMap
)

val dfWithMap = df.withColumn("allvalsmap", to_map($"allvals"))

// collect the distinct keys across all rows; this is an action, so the
// key names are materialized on the driver before columns are added
val keys = dfWithMap.select($"allvalsmap").as[Map[String, String]].flatMap(_.keys.toSeq).distinct.collect

// add one column per key by looking it up in the map, then drop the map
keys.foldLeft(dfWithMap)((df, k) => df.withColumn(k, $"allvalsmap".getItem(k))).drop("allvalsmap").show
// +---+--------------------+-----+-----+-----+
// | id|             allvals| col3| col1| col2|
// +---+--------------------+-----+-----+-----+
// |  1|col1,val11|col3,v...|val31|val11| null|
// |  3|col3,val33|col1,v...|val33|val13| null|
// |  2|          col2,val22| null| null|val22|
// +---+--------------------+-----+-----+-----+

Inspired by this answer by user6910411.
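A possible shortcut, offered here as a sketch rather than part of the original answer: Spark's built-in SQL function str_to_map (available since roughly Spark 2.0) can replace the UDF for the string-to-map step. Both delimiters are interpreted as regular expressions, so the pipe must be escaped:

// sketch only: str_to_map(text, pairDelim, keyValueDelim) splits on regex
// delimiters, hence '\\|' inside the SQL literal to match a literal pipe
val dfWithMapSql = df.withColumn("allvalsmap", expr("str_to_map(allvals, '\\\\|', ',')"))
// allvalsmap is a map<string,string> column, usable with the same
// keys/foldLeft steps as above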

You can transform the DataFrame using split, explode and groupBy/pivot/agg, as follows:

val df = Seq(
  (1, "col1,val11|col3,val31"),
  (2, "col3,val33|col1,val13"),
  (3, "col2,val22")
).toDF("id", "allvals")

import org.apache.spark.sql.functions._

df.withColumn("temp", split($"allvals", "\\|")).
  withColumn("temp", explode($"temp")).
  withColumn("temp", split($"temp", ",")).
  select($"id", $"allvals", $"temp".getItem(0).as("k"), $"temp".getItem(1).as("v")).
  groupBy($"id", $"allvals").pivot("k").agg(first($"v"))

// +---+---------------------+-----+-----+-----+
// |id |allvals              |col1 |col2 |col3 |
// +---+---------------------+-----+-----+-----+
// |1  |col1,val11|col3,val31|val11|null |val31|
// |3  |col2,val22           |null |val22|null |
// |2  |col3,val33|col1,val13|val13|null |val33|
// +---+---------------------+-----+-----+-----+
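One possible refinement (my addition, not from the answer above): if the target column names are known in advance, you can pass them to pivot explicitly, which spares Spark a separate job to discover the distinct key values:

// sketch assuming the keys col1/col2/col3 are known up front;
// pivot(col, values) skips the distinct-value discovery pass
df.withColumn("temp", explode(split($"allvals", "\\|"))).
  withColumn("temp", split($"temp", ",")).
  select($"id", $"allvals", $"temp".getItem(0).as("k"), $"temp".getItem(1).as("v")).
  groupBy($"id", $"allvals").pivot("k", Seq("col1", "col2", "col3")).agg(first($"v")).
  show(false)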

