Spark DataFrame aggregate column names
I have a DataFrame with the following structure:
root
|-- very_hot: string (nullable = true)
|-- hot: string (nullable = true)
|-- cold: string (nullable = true)
|-- little_snow: string (nullable = true)
|-- medium_snow: string (nullable = true)
|-- very_cold: string (nullable = true)
|-- deep_snow: string (nullable = true)
|-- freezing: string (nullable = true)
|-- windy: string (nullable = true)
Each of those columns contains either True or False. I want to create a new column holding an array of the names of the columns that are True. How can I do it?
EDIT: Here's the table I have:
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
|very_hot|  hot| cold|little_snow|medium_snow|very_cold|deep_snow|freezing|windy|
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
|    True|False|False|      False|      False|    False|    False|   False| True|
|   False|False| True|       True|      False|    False|    False|   False|False|
|   False|False| True|      False|       True|    False|    False|   False|False|
|   False|False|False|      False|      False|     True|     True|   False|False|
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
The column I want should look like this:
+--------------------+
|            features|
+--------------------+
|     very_hot, windy|
|   cold, little_snow|
|   cold, medium_snow|
|very_cold, deep_snow|
+--------------------+
This Scala code
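// Assumes a spark-shell session; elsewhere, add `import spark.implicits._` for toDF.
val data = Seq((true, true, false), (true, false, true), (false, true, true))
val df = data.toDF("first", "second", "third")
// Pair each column name with its positional index.
val names = df.schema.map(_.name).zipWithIndex
df.rdd
  .map(r => names
    .filter(n => r.getBoolean(n._2)) // keep columns whose value is true
    .map(_._1)
    .mkString(",")
  ).toDF("feature").show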
will result in
+------------+
|     feature|
+------------+
|first,second|
| first,third|
|second,third|
+------------+
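Note that the snippet above reads each value with getBoolean, which expects BooleanType columns, whereas the DataFrame in the question stores "True"/"False" as strings; calling getBoolean on a string column would be expected to fail with a ClassCastException. The following is a minimal sketch of the same row-wise idea adapted to string columns. The object name TrueColumnNames, the helper trueColumns, and the local SparkSession setup are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

object TrueColumnNames {
  // Hypothetical helper: joins the names of the columns whose string value is "True".
  // Null or any other value is simply skipped.
  def trueColumns(names: Seq[String], values: Seq[String]): String =
    names.zip(values).collect { case (n, "True") => n }.mkString(",")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("true-column-names")
      .getOrCreate()
    import spark.implicits._

    // String-typed flags, as in the question's schema.
    val df = Seq(("True", "False", "True"), ("False", "True", "True"))
      .toDF("first", "second", "third")
    val names = df.columns.toSeq

    // Read every field as a string and keep the names of the "True" ones.
    df.map(r => trueColumns(names, names.indices.map(i => r.getString(i))))
      .toDF("feature")
      .show()

    spark.stop()
  }
}
```

Keeping the selection logic in a plain function makes it easy to check without a cluster; the column-expression answers that follow avoid the per-row function altogether and are usually the better fit for large data.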
This code might be helpful to you:
import org.apache.spark.sql.functions._
val df = Seq(
  ("True", "False", "False", "False", "False", "False", "False", "False", "True"),
  ("False", "False", "True", "True", "False", "False", "False", "False", "False"),
  ("False", "False", "True", "False", "True", "False", "False", "False", "False"),
  ("False", "False", "False", "False", "False", "True", "True", "False", "False")
).toDF("very_hot", "hot", "cold", "little_snow", "medium_snow", "very_cold", "deep_snow", "freezing", "windy")
df.show()
/*
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
|very_hot|  hot| cold|little_snow|medium_snow|very_cold|deep_snow|freezing|windy|
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
|    True|False|False|      False|      False|    False|    False|   False| True|
|   False|False| True|       True|      False|    False|    False|   False|False|
|   False|False| True|      False|       True|    False|    False|   False|False|
|   False|False|False|      False|      False|     True|     True|   False|False|
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
*/
// `when` without `otherwise` yields null for non-matching rows; concat_ws skips nulls.
val df1 = df.withColumn("features", concat_ws(",",
  when(col("very_hot").contains("True"), "very_hot"),
  when(col("hot").contains("True"), "hot"),
  when(col("cold").contains("True"), "cold"),
  when(col("little_snow").contains("True"), "little_snow"),
  when(col("medium_snow").contains("True"), "medium_snow"),
  when(col("very_cold").contains("True"), "very_cold"),
  when(col("deep_snow").contains("True"), "deep_snow"),
  when(col("freezing").contains("True"), "freezing"),
  when(col("windy").contains("True"), "windy")
)).select("features") // keep only the new column
df1.show()
/*
+-------------------+
|           features|
+-------------------+
|     very_hot,windy|
|   cold,little_snow|
|   cold,medium_snow|
|very_cold,deep_snow|
+-------------------+
*/
Try this.
val df2 = df.withColumn("feature", concat_ws(", ", df.columns.map(c => when(col(c)===lit("True"), c)): _*))
df2.show(false)
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+--------------------+
|very_hot|hot  |cold |little_snow|medium_snow|very_cold|deep_snow|freezing|windy|feature             |
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+--------------------+
|true    |false|false|false      |false      |false    |false    |false   |true |very_hot, windy     |
|false   |false|true |true       |false      |false    |false    |false   |false|cold, little_snow   |
|false   |false|true |false      |true       |false    |false    |false   |false|cold, medium_snow   |
|false   |false|false|false      |false      |true     |true     |false   |false|very_cold, deep_snow|
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+--------------------+
df2.drop(df.columns: _*).show(false)
+--------------------+
|feature             |
+--------------------+
|very_hot, windy     |
|cold, little_snow   |
|cold, medium_snow   |
|very_cold, deep_snow|
+--------------------+
Another alternative:
df2.show(false)
df2.printSchema()
/**
* +--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
* |very_hot|hot  |cold |little_snow|medium_snow|very_cold|deep_snow|freezing|windy|
* +--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
* |True    |False|False|False      |False      |False    |False    |False   |True |
* |False   |False|True |True       |False      |False    |False    |False   |False|
* |False   |False|True |False      |True       |False    |False    |False   |False|
* |False   |False|False|False      |False      |True     |True     |False   |False|
* +--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
*
* root
* |-- very_hot: string (nullable = true)
* |-- hot: string (nullable = true)
* |-- cold: string (nullable = true)
* |-- little_snow: string (nullable = true)
* |-- medium_snow: string (nullable = true)
* |-- very_cold: string (nullable = true)
* |-- deep_snow: string (nullable = true)
* |-- freezing: string (nullable = true)
* |-- windy: string (nullable = true)
*/
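// Build one struct per column holding its name and its value.
// The TRANSFORM and FILTER higher-order functions require Spark 2.4 or later.
val columns = df2.columns.map(c => s"named_struct('name', '$c', 'value', `$c`)").mkString(", ")
df2.selectExpr(s"TRANSFORM(FILTER(array($columns), x -> x.value='True'), x -> x.name) as features")
  .show(false)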
/**
* +----------------------+
* |features              |
* +----------------------+
* |[very_hot, windy]     |
* |[cold, little_snow]   |
* |[cold, medium_snow]   |
* |[very_cold, deep_snow]|
* +----------------------+
*/