Spark DataFrame column with comma-separated list of other columns that need to be updated with values given in another column
I have a use case I am trying to solve with Spark DataFrames. Column "col4" is a comma-separated string consisting of other column names that need to be updated with the string values given in column col5.
+----+----+----+---------+----+
|col1|col2|col3| col4|col5|
+----+----+----+---------+----+
| A| B| C|col2,col3| X,Y|
| P| Q| R| col1| Z|
| I| J| K|col1,col3| S,T|
+----+----+----+---------+----+
After the transformation, the resulting DataFrame should look like below. How can I achieve this?
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| X| Y|
| Z| Q| R|
| S| J| T|
+----+----+----+
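For reference, a minimal sketch to build this sample DataFrame (assuming a spark-shell or notebook session where a SparkSession named spark and its implicits are available):
import spark.implicits._

val df = Seq(
  ("A", "B", "C", "col2,col3", "X,Y"),
  ("P", "Q", "R", "col1",      "Z"),
  ("I", "J", "K", "col1,col3", "S,T")
).toDF("col1", "col2", "col3", "col4", "col5")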
Basically I created two arrays from col4 and col5, used map_from_arrays to create a map, made columns for col1, col2, col3 by looking them up in the map, and then used when/otherwise (when isNotNull) clauses to change the columns in place.
(Spark 2.4+)
Data
df.show()
+----+----+----+---------+----+
|col1|col2|col3| col4|col5|
+----+----+----+---------+----+
| A| B| C|col2,col3| X,Y|
| P| Q| R| col1| Z|
| I| J| K|col1,col3| S,T|
+----+----+----+---------+----+
%scala
import org.apache.spark.sql.functions.{col, map_from_arrays, split, when}
// outside a notebook you may also need: import spark.implicits._ (for the $"..." syntax)

df.withColumn("col6", map_from_arrays(split($"col4", ","), split($"col5", ","))) // col6: Map(columnName -> newValue)
  .drop("col4", "col5")
  .select($"col1", $"col2", $"col3",
          col("col6.col1").alias("col1_"), // lookup is null when the map has no such key
          col("col6.col2").alias("col2_"),
          col("col6.col3").alias("col3_"))
  .withColumn("col1", when(col("col1_").isNotNull, col("col1_")).otherwise($"col1"))
  .withColumn("col2", when(col("col2_").isNotNull, col("col2_")).otherwise($"col2"))
  .withColumn("col3", when(col("col3_").isNotNull, col("col3_")).otherwise($"col3"))
  .drop("col1_", "col2_", "col3_")
  .show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| X| Y|
| Z| Q| R|
| S| J| T|
+----+----+----+
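As a side note, the three when/otherwise steps can be collapsed with coalesce, which has the same semantics here (take the mapped value if present, otherwise keep the original). A sketch, not part of the original answer:
import org.apache.spark.sql.functions.{coalesce, col, map_from_arrays, split}

val withMap = df.withColumn("m", map_from_arrays(split($"col4", ","), split($"col5", ",")))
Seq("col1", "col2", "col3").foldLeft(withMap) { (d, c) =>
  d.withColumn(c, coalesce(col("m")(c), col(c))) // map lookup yields null for absent keys
}.drop("col4", "col5", "m").show()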
%python
from pyspark.sql import functions as F

df.withColumn("col6", F.map_from_arrays(F.split("col4", ','), F.split("col5", ',')))\
  .drop("col4", "col5")\
  .select("col1", "col2", "col3",
          F.col("col6.col1").alias("col1_"),  # null when the map has no such key
          F.col("col6.col2").alias("col2_"),
          F.col("col6.col3").alias("col3_"))\
  .withColumn("col1", F.when(F.col("col1_").isNotNull(), F.col("col1_")).otherwise(F.col("col1")))\
  .withColumn("col2", F.when(F.col("col2_").isNotNull(), F.col("col2_")).otherwise(F.col("col2")))\
  .withColumn("col3", F.when(F.col("col3_").isNotNull(), F.col("col3_")).otherwise(F.col("col3")))\
  .drop("col1_", "col2_", "col3_")\
  .show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| X| Y|
| Z| Q| R|
| S| J| T|
+----+----+----+
UPDATE: This will work for Spark 2.0+ (without map_from_arrays):
(You could write a Scala UDF and apply similar logic; hope it helps. A sketch follows the code below.)
%python
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

@udf("map<string,string>")
def as_dict(x):
    # x is a pair [keys, values]; zip them into a dict, e.g. {'col2': 'X', 'col3': 'Y'}
    return dict(zip(*x)) if x else None

df.withColumn("col6", F.array(F.split("col4", ','), F.split("col5", ',')))\
  .drop("col4", "col5")\
  .withColumn("col6", as_dict("col6"))\
  .select("col1", "col2", "col3",
          F.col("col6.col1").alias("col1_"),
          F.col("col6.col2").alias("col2_"),
          F.col("col6.col3").alias("col3_"))\
  .withColumn("col1", F.when(F.col("col1_").isNotNull(), F.col("col1_")).otherwise(F.col("col1")))\
  .withColumn("col2", F.when(F.col("col2_").isNotNull(), F.col("col2_")).otherwise(F.col("col2")))\
  .withColumn("col3", F.when(F.col("col3_").isNotNull(), F.col("col3_")).otherwise(F.col("col3")))\
  .drop("col1_", "col2_", "col3_")\
  .show()
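The Scala UDF the update alludes to could look roughly like this (a sketch; asMap is a hypothetical name, and the remaining select/when/otherwise steps are the same as in the 2.4+ version above):
import org.apache.spark.sql.functions.{split, udf}

// Zip the two string arrays into a Map[String, String], mirroring the Python as_dict UDF.
val asMap = udf { (keys: Seq[String], values: Seq[String]) =>
  if (keys == null || values == null) null else keys.zip(values).toMap
}

df.withColumn("col6", asMap(split($"col4", ","), split($"col5", ",")))
  .drop("col4", "col5")
  // ...then the same select / when / otherwise steps as above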
Spark 2.4+
If there are more than just these 3 columns, this should scale to more of them; I wrote the code so it expands easily. The idea: explode the zipped (name, value) pairs from col4/col5, pivot the names back into columns keyed by a row id, then coalesce each pivoted value over the original.
import org.apache.spark.sql.functions._

val cols = Seq("col1", "col2", "col3")
val df1 = df.withColumn("id", monotonically_increasing_id)

// Explode the zipped (name, value) pairs, pivot the names back into columns keyed
// by id, and rename the pivoted columns with a "2" suffix to avoid clashes.
val df2 = cols.foldLeft(
  df1.withColumn("col6", explode(arrays_zip(split($"col4", ","), split($"col5", ","))))
     .groupBy("id").pivot($"col6.0").agg(first($"col6.1"))
) { (df, c) => df.withColumnRenamed(c, c + "2") }

// Join back on id and prefer the pivoted (new) value when present.
cols.foldLeft(df1.join(df2, "id")) { (df, c) => df.withColumn(c, coalesce(col(c + "2"), col(c))) }
  .select(cols.head, cols.tail: _*)
  .show
The result is:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| X| Y|
| Z| Q| R|
| S| J| T|
+----+----+----+
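To avoid hard-coding the column list, cols could also be derived from the DataFrame itself. A sketch, under the assumption that every column except col4 and col5 is a replacement target:
val cols = df.columns.filterNot(Set("col4", "col5")).toSeq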
This is a problem which can be easily handled with the map function of RDDs:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val targetColumns = df.columns.take(3) // we assume the final df should contain the first 3 columns; feel free to modify this according to your requirements

val updatedRDD = df.rdd.map { r =>
  val keys = r.getAs[String]("col4").split(",")
  val values = r.getAs[String]("col5").split(",")
  val mapping = keys.zip(values).toMap // e.g. Map(col2 -> X, col3 -> Y)

  val updatedValues = targetColumns.map { c =>
    if (keys.contains(c))
      mapping(c)         // take the replacement value
    else
      r.getAs[String](c) // keep the original value
  }

  Row(updatedValues: _*)
}

val schema = StructType(targetColumns.map { c => StructField(c, StringType, true) })
spark.createDataFrame(updatedRDD, schema).show(false)
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |A |X |Y |
// |Z |Q |R |
// |S |J |T |
// +----+----+----+
We create a map using col4 -> keys, col5 -> values, which is used to build the final Row that will be returned.