Spark DataFrame column with comma-separated list of other columns that need to be updated with values given in another column
I have a use case I am trying to solve with Spark DataFrames. Column "col4" is a comma-separated string consisting of other column names that need to be updated with the string values given in column col5.
+----+----+----+---------+----+
|col1|col2|col3| col4|col5|
+----+----+----+---------+----+
| A| B| C|col2,col3| X,Y|
| P| Q| R| col1| Z|
| I| J| K|col1,col3| S,T|
+----+----+----+---------+----+
After the transformation, the resulting DataFrame should look like below. How can I achieve this?
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| X| Y|
| Z| Q| R|
| S| J| T|
+----+----+----+
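For reference, a minimal sketch to build this sample DataFrame (assuming a spark-shell or notebook session where a SparkSession named spark and its implicits are available):
import spark.implicits._

val df = Seq(
  ("A", "B", "C", "col2,col3", "X,Y"),
  ("P", "Q", "R", "col1",      "Z"),
  ("I", "J", "K", "col1,col3", "S,T")
).toDF("col1", "col2", "col3", "col4", "col5")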
Basically I created two arrays from col4 and col5, used map_from_arrays to create a map, made columns for col1, col2, col3 by looking them up in the map, and then used when/otherwise (when isNotNull) clauses to change the columns in place.
(Spark 2.4+)
Data
df.show()
+----+----+----+---------+----+
|col1|col2|col3| col4|col5|
+----+----+----+---------+----+
| A| B| C|col2,col3| X,Y|
| P| Q| R| col1| Z|
| I| J| K|col1,col3| S,T|
+----+----+----+---------+----+
%scala
import org.apache.spark.sql.functions.{col, map_from_arrays, split, when}
// outside a notebook you may also need: import spark.implicits._ (for the $"..." syntax)

df.withColumn("col6", map_from_arrays(split($"col4", ","), split($"col5", ","))) // col6: Map(columnName -> newValue)
  .drop("col4", "col5")
  .select($"col1", $"col2", $"col3",
          col("col6.col1").alias("col1_"), // lookup is null when the map has no such key
          col("col6.col2").alias("col2_"),
          col("col6.col3").alias("col3_"))
  .withColumn("col1", when(col("col1_").isNotNull, col("col1_")).otherwise($"col1"))
  .withColumn("col2", when(col("col2_").isNotNull, col("col2_")).otherwise($"col2"))
  .withColumn("col3", when(col("col3_").isNotNull, col("col3_")).otherwise($"col3"))
  .drop("col1_", "col2_", "col3_")
  .show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| X| Y|
| Z| Q| R|
| S| J| T|
+----+----+----+
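As a side note, the three when/otherwise steps can be collapsed with coalesce, which has the same semantics here (take the mapped value if present, otherwise keep the original). A sketch, not part of the original answer:
import org.apache.spark.sql.functions.{coalesce, col, map_from_arrays, split}

val withMap = df.withColumn("m", map_from_arrays(split($"col4", ","), split($"col5", ",")))
Seq("col1", "col2", "col3").foldLeft(withMap) { (d, c) =>
  d.withColumn(c, coalesce(col("m")(c), col(c))) // map lookup yields null for absent keys
}.drop("col4", "col5", "m").show()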
%python
from pyspark.sql import functions as F

df.withColumn("col6", F.map_from_arrays(F.split("col4", ','), F.split("col5", ',')))\
  .drop("col4", "col5")\
  .select("col1", "col2", "col3",
          F.col("col6.col1").alias("col1_"),  # null when the map has no such key
          F.col("col6.col2").alias("col2_"),
          F.col("col6.col3").alias("col3_"))\
  .withColumn("col1", F.when(F.col("col1_").isNotNull(), F.col("col1_")).otherwise(F.col("col1")))\
  .withColumn("col2", F.when(F.col("col2_").isNotNull(), F.col("col2_")).otherwise(F.col("col2")))\
  .withColumn("col3", F.when(F.col("col3_").isNotNull(), F.col("col3_")).otherwise(F.col("col3")))\
  .drop("col1_", "col2_", "col3_")\
  .show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| X| Y|
| Z| Q| R|
| S| J| T|
+----+----+----+
UPDATE: This will work for Spark 2.0+ (without map_from_arrays):
(You could write a Scala UDF and apply similar logic; hope it helps. A sketch follows the code below.)
%python
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

@udf("map<string,string>")
def as_dict(x):
    # x is a pair [keys, values]; zip them into a dict, e.g. {'col2': 'X', 'col3': 'Y'}
    return dict(zip(*x)) if x else None

df.withColumn("col6", F.array(F.split("col4", ','), F.split("col5", ',')))\
  .drop("col4", "col5")\
  .withColumn("col6", as_dict("col6"))\
  .select("col1", "col2", "col3",
          F.col("col6.col1").alias("col1_"),
          F.col("col6.col2").alias("col2_"),
          F.col("col6.col3").alias("col3_"))\
  .withColumn("col1", F.when(F.col("col1_").isNotNull(), F.col("col1_")).otherwise(F.col("col1")))\
  .withColumn("col2", F.when(F.col("col2_").isNotNull(), F.col("col2_")).otherwise(F.col("col2")))\
  .withColumn("col3", F.when(F.col("col3_").isNotNull(), F.col("col3_")).otherwise(F.col("col3")))\
  .drop("col1_", "col2_", "col3_")\
  .show()
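The Scala UDF the update alludes to could look roughly like this (a sketch; asMap is a hypothetical name, and the remaining select/when/otherwise steps are the same as in the 2.4+ version above):
import org.apache.spark.sql.functions.{split, udf}

// Zip the two string arrays into a Map[String, String], mirroring the Python as_dict UDF.
val asMap = udf { (keys: Seq[String], values: Seq[String]) =>
  if (keys == null || values == null) null else keys.zip(values).toMap
}

df.withColumn("col6", asMap(split($"col4", ","), split($"col5", ",")))
  .drop("col4", "col5")
  // ...then the same select / when / otherwise steps as above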
Spark 2.4+
If there are more than just these 3 columns, this should scale to more of them; I wrote the code so it expands easily. The idea: explode the zipped (name, value) pairs from col4/col5, pivot the names back into columns keyed by a row id, then coalesce each pivoted value over the original.
import org.apache.spark.sql.functions._

val cols = Seq("col1", "col2", "col3")
val df1 = df.withColumn("id", monotonically_increasing_id)

// Explode the zipped (name, value) pairs, pivot the names back into columns keyed
// by id, and rename the pivoted columns with a "2" suffix to avoid clashes.
val df2 = cols.foldLeft(
  df1.withColumn("col6", explode(arrays_zip(split($"col4", ","), split($"col5", ","))))
     .groupBy("id").pivot($"col6.0").agg(first($"col6.1"))
) { (df, c) => df.withColumnRenamed(c, c + "2") }

// Join back on id and prefer the pivoted (new) value when present.
cols.foldLeft(df1.join(df2, "id")) { (df, c) => df.withColumn(c, coalesce(col(c + "2"), col(c))) }
  .select(cols.head, cols.tail: _*)
  .show
The result is:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| X| Y|
| Z| Q| R|
| S| J| T|
+----+----+----+
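To avoid hard-coding the column list, cols could also be derived from the DataFrame itself. A sketch, under the assumption that every column except col4 and col5 is a replacement target:
val cols = df.columns.filterNot(Set("col4", "col5")).toSeq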
This is a problem which can be easily handled with the map function of RDDs:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val targetColumns = df.columns.take(3) // we assume the final df should contain the first 3 columns; feel free to modify this according to your requirements

val updatedRDD = df.rdd.map { r =>
  val keys = r.getAs[String]("col4").split(",")
  val values = r.getAs[String]("col5").split(",")
  val mapping = keys.zip(values).toMap // e.g. Map(col2 -> X, col3 -> Y)

  val updatedValues = targetColumns.map { c =>
    if (keys.contains(c))
      mapping(c)         // take the replacement value
    else
      r.getAs[String](c) // keep the original value
  }

  Row(updatedValues: _*)
}

val schema = StructType(targetColumns.map { c => StructField(c, StringType, true) })
spark.createDataFrame(updatedRDD, schema).show(false)
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |A |X |Y |
// |Z |Q |R |
// |S |J |T |
// +----+----+----+
We create a map using col4 -> keys, col5 -> values, which is used to build the final Row that will be returned.