
Spark DataFrame column with comma-separated list of other columns that needs to be updated with values given in another column

I have a use case I am trying to solve with Spark DataFrames. Column "col4" is a comma-separated string of other column names that need to be updated with the string values given in column "col5".

+----+----+----+---------+----+
|col1|col2|col3|     col4|col5|
+----+----+----+---------+----+
|   A|   B|   C|col2,col3| X,Y|
|   P|   Q|   R|     col1|   Z|
|   I|   J|   K|col1,col3| S,T|
+----+----+----+---------+----+

After the transformation, the resulting DataFrame should look like the one below. How can I achieve this?

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   X|   Y|
|   Z|   Q|   R|
|   S|   J|   T|
+----+----+----+
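
For anyone who wants to reproduce the answers below, here is a minimal sketch that builds the sample DataFrame (assumption: a spark-shell or notebook session where a SparkSession named spark and spark.implicits._ are available):

import spark.implicits._

// Build the example DataFrame from the question
val df = Seq(
  ("A", "B", "C", "col2,col3", "X,Y"),
  ("P", "Q", "R", "col1",      "Z"),
  ("I", "J", "K", "col1,col3", "S,T")
).toDF("col1", "col2", "col3", "col4", "col5")

df.show()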

Basically, I created two arrays from col4 and col5, used map_from_arrays to turn them into a map, then built col1_, col2_, col3_ columns from that map, and finally used when/otherwise clauses (when isNotNull) to change the original columns in place.

(Spark 2.4+)

Data

df.show()

+----+----+----+---------+----+
|col1|col2|col3|     col4|col5|
+----+----+----+---------+----+
|   A|   B|   C|col2,col3| X,Y|
|   P|   Q|   R|     col1|   Z|
|   I|   J|   K|col1,col3| S,T|
+----+----+----+---------+----+

%scala

import org.apache.spark.sql.functions.{col, map_from_arrays, split, when}

// The $ column syntax assumes spark.implicits._ is in scope (available by default in spark-shell and most notebooks).
df.withColumn("col6", map_from_arrays(split($"col4", ","), split($"col5", ","))) // map of column name -> new value
  .drop("col4", "col5")
  .select($"col1", $"col2", $"col3",
          col("col6.col1").alias("col1_"),
          col("col6.col2").alias("col2_"),
          col("col6.col3").alias("col3_"))
  .withColumn("col1", when(col("col1_").isNotNull, col("col1_")).otherwise($"col1"))
  .withColumn("col2", when(col("col2_").isNotNull, col("col2_")).otherwise($"col2"))
  .withColumn("col3", when(col("col3_").isNotNull, col("col3_")).otherwise($"col3"))
  .drop("col1_", "col2_", "col3_")
  .show()

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   X|   Y|
|   Z|   Q|   R|
|   S|   J|   T|
+----+----+----+  

%python

from pyspark.sql import functions as F

df.withColumn("col6", F.map_from_arrays(F.split("col4",','),F.split("col5",','))).drop("col4","col5")\
.select("col1","col2","col3",F.col("col6.col1").alias("col1_"),F.col("col6.col2").alias("col2_"),F.col("col6.col3").alias("col3_"))\
.withColumn("col1", F.when(F.col("col1_").isNotNull(), F.col("col1_")).otherwise(F.col("col1")))\
.withColumn("col2", F.when(F.col("col2_").isNotNull(),F.col("col2_")).otherwise(F.col("col2")))\
.withColumn("col3",F.when(F.col("col3_").isNotNull(),F.col("col3_")).otherwise(F.col("col3")))\
.drop("col1_","col2_","col3_")\
.show()


+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   X|   Y|
|   Z|   Q|   R|
|   S|   J|   T|
+----+----+----+

UPDATE: This will work for Spark 2.0+ (without map_from_arrays):

(You could write a Scala UDF and apply similar logic; a sketch is shown after the Python example below. Hope it helps.)

%python

from pyspark.sql import functions as F
from pyspark.sql.functions import udf


@udf("map<string,string>")
def as_dict(x):
    # x is an array of two arrays [keys, values]; zip them into a dict
    return dict(zip(*x)) if x else None


df.withColumn("col6", F.array(F.split(("col4"),','),F.split(("col5"),','))).drop("col4","col5")\
.withColumn("col6", as_dict("col6")).select("col1","col2","col3",F.col("col6.col1").alias("col1_"),F.col("col6.col2").alias("col2_"),F.col("col6.col3").alias("col3_"))\
.withColumn("col1", F.when(F.col("col1_").isNotNull(), F.col("col1_")).otherwise(F.col("col1")))\
.withColumn("col2", F.when(F.col("col2_").isNotNull(),F.col("col2_")).otherwise(F.col("col2")))\
.withColumn("col3",F.when(F.col("col3_").isNotNull(),F.col("col3_")).otherwise(F.col("col3")))\
.drop("col1_","col2_","col3_")\
.show()
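
As mentioned above, the same logic can be expressed as a Scala UDF. A minimal, untested sketch (assumptions: Spark 2.0+, spark.implicits._ in scope, and the helper name asDict chosen here for illustration):

%scala

import org.apache.spark.sql.functions.{coalesce, col, split, udf}

// Zip the split key/value arrays into a Map[String, String] (null-safe).
val asDict = udf { (keys: Seq[String], values: Seq[String]) =>
  if (keys == null || values == null) null else keys.zip(values).toMap
}

val withMap = df.withColumn("col6", asDict(split($"col4", ","), split($"col5", ",")))

// For each target column, prefer the value from the map when it exists.
Seq("col1", "col2", "col3")
  .foldLeft(withMap) { (acc, c) => acc.withColumn(c, coalesce(col("col6").getItem(c), col(c))) }
  .drop("col4", "col5", "col6")
  .show()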

Spark 2.4+

If there are more than three target columns, the approach should still scale. I have written this code so it can be extended easily.

import org.apache.spark.sql.functions._ // explode, arrays_zip, split, first, coalesce, col, monotonically_increasing_id

// The $ column syntax assumes spark.implicits._ is in scope.
val cols = Seq("col1", "col2", "col3")

// Explode the zipped (name, value) pairs and pivot them back into columns,
// then suffix the pivoted columns with "2" so they can be joined to the original.
val df1 = df.withColumn("id", monotonically_increasing_id)
val df2 = cols.foldLeft(
    df1.withColumn("col6", explode(arrays_zip(split($"col4", ","), split($"col5", ","))))
       .groupBy("id").pivot($"col6.0").agg(first($"col6.1"))
) { (df, c) => df.withColumnRenamed(c, c + "2") }

// Prefer the pivoted value where it exists, otherwise keep the original column.
cols.foldLeft(df1.join(df2, "id")) { (df, c) => df.withColumn(c, coalesce(col(c + "2"), col(c))) }
  .select(cols.head, cols.tail: _*)
  .show

The result is:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   X|   Y|
|   Z|   Q|   R|
|   S|   J|   T|
+----+----+----+
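
To avoid hardcoding the target columns entirely, cols could also be derived from the DataFrame itself; a small sketch under that assumption:

// Derive the target columns from the schema, excluding the name/value metadata columns.
val cols = df.columns.filterNot(Set("col4", "col5")).toSeq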

This is a problem that can easily be handled with the map function of RDDs:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val targetColumns = df.columns.take(3) // we assume the final df should contain the first 3 columns; adjust this to your requirements if not

val updatedRDD = df.rdd.map { r =>
  val keys = r.getAs[String]("col4").split(",")
  val values = r.getAs[String]("col5").split(",")
  val mapping = keys.zip(values).toMap // e.g. Map(col2 -> X, col3 -> Y)

  // For each target column, take the new value from the mapping if present,
  // otherwise keep the existing value from the row.
  val updatedValues = targetColumns.map { c =>
    if (keys.contains(c)) mapping(c)
    else r.getAs[String](c)
  }

  Row(updatedValues: _*)
}

val schema = StructType(targetColumns.map { c => StructField(c, StringType, true) })
spark.createDataFrame(updatedRDD, schema).show(false)

// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |A   |X   |Y   |
// |Z   |Q   |R   |
// |S   |J   |T   |
// +----+----+----+

We create a map from col4 (keys) and col5 (values), which is then used to build the final Row that is returned.
