Conditionally transform a column in Spark
Suppose I have a DataFrame like the following:
import org.apache.spark.sql.{Row, DataFrame, SparkSession}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType, DoubleType, NumericType}
import org.apache.spark.sql.functions.{udf, col, lit, skewness}
val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse"),
  Row(null, "mouse"),
  Row(27, null)
)
val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
val df = someDF.withColumn("constantColumn", lit(1))
I want to compute the skewness of every column whose type is a NumericType. Then, if a column's skewness is above some threshold, I want to transform it via f(x) = log(x + 1). (I know that a log transform on negative data produces NaN, but I eventually want to write code that accounts for that possibility.)
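As a quick illustration of that NaN caveat (plain Scala, no Spark needed), scala.math.log1p computes log(x + 1) and is only defined for x > -1:

```scala
object Log1pCheck extends App {
  // log1p(x) = log(x + 1): defined only for x > -1
  println(scala.math.log1p(8.0))   // log(9.0), about 2.197
  println(scala.math.log1p(-1.0))  // -Infinity: log(0)
  println(scala.math.log1p(-27.0)) // NaN: log of a negative number
}
```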
What I have tried so far:
I have found a way to do this, but it requires a mutable DataFrame df. From my limited understanding, this is undesirable.
// wrap log1p in a UDF so it can be applied to DataFrame columns
val log1pUDF = udf(scala.math.log1p(_: Double))
val transformThreshold = 0.04
// filter those columns which have a type that inherits from NumericType
val numericColumns = df.columns.filter(column => df.select(column).schema(0).dataType.isInstanceOf[NumericType])
// for columns having NumericType, filter those that are sufficiently skewed
val columnsToTransform = numericColumns.filter(numericColumn => df.select(skewness(df(numericColumn))).head.getDouble(0) > transformThreshold)
// for all columns that are sufficiently skewed, perform log1p transform and add it to df
// df must be declared as a var (a mutable reference) for this to compile
for (column <- columnsToTransform) {
  df = df.withColumn(column + "_log1p", log1pUDF(df(column)))
}
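For comparison, the standard functional way to thread an immutable value through a sequence of updates is foldLeft. The sketch below is Spark-free: a plain Map of column name to values stands in for the DataFrame, and the names are toy stand-ins, not Spark API:

```scala
object FoldLeftSketch extends App {
  val columnsToTransform = List("number", "constantColumn")
  // a toy "DataFrame": column name -> column values
  val df0 = Map(
    "number"         -> List(8.0, 64.0),
    "constantColumn" -> List(1.0, 1.0)
  )

  // each step returns a new Map, so no mutable variable is needed
  val transformed = columnsToTransform.foldLeft(df0) { (acc, c) =>
    acc + (s"${c}_log1p" -> acc(c).map(scala.math.log1p))
  }
  println(transformed.keys.toList.sorted)
}
```

With Spark, the same shape would be `columnsToTransform.foldLeft(df)((acc, c) => acc.withColumn(c + "_log1p", log1pUDF(acc(c))))`, which avoids the mutable df entirely.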
My question: how can I do this without resorting to a mutable DataFrame?
(Running on Spark 2.4.0, Scala 2.11.12.)

Instead of the for() construct, you can use a recursive function:
def rec(df: DataFrame, columns: List[String]): DataFrame = columns match {
  case Nil => df
  case h :: xs => rec(df.withColumn(s"${h}_log1p", log1pUDF(col(h))), xs)
}
// usage: columnsToTransform is an Array, so pass columnsToTransform.toList