
How to split column in Spark Dataframe to multiple columns

In my case, how can I split a StringType column with the format '1-1235.0 2-1248.0 3-7895.2' into another column of ArrayType containing ['1','2','3']?

This is relatively simple with a UDF:

import org.apache.spark.sql.functions.udf
import spark.implicits._   // for toDF and the $-notation

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")

// For each space-separated token, take the part before '-' and parse it as Int
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))

df.withColumn("newCol", extractFirst($"input"))
  .show()

gives

+--------------------+---------+
|               input|   newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+

I could not find an easy solution with Spark internals (other than using split in combination with explode etc. and then re-aggregating); a sketch of that route follows below.
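For completeness, here is a minimal sketch of the explode-and-re-aggregate route, reusing the df from above; the id column and the column names pair/key are illustrative, not from the original post. Note that collect_list does not guarantee element order after a shuffle, which is one reason this route is less attractive:

import org.apache.spark.sql.functions.{split, explode, substring_index, collect_list, monotonically_increasing_id}
import spark.implicits._

// Tag each row so it can be re-assembled after the explode
val exploded = df
  .withColumn("id", monotonically_increasing_id())
  .withColumn("pair", explode(split($"input", " ")))
  .withColumn("key", substring_index($"pair", "-", 1).cast("int"))

// Re-aggregate the extracted keys back into one array per input row
val reAggregated = exploded
  .groupBy("id", "input")
  .agg(collect_list($"key").as("newCol"))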

You can split the string into an array using the split function and then transform the array using the higher-order function TRANSFORM (available since Spark 2.4) together with substring_index:

import org.apache.spark.sql.functions.{split, expr}
import spark.implicits._   // for toDF and the $-notation

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")

df.withColumn("array", split($"stringCol", " "))
  .withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))"))

Notice that this is a native approach; no UDF is applied.
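One difference from the UDF version: substring_index returns strings, so result is an array of strings rather than integers. If integers are needed, a CAST inside the lambda should work (a sketch, assuming the same DataFrame as above):

df.withColumn("array", split($"stringCol", " "))
  .withColumn("result", expr("TRANSFORM(array, x -> CAST(substring_index(x, '-', 1) AS INT))"))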
