
How to split column in Spark Dataframe to multiple columns

In my case, how can I split a StringType column with the format '1-1235.0 2-1248.0 3-7895.2' into another column of ArrayType containing ['1','2','3']?

This is relatively simple with a UDF:

import org.apache.spark.sql.functions.udf
import spark.implicits._   // for toDF and the $-notation

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")

// For each space-separated token, take the part before '-' and parse it as Int
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))

df.withColumn("newCol", extractFirst($"input"))
  .show()

gives

+--------------------+---------+
|               input|   newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+

I could not find an easy solution with Spark internals (other than using split in combination with explode etc. and then re-aggregating); a sketch of that route follows below.
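For completeness, here is a minimal sketch of the explode-and-re-aggregate route, reusing the df from above; the id column and the column names pair/key are illustrative, not from the original post. Note that collect_list does not guarantee element order after a shuffle, which is one reason this route is less attractive:

import org.apache.spark.sql.functions.{split, explode, substring_index, collect_list, monotonically_increasing_id}
import spark.implicits._

// Tag each row so it can be re-assembled after the explode
val exploded = df
  .withColumn("id", monotonically_increasing_id())
  .withColumn("pair", explode(split($"input", " ")))
  .withColumn("key", substring_index($"pair", "-", 1).cast("int"))

// Re-aggregate the extracted keys back into one array per input row
val reAggregated = exploded
  .groupBy("id", "input")
  .agg(collect_list($"key").as("newCol"))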

You can split the string into an array using the split function and then transform the array using the higher-order function TRANSFORM (available since Spark 2.4) together with substring_index:

import org.apache.spark.sql.functions.{split, expr}
import spark.implicits._   // for toDF and the $-notation

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")

df.withColumn("array", split($"stringCol", " "))
  .withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))"))

Notice that this is a native approach; no UDF is applied.
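One difference from the UDF version: substring_index returns strings, so result is an array of strings rather than integers. If integers are needed, a CAST inside the lambda should work (a sketch, assuming the same DataFrame as above):

df.withColumn("array", split($"stringCol", " "))
  .withColumn("result", expr("TRANSFORM(array, x -> CAST(substring_index(x, '-', 1) AS INT))"))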
