How to split a column in a Spark DataFrame into multiple columns
In my case, how can I split a StringType column with the format '1-1235.0 2-1248.0 3-7895.2' into another column of ArrayType containing ['1','2','3']?
This is relatively simple with a UDF:
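import org.apache.spark.sql.functions.udf
import spark.implicits._ // assumes a SparkSession named `spark` is in scope, as in spark-shell
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")
// Split on spaces, then keep the integer before the '-' in each token.
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))
df.withColumn("newCol", extractFirst($"input"))
  .show()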
gives
+--------------------+---------+
|               input|   newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+
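One caveat with the UDF above is that its lambda throws on a null input or on a token whose prefix is not an integer. The following is a minimal sketch in plain Scala of a more defensive version of that lambda; the name extractKeys is hypothetical and not part of the original post, and whether skipping bad tokens is the right behaviour depends on the data.

```scala
import scala.util.Try

// Hypothetical helper (not from the original post): returns an empty result
// for null input and skips tokens whose part before '-' is not an integer.
def extractKeys(s: String): Seq[Int] =
  Option(s).toList                     // null becomes an empty list
    .flatMap(_.split(" "))             // split on single spaces
    .filter(_.nonEmpty)                // drop empty tokens from repeated spaces
    .flatMap(token => Try(token.split('-')(0).toInt).toOption)
```

Assuming Spark's usual type support for Seq[Int], it could presumably be wrapped as udf(extractKeys _) and used in place of extractFirst.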
I could not find an easy solution with Spark's built-in functions (other than using split in combination with explode etc. and then re-aggregating).
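For reference, the split / explode / re-aggregate route mentioned in the question might look roughly like the sketch below. It is an illustration under assumptions, not the asker's actual code: the local SparkSession setup and the column names token and key are made up here, grouping by the input string merges rows with identical input, and collect_list does not guarantee element order after a shuffle.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, explode, split, substring_index}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")

val result = df
  .withColumn("token", explode(split($"input", " ")))    // one row per "k-v" token
  .withColumn("key", substring_index($"token", "-", 1))  // text before the first '-'
  .groupBy($"input")                                     // re-aggregate per original string
  .agg(collect_list($"key").as("newCol"))

result.show(false)
```

The extra shuffle introduced by groupBy is the main reason this route is more awkward than a per-row transformation.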
You can split the string into an array using the split function and then transform the array using the higher-order function TRANSFORM (available since Spark 2.4) together with substring_index:
import org.apache.spark.sql.functions.{split, expr}
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")
// Split on spaces, then keep the part of each element before the first '-'.
df.withColumn("array", split($"stringCol", " "))
  .withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))"))
Notice that this is a native approach; no UDF is applied.
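On Spark 3.0 or later, the same idea can also be written without a SQL string, since org.apache.spark.sql.functions exposes transform as a Scala function. The sketch below assumes such a version and a locally created SparkSession; the cast to int is an optional addition made here so the result matches the integer array produced by the UDF in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{split, substring_index, transform}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("transform-sketch").getOrCreate()
import spark.implicits._

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")

val result = df.withColumn(
  "result",
  // For each element, keep the text before the first '-' and cast it to int.
  transform(split($"stringCol", " "), x => substring_index(x, "-", 1).cast("int"))
)

result.show(false)
```

Dropping the cast leaves an array of strings, which is what the question literally asks for.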