
Spark Scala: How to transform a column in a DF

I have a dataframe in Spark with many columns and a udf that I defined. I want the same dataframe back, except with one column transformed. Furthermore, my udf takes in a string and returns a timestamp. Is there an easy way to do this? I tried

val test = myDF.select("my_column").rdd.map(r => getTimestamp(r)) 

but this returns an RDD with just the transformed column.

If you really need to use your function, I can suggest two options:

1) Using map / toDF:

import org.apache.spark.sql.Row
import sqlContext.implicits._

def getTimestamp: (String => java.sql.Timestamp) = // your function here

val test = myDF.select("my_column").rdd.map {
  case Row(string_val: String) => (string_val, getTimestamp(string_val)) // keep the original value alongside the converted one
}.toDF("my_column", "new_column")

2) Using UDFs ( UserDefinedFunction ):

import org.apache.spark.sql.functions._

def getTimestamp: (String => java.sql.Timestamp) = // your function here

val newCol = udf(getTimestamp).apply(col("my_column")) // creates the new column
val test = myDF.withColumn("new_column", newCol) // adds the new column to original DF
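
Not in the original answer, but a quick sanity check (with the imports above and a made-up toy DataFrame) shows that withColumn keeps all existing columns and only appends the new one:

val myDF = Seq(
  (1, "2016-01-30 12:34"),
  (2, "2016-01-31 08:00")
).toDF("id", "my_column") // toDF on a Seq needs import sqlContext.implicits._

val test = myDF.withColumn("new_column", udf(getTimestamp).apply(col("my_column")))
test.printSchema() // id and my_column are untouched; new_column is of timestamp type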

There's more detail about Spark SQL UDFs in this nice article by Bill Chambers.


Alternatively,

If you just want to transform a StringType column into a TimestampType column, you can use the unix_timestamp column function, available since Spark SQL 1.5:

val test = myDF
  .withColumn("new_column", unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm").cast("timestamp"))

Note: For Spark 1.5.x, it is necessary to multiply the result of unix_timestamp by 1000 before casting to timestamp (issue SPARK-11724). The resulting code would be:

val test = myDF
  .withColumn("new_column", (unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm") * 1000L).cast("timestamp"))

Edit: Added udf option
