Adding a new column to a DataFrame based on an existing column

Question

I have a CSV file with a datetime column, e.g. "2011-05-02T04:52:09+00:00".

I am using Scala. The file is loaded into a Spark DataFrame, and I can use Joda-Time to parse the date:

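import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// Reuse the context created above instead of constructing a second one
val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "data.csv", "header" -> "true"))
// MM = month, HH = hour of day (0-23); lowercase mm means minutes, kk means clock-hour 1-24
val d = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ssZ")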

I would like to create new columns based on the datetime field for time series analysis.

In a DataFrame, how do I create a column based on the value of another column?

I noticed that DataFrame has the function df.withColumn("dt", column). Is there a way to create a column based on the value of an existing column?

Thanks

Answer 1

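import org.apache.spark.sql.types.DateType
import org.apache.spark.sql.functions._
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

// MM = month, HH = hour of day (0-23); lowercase mm would be read as minutes
val d = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ssZ")
// DateType columns hold java.sql.Date, so build one from the parsed instant
val dtFunc: (String => java.sql.Date) = (arg1: String) => new java.sql.Date(DateTime.parse(arg1, d).getMillis)
val x = df.withColumn("dt", callUDF(dtFunc, DateType, col("dt_string")))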

callUDF and col are both provided by functions, as the import shows.

The dt_string in col("dt_string") is the name of the original column in your df, the one you want to transform.

Alternatively, you could replace the last statement with:

val dtFunc2 = udf(dtFunc)
val x = df.withColumn("dt", dtFunc2(col("dt_string")))
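
Since the question mentions time series analysis, the same pattern (a plain String function wrapped as a UDF) could be used to derive further columns. Below is a minimal sketch of such functions, written so they can be checked outside Spark. Note the assumptions: it swaps Joda-Time for the JDK's java.time (Java 8 or later) so that it runs without extra dependencies, and the names TimeFeatures, toSqlDate, hourOfDay, dayOfWeek and month are hypothetical, not part of Spark or of the original answer.

```scala
import java.time.OffsetDateTime

object TimeFeatures {
  // ISO-8601 strings such as "2011-05-02T04:52:09+00:00" parse with
  // OffsetDateTime's default formatter, so no explicit pattern is needed.
  private def parse(s: String): OffsetDateTime = OffsetDateTime.parse(s)

  // Calendar date of the timestamp (in its own offset) as a java.sql.Date.
  val toSqlDate: String => java.sql.Date =
    s => java.sql.Date.valueOf(parse(s).toLocalDate)

  // Hour of day, 0-23, in the timestamp's own offset.
  val hourOfDay: String => Int = s => parse(s).getHour

  // ISO day of week: 1 = Monday ... 7 = Sunday.
  val dayOfWeek: String => Int = s => parse(s).getDayOfWeek.getValue

  // Month of year, 1-12.
  val month: String => Int = s => parse(s).getMonthValue
}
```

Each of these could then be applied the same way as dtFunc, for example df.withColumn("hour", udf(TimeFeatures.hourOfDay)(col("dt_string"))), where the column name "hour" is only an illustration.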

Solution 1
7 Accepted 2015-04-28 07:08:40
