
Specify format of Timestamp written by pyspark

Background is a simple pyspark program that I developed on 1.6 using the Databricks CSV reader/writer, and all was happy. My dataframe had a timestamp column, which was written out in a standard YYYY-MM-DD HH24:MI:SS format.

foo,bar,2016-10-14 14:30:31.985 

Now I'm running it on EMR with Spark 2, and the timestamp column is being written as an epoch in microseconds. This causes a problem because the target (Redshift) can't natively handle this; it only handles seconds or milliseconds.

foo,bar,1476455559456000

Looking at the docs, it seems I should be able to specify the format used with timestampFormat, but I just get an error:

TypeError: csv() got an unexpected keyword argument 'timestampFormat'

Am I calling this wrong, or does the option not exist? Is there any other way to cleanly get my timestamp data out in a format that's not microseconds? (Milliseconds would be fine, or any other standard time format, really; one possible workaround is sketched after the repro code below.)


Simple code to reproduce:

import pyspark.sql.functions

# sqlContext is assumed to already exist, as in a pyspark shell
df = sqlContext.createDataFrame([('foo', 'bar')]).withColumn('foo', pyspark.sql.functions.current_timestamp())
df.printSchema()
df.show()

# Use the new Spark 2 native method (on 2.0.0 this writes the timestamp as epoch microseconds)
df.write.csv(path='/tmp/foo', mode='overwrite')

# Use the Databricks CSV method, as used pre Spark 2
df.write.save(path='/tmp/foo2', format='com.databricks.spark.csv', mode='overwrite')
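
A workaround that sidesteps the writer option entirely (a minimal sketch of my own, not from the original post; it assumes the timestamp column is named foo as in the repro above, and the /tmp/foo3 path is just for illustration) is to render the timestamp into a string with date_format before writing, so the CSV writer only ever sees a plain string:

import pyspark.sql.functions as F

# Format the timestamp as a millisecond-precision string; the CSV
# writer then emits the string as-is instead of an epoch value.
df_str = df.withColumn('foo', F.date_format('foo', 'yyyy-MM-dd HH:mm:ss.SSS'))
df_str.write.csv(path='/tmp/foo3', mode='overwrite')

date_format accepts any Java SimpleDateFormat pattern, so the same approach covers other target formats as well.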

It turned out that the docs I was looking at were for 2.0.1, while I was running on 2.0.0, and timestampFormat is new in 2.0.1.
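
On 2.0.1 and later the option is accepted as documented; a minimal sketch (the pattern string here is my own choice, anything SimpleDateFormat understands should work):

# Spark >= 2.0.1 only: timestampFormat is passed through to the CSV writer
df.write.csv(path='/tmp/foo', mode='overwrite',
             timestampFormat='yyyy-MM-dd HH:mm:ss.SSS')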
