
Specify format of Timestamp written by pyspark

Background is a simple pyspark program that I developed on 1.6 using the Databricks CSV reader/writer, and all was happy. My dataframe had a timestamp column, which was written out in a standard YYYY-MM-DD HH24:MI:SS format.

foo,bar,2016-10-14 14:30:31.985 

Now I'm running it on EMR with Spark 2, and the timestamp column is being written as an epoch in microseconds. This causes a problem because the target (Redshift) can't natively handle this; it only handles seconds or milliseconds.

foo,bar,1476455559456000

Looking at the docs, it seems I should be able to specify the format used with timestampFormat, but I just get an error:

TypeError: csv() got an unexpected keyword argument 'timestampFormat'

Am I calling this wrong, or does the option not exist? Is there any other way to cleanly get my timestamp data out in a format that's not microseconds? (Milliseconds would be fine, or any other standard time format, really; one possible workaround is sketched after the repro code below.)


Simple code to reproduce:

import pyspark.sql.functions

# sqlContext is assumed to already exist, as in a pyspark shell
df = sqlContext.createDataFrame([('foo', 'bar')]).withColumn('foo', pyspark.sql.functions.current_timestamp())
df.printSchema()
df.show()

# Use the new Spark 2 native method (on 2.0.0 this writes the timestamp as epoch microseconds)
df.write.csv(path='/tmp/foo', mode='overwrite')

# Use the Databricks CSV method, as used pre Spark 2
df.write.save(path='/tmp/foo2', format='com.databricks.spark.csv', mode='overwrite')
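
A workaround that sidesteps the writer option entirely (a minimal sketch of my own, not from the original post; it assumes the timestamp column is named foo as in the repro above, and the /tmp/foo3 path is just for illustration) is to render the timestamp into a string with date_format before writing, so the CSV writer only ever sees a plain string:

import pyspark.sql.functions as F

# Format the timestamp as a millisecond-precision string; the CSV
# writer then emits the string as-is instead of an epoch value.
df_str = df.withColumn('foo', F.date_format('foo', 'yyyy-MM-dd HH:mm:ss.SSS'))
df_str.write.csv(path='/tmp/foo3', mode='overwrite')

date_format accepts any Java SimpleDateFormat pattern, so the same approach covers other target formats as well.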

It turned out that the docs I was looking at were for 2.0.1, while I was running on 2.0.0, and timestampFormat is new in 2.0.1.
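
On 2.0.1 and later the option is accepted as documented; a minimal sketch (the pattern string here is my own choice, anything SimpleDateFormat understands should work):

# Spark >= 2.0.1 only: timestampFormat is passed through to the CSV writer
df.write.csv(path='/tmp/foo', mode='overwrite',
             timestampFormat='yyyy-MM-dd HH:mm:ss.SSS')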
