
pySpark Timestamp as String to DateTime

I read from a CSV where the column time contains a timestamp with milliseconds, '1414250523582'. When I use TimestampType in the schema it returns NULL. The only way it reads my data is to use StringType.
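For reference, roughly how I read it (a simplified sketch; the file name and single-column schema are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Read the epoch-milliseconds value as a plain string; with TimestampType
# in the schema the parser cannot handle it and yields NULL.
schema = StructType([StructField("time", StringType())])
df = spark.read.csv("data.csv", header=True, schema=schema)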

Now I need this value to be a datetime for further processing. First I got rid of the too-long timestamp with this:

df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))

A schema check says it's an integer now. Now I try to make it a datetime with

df3 = df2.withColumn("date", datetime.fromtimestamp(col("time")))

it returns

TypeError: an integer is required (got type Column)

When I google, people always just use col("x") to read and transform data, so what am I doing wrong here?

The schema check is a bit tricky: the data in that column may be pyspark.sql.types.IntegerType, but that is not equivalent to Python's int type. The col function returns a pyspark.sql.column.Column object, which does not play nicely with vanilla Python functions like datetime.fromtimestamp. This explains the TypeError: even though the "date" data in the actual rows is an integer, col doesn't let you access it as an integer to feed into a Python function quite so simply.

To apply arbitrary Python code to that integer value you can compile a udf pretty easily, but in this case pyspark.sql.functions already has a solution for your unix timestamp. Try this (note it uses the truncated "date" column from df2, since "time" still holds milliseconds):

df3 = df2.withColumn("date", from_unixtime(col("date")))

and you should see a nice date in 2014 for your example.
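For completeness, a rough sketch of the udf route mentioned above (a sketch, assuming the truncated integer column from df2; from_unixtime remains the simpler choice):

from datetime import datetime

from pyspark.sql.functions import col, udf
from pyspark.sql.types import TimestampType

# Wrap the plain Python conversion in a udf so Spark can apply it row by row;
# the return type has to be declared explicitly.
to_dt = udf(lambda ts: datetime.fromtimestamp(ts) if ts is not None else None,
            TimestampType())

df3 = df2.withColumn("date", to_dt(col("date")))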

Small note: this "date" column will be of StringType.
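If a true TimestampType is needed, one option (an assumption, not part of the original answer) is to cast the integer seconds directly, skipping the intermediate string:

from pyspark.sql.functions import col

# Casting whole seconds to "timestamp" interprets them as seconds since the epoch.
df3 = df2.withColumn("date", col("date").cast("timestamp"))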
