
pySpark Timestamp as String to DateTime

I read from a CSV where the column time contains a timestamp with milliseconds, '1414250523582'. When I use TimestampType in the schema it returns NULL. The only way it reads my data is to use StringType.
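For reference, roughly how I read it (a simplified sketch; the file name and single-column schema are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Read the epoch-milliseconds value as a plain string; with TimestampType
# in the schema the parser cannot handle it and yields NULL.
schema = StructType([StructField("time", StringType())])
df = spark.read.csv("data.csv", header=True, schema=schema)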

Now I need this value to be a datetime for further processing. First I got rid of the too-long timestamp with this:

df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))

A schema check says it's an integer now. Now I try to make it a datetime with

df3 = df2.withColumn("date", datetime.fromtimestamp(col("time")))

it returns

TypeError: an integer is required (got type Column)

When I google, people always just use col("x") to read and transform data, so what am I doing wrong here?

The schema check is a bit tricky: the data in that column may be pyspark.sql.types.IntegerType, but that is not equivalent to Python's int type. The col function returns a pyspark.sql.column.Column object, which does not play nicely with vanilla Python functions like datetime.fromtimestamp. This explains the TypeError: even though the "date" data in the actual rows is an integer, col doesn't let you access it as an integer to feed into a Python function quite so simply.

To apply arbitrary Python code to that integer value you can compile a udf pretty easily, but in this case pyspark.sql.functions already has a solution for your unix timestamp. Try this (note it uses the truncated "date" column from df2, since "time" still holds milliseconds):

df3 = df2.withColumn("date", from_unixtime(col("date")))

and you should see a nice date in 2014 for your example.
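For completeness, a rough sketch of the udf route mentioned above (a sketch, assuming the truncated integer column from df2; from_unixtime remains the simpler choice):

from datetime import datetime

from pyspark.sql.functions import col, udf
from pyspark.sql.types import TimestampType

# Wrap the plain Python conversion in a udf so Spark can apply it row by row;
# the return type has to be declared explicitly.
to_dt = udf(lambda ts: datetime.fromtimestamp(ts) if ts is not None else None,
            TimestampType())

df3 = df2.withColumn("date", to_dt(col("date")))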

Small note: this "date" column will be of StringType.
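If a true TimestampType is needed, one option (an assumption, not part of the original answer) is to cast the integer seconds directly, skipping the intermediate string:

from pyspark.sql.functions import col

# Casting whole seconds to "timestamp" interprets them as seconds since the epoch.
df3 = df2.withColumn("date", col("date").cast("timestamp"))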
