
PySpark: use a regex function to fix a timestamp column

Using PySpark, I've got a string column whose values look like this:

+-------------------------+
|2022-12-07050641         |
+-------------------------+

But I need it to be in this format:

+-------------------------+
|2022-11-11 08:48:00.707  |
+-------------------------+

It seems that to_timestamp() requires the input string to already match a timestamp format.

I've been trying to use to_timestamp() to convert the string to a timestamp, but it returns null. I figure it's because of the format of the value (2022-12-07050641), which has no separator between the date and the time. How can I use a regex to rewrite the value into the desired format?

To convert the string '2022-12-07050641' to a timestamp in PySpark, use a regular expression to extract the date part and the time part, join them with a space, and pass the result to to_timestamp together with a matching format string:

from pyspark.sql.functions import concat_ws, regexp_extract, to_timestamp

# Pattern with two groups: the date (yyyy-MM-dd) and the six time digits (HHmmss)
pattern = r'(\d{4}-\d{2}-\d{2})(\d{6})'

# Extract the date and time parts into helper columns
df = df.withColumn('date', regexp_extract('string_column', pattern, 1))
df = df.withColumn('time', regexp_extract('string_column', pattern, 2))

# Join the parts with a space and parse them.
# Note: `+` on Columns is numeric addition, not string concatenation,
# so df['date'] + ' ' + df['time'] would yield null; use concat_ws instead.
df = df.withColumn(
    'timestamp',
    to_timestamp(concat_ws(' ', 'date', 'time'), 'yyyy-MM-dd HHmmss')
)

# Drop the helper columns
df = df.drop('date', 'time')
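
Before running this on a cluster, it can help to check the pattern and the format against a sample value in plain Python. The sketch below is only a local sanity check, not part of the Spark job: it uses the standard re and datetime modules, the helper name fix_timestamp is hypothetical, and returning None for unparseable input is an assumption meant to mimic the nulls Spark produced in the question.

```python
import re
from datetime import datetime

# Same two groups as the Spark pattern: date part, then six time digits.
PATTERN = re.compile(r'(\d{4}-\d{2}-\d{2})(\d{6})')


def fix_timestamp(raw):
    """Return 'yyyy-MM-dd HH:mm:ss' for input like '2022-12-07050641', else None.

    Hypothetical helper for local testing only.
    """
    match = PATTERN.fullmatch(raw.strip())
    if match is None:
        return None  # the value does not have the expected shape
    date_part, time_part = match.groups()
    try:
        parsed = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H%M%S")
    except ValueError:
        return None  # right shape, but not a real date or time (e.g. month 13)
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


print(fix_timestamp("2022-12-07050641"))  # 2022-12-07 05:06:41
print(fix_timestamp("not a timestamp"))   # None
```

If the sample values parse as expected here, the same grouping should carry over to regexp_extract, with the Spark format 'yyyy-MM-dd HHmmss' playing the role of '%Y-%m-%d %H%M%S'.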
