
How to filter DataFrame to select only the last 12 hours?

I have a PySpark DataFrame with several fields, including a timestamp field.

sdata = sc.parallelize([ 
[('id', 1), ('timestamp', 1506339960), ('data_pk', 111)],
[('id', 2), ('timestamp', 1506340140), ('data_pk', 222)],
...
])
# Keep only the values from each (name, value) pair: (id, timestamp, data_pk)
sdata_converted = sdata.map(lambda x: (x[0][1], x[1][1], x[2][1]))

from pyspark.sql.types import StructType, StructField, LongType

# Define schema
sschema = StructType([
    StructField("id", LongType(), True),
    StructField("timestamp", LongType(), True),
    StructField("data_pk", LongType(), True)
])

df = sqlContext.createDataFrame(sdata_converted, sschema)

How can I select only those rows that refer to the last 12 hours?

First, get the cutoff time of 12 hours ago as a formatted string:

import datetime

# Cutoff 12 hours before now, formatted as "YYYY-MM-DD HH:MM:SS"
twelve_hours = (datetime.datetime.now()
                - datetime.timedelta(hours=12)).strftime("%Y-%m-%d %H:%M:%S")

Then use it in a filter, referencing the column by name with col() so the comparison runs against the newly cast timestamp rather than the original long value:

from pyspark.sql.functions import col

df_new = df.withColumn("timestamp", df.timestamp.cast("timestamp")) \
           .filter(col("timestamp") > twelve_hours)
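
Alternatively, since the timestamp column already holds Unix epoch seconds, you can skip the cast and compare epoch values directly. A minimal sketch, assuming the df built above (the twelve_hours_ago_epoch and df_last_12h names are illustrative):

import time

# Cutoff expressed as Unix epoch seconds, 12 hours before now
twelve_hours_ago_epoch = int(time.time()) - 12 * 60 * 60

# Compare the raw LongType column against the numeric cutoff
df_last_12h = df.filter(df.timestamp >= twelve_hours_ago_epoch)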
