
How to filter DataFrame to select only the last 12 hours?

I have a PySpark DataFrame with several fields, including a timestamp field.

sdata = sc.parallelize([ 
[('id', 1), ('timestamp', 1506339960), ('data_pk', 111)],
[('id', 2), ('timestamp', 1506340140), ('data_pk', 222)],
...
])
# Keep only the values from each (name, value) pair: (id, timestamp, data_pk)
sdata_converted = sdata.map(lambda x: (x[0][1], x[1][1], x[2][1]))

from pyspark.sql.types import StructType, StructField, LongType

# Define schema
sschema = StructType([
    StructField("id", LongType(), True),
    StructField("timestamp", LongType(), True),
    StructField("data_pk", LongType(), True)
])

df = sqlContext.createDataFrame(sdata_converted, sschema)

How can I select only those rows that refer to the last 12 hours?

First, get the cutoff time of 12 hours ago as a formatted string:

import datetime

# Cutoff 12 hours before now, formatted as "YYYY-MM-DD HH:MM:SS"
twelve_hours = (datetime.datetime.now()
                - datetime.timedelta(hours=12)).strftime("%Y-%m-%d %H:%M:%S")

Then use it in a filter, referencing the column by name with col() so the comparison runs against the newly cast timestamp rather than the original long value:

from pyspark.sql.functions import col

df_new = df.withColumn("timestamp", df.timestamp.cast("timestamp")) \
           .filter(col("timestamp") > twelve_hours)
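
Alternatively, since the timestamp column already holds Unix epoch seconds, you can skip the cast and compare epoch values directly. A minimal sketch, assuming the df built above (the twelve_hours_ago_epoch and df_last_12h names are illustrative):

import time

# Cutoff expressed as Unix epoch seconds, 12 hours before now
twelve_hours_ago_epoch = int(time.time()) - 12 * 60 * 60

# Compare the raw LongType column against the numeric cutoff
df_last_12h = df.filter(df.timestamp >= twelve_hours_ago_epoch)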
