I have a PySpark DataFrame with several fields, including a `timestamp` field holding Unix epoch seconds:
from pyspark.sql.types import StructType, StructField, LongType

sdata = sc.parallelize([
    [('id', 1), ('timestamp', 1506339960), ('data_pk', 111)],
    [('id', 2), ('timestamp', 1506340140), ('data_pk', 222)],
    ...
])

# Extract the values from each (key, value) pair into a plain tuple
sdata_converted = sdata.map(lambda x: (x[0][1], x[1][1], x[2][1]))

# Define the schema
sschema = StructType([
    StructField("id", LongType(), True),
    StructField("timestamp", LongType(), True),
    StructField("data_pk", LongType(), True)
])

df = sqlContext.createDataFrame(sdata_converted, sschema)
How can I select only those rows that refer to the last 12 hours?
First, compute the timestamp of 12 hours ago as a string:

import datetime

twelve_hours = (datetime.datetime.now()
                - datetime.timedelta(hours=12)).strftime("%Y-%m-%d %H:%M:%S")
then use it in a filter. Note that the filter must reference the newly cast column (e.g. via col), not df.timestamp, which still resolves to the original LongType epoch column and would be compared against the date string incorrectly:

from pyspark.sql.functions import col

df_new = df.withColumn("timestamp", df.timestamp.cast("timestamp")) \
           .filter(col("timestamp") > twelve_hours)
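Since the original `timestamp` column already holds Unix epoch seconds, an alternative is to skip the cast entirely and compare against an epoch-seconds cutoff. Here is a minimal pure-Python sketch of that 12-hour-window logic, using hypothetical sample rows that mirror the question's (id, timestamp, data_pk) schema:

```python
import datetime

# Cutoff expressed in epoch seconds, directly comparable to the LongType column
cutoff = int((datetime.datetime.now() - datetime.timedelta(hours=12)).timestamp())

# Hypothetical rows mirroring the (id, timestamp, data_pk) schema
rows = [
    (1, 1506339960, 111),                                # fixed 2017 timestamp, outside the window
    (2, int(datetime.datetime.now().timestamp()), 222),  # "now", inside the window
]

# Keep only rows whose epoch timestamp falls within the last 12 hours
recent = [r for r in rows if r[1] > cutoff]
print([r[0] for r in recent])  # prints [2]
```

In PySpark the equivalent would simply be `df.filter(df.timestamp > cutoff)`, with no cast to a timestamp type needed.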