PySpark - Creating Specific Time Series Range for Groups
I have 1000 Tags that have a Timestamp and a Value. For each of the Tags the date range is '2020-01-01', however this is too much data for each Tag. I have a separate dataframe that has a Start and End date for each of the Tags in the first dataframe.
I only need the rows of the 1000-Tags dataframe that fall inside the date ranges from that second dataframe. I also need the time series data in the desired dataframe padded to 2 days prior to the Start date and 1 day after the End date.
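The padding rule can be sketched with plain-Python date arithmetic (the helper name is hypothetical, not part of the question's code):

```python
from datetime import date, timedelta

def padded_window(start: date, end: date) -> tuple[date, date]:
    # Pad 2 days before Start and 1 day after End, as required above.
    return start - timedelta(days=2), end + timedelta(days=1)

# Window for Tag 1 (Start 2020-05-02, End 2020-05-03).
lo, hi = padded_window(date(2020, 5, 2), date(2020, 5, 3))

# A timestamp is kept only if it falls inside [lo, hi].
keep = [d for d in (date(2020, 5, 1) + timedelta(days=i) for i in range(6))
        if lo <= d <= hi]
```

For Tag 1 this keeps 2020-05-01 through 2020-05-04 and drops days 5 and 6, which matches the desired output below.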
df1 = spark.createDataFrame(
[("Tag 1", "2020-05-01", 1), ("Tag 1000", "2021-02-01", 1),
("Tag 1", "2020-05-02", 2), ("Tag 1000", "2021-02-02", 2),
("Tag 1", "2020-05-03", 3), ("Tag 1000", "2021-02-03", 3),
("Tag 1", "2020-05-04", 4), ("Tag 1000", "2021-02-04", 4),
("Tag 1", "2020-05-05", 5), ("Tag 1000", "2021-02-05", 5),
("Tag 1", "2020-05-06", 6), ("Tag 1000", "2021-02-06", 6)],
["Tag", "Timestamp", "Value"])
df2 = spark.createDataFrame(
[("Tag 1", "2020-05-02", "2020-05-03"), ("Tag 1000", "2021-02-03", "2021-02-04")],
["Tag", "Start", "End"])
Desired Dataframe:
print(df1)
Tag      Timestamp  Value
Tag 1    2020-05-01 1
Tag 1    2020-05-02 2
Tag 1    2020-05-03 3
Tag 1    2020-05-04 4  # Notice days 5 and 6 are not in the df
Tag 1000 2021-02-01 1
Tag 1000 2021-02-02 2
Tag 1000 2021-02-03 3
Tag 1000 2021-02-04 4
Tag 1000 2021-02-05 5  # Notice day 6 is not in the df
Doing this will only give me the dates that I need based on the second dataframe and will eliminate millions of rows that I will not be analyzing. So far, all I understand is how to create the window:
w = Window().partitionBy("Tag").orderBy("Timestamp")
You need to convert the Timestamp, Start, and End columns to DateType first using the to_date function, add the padding days using date_add, and finally join both dataframes where the date of the timestamp column is between the padded start and end:
from pyspark.sql.functions import col, to_date, date_add
# convert to DateType
df1 = df1.withColumn('timestamp', to_date(col('Timestamp'), "yyyy-MM-dd"))
# convert to DateType then add padding days
df2 = (df2
.withColumn("start_date", date_add(to_date(col('Start'), 'yyyy-MM-dd'), -2))
.withColumn("end_date", date_add(to_date(col("End"), 'yyyy-MM-dd'), 1)))
df1 = df1.join(df2.withColumnRenamed('Tag', 'Tag2'),
[col('Tag') == col('Tag2'), col('timestamp').between(col('start_date'), col('end_date'))],
'left_semi')
df1.show()
+--------+----------+-----+
| Tag| timestamp|Value|
+--------+----------+-----+
|Tag 1000|2021-02-01| 1|
|Tag 1000|2021-02-02| 2|
|Tag 1000|2021-02-03| 3|
|Tag 1000|2021-02-04| 4|
|Tag 1000|2021-02-05| 5|
| Tag 1|2020-05-01| 1|
| Tag 1|2020-05-02| 2|
| Tag 1|2020-05-03| 3|
| Tag 1|2020-05-04| 4|
+--------+----------+-----+
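The left_semi join keeps only the rows of df1 that have a match in df2, without adding df2's columns to the result. The same keep/drop rule can be verified in plain Python (a sketch with the sample data hard-coded, independent of Spark):

```python
from datetime import date, timedelta

# The sample rows from df1 and df2 above.
df1_rows = [("Tag 1", date(2020, 5, d), d) for d in range(1, 7)] + \
           [("Tag 1000", date(2021, 2, d), d) for d in range(1, 7)]
df2_rows = {"Tag 1": (date(2020, 5, 2), date(2020, 5, 3)),
            "Tag 1000": (date(2021, 2, 3), date(2021, 2, 4))}

kept = []
for tag, ts, value in df1_rows:
    start, end = df2_rows[tag]
    # Same condition as the join: timestamp between Start - 2 days and End + 1 day.
    if start - timedelta(days=2) <= ts <= end + timedelta(days=1):
        kept.append((tag, ts, value))
```

This reproduces the 9 rows shown in the output: Tag 1 keeps days 1-4 and Tag 1000 keeps days 1-5. If you also needed the Start/End columns in the result, an 'inner' join instead of 'left_semi' would keep them (at the cost of carrying df2's columns along).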