Split time-series dataframe
I have a dataframe with different parameters as columns and a timestamp for each row of parameters.
What I want to do is split the dataframe into windows, where the column values from each row within a window all get appended into a single row. This will enable me to run clustering using these as features.
For example, I want to transform a dataframe like this (window size 3):
2017-01-01 00:00:01, a1, b1, c1
2017-01-01 00:00:02, a2, b2, c2
2017-01-01 00:00:03, a3, b3, c3
2017-01-01 00:00:04, a4, b4, c4
2017-01-01 00:00:05, a5, b5, c5
2017-01-01 00:00:06, a6, b6, c6
2017-01-01 00:00:07, a7, b7, c7
Into something like this:
2017-01-01 00:00:01, 2017-01-01 00:00:03, a1, a2, a3, b1, b2, b3, c1, c2, c3
2017-01-01 00:00:04, 2017-01-01 00:00:06, a4, a5, a6, b4, b5, b6, c4, c5, c6
After clustering, I need to preserve the information about which time interval belongs to which cluster, which is why I also need to keep the time ranges. The last instant in the example was excluded because there are not enough measurements to create another window.
How can I do this using Spark?
Let's start with some data, according to your description:
from pyspark.sql.functions import unix_timestamp
df = sc.parallelize([("2017-01-01 00:00:01", 2.0, 2.0, 2.0),
                     ("2017-01-01 00:00:08", 9.0, 9.0, 9.0),
                     ("2017-01-01 00:00:02", 3.0, 3.0, 3.0),
                     ("2017-01-01 00:00:03", 4.0, 4.0, 4.0),
                     ("2017-01-01 00:00:04", 5.0, 5.0, 5.0),
                     ("2017-01-01 00:00:05", 6.0, 6.0, 6.0),
                     ("2017-01-01 00:00:06", 7.0, 7.0, 7.0),
                     ("2017-01-01 00:00:07", 8.0, 8.0, 8.0)]).toDF(["time", "a", "b", "c"])
df = df.withColumn("time", unix_timestamp("time", "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
> Spark 2.0
We could generate a new interval column using the ceil() function, by which we can then group your data and collect all the other variables into one flat list.

To guarantee correct ordering inside the resulting lists, irrespective of the initial order, we'll use Window functions to partition your data by date, creating a rank column ordered by time.
from pyspark.sql.window import Window
from pyspark.sql.functions import ceil
df = df.withColumn("date", df["time"].cast("date")) \
       .withColumn("interval", ((ceil(df["time"].cast("long") / 3)) * 3.0).cast("timestamp"))
window = Window.partitionBy(df['date']).orderBy(df['time'])
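To see why this bucketing works, here is a minimal plain-Python sketch of the same arithmetic, with epoch seconds reduced to small offsets for readability (the real column holds full unix timestamps, but the grouping behaviour is identical):

import math

# ceil(epoch_seconds / 3) * 3 assigns every row to the end of its 3-second bucket:
# seconds 1, 2, 3 map to bucket 3; seconds 4, 5, 6 map to bucket 6; second 7 maps to 9.
for t in range(1, 8):
    print(t, int(math.ceil(t / 3.0)) * 3)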
Because we'll collect the rank column into the nested list for correct ordering, we'll define a udf that eventually unpacks all values in the nested lists except the first one, which is the rank:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

def unnest(col):
    # drop the leading rank from each inner list, then flatten the rest
    l = [item[1:] for item in col]
    res = [item for sublist in l for item in sublist]
    return res

unnest_udf = udf(unnest, ArrayType(DoubleType()))  # array return type instead of the default string
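As a quick sanity check of the unpacking logic in plain Python (toy values matching the example data, with the rank in position 0 of each inner list):

rows = [[1, 2.0, 2.0, 2.0],
        [2, 3.0, 3.0, 3.0],
        [3, 4.0, 4.0, 4.0]]
print(unnest(rows))  # [2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0]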
Now we put everything together:
from pyspark.sql.functions import rank
from pyspark.sql.functions import collect_list, array
df.select('*', rank().over(window).alias('rank')) \
  .groupBy("interval") \
  .agg(collect_list(array("rank", "a", "b", "c")).alias("vals")) \
  .withColumn("vals", unnest_udf("vals")) \
  .sort("interval") \
  .show(truncate=False)
+---------------------+---------------------------------------------+
|interval |vals |
+---------------------+---------------------------------------------+
|2017-01-01 00:00:03.0|[2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0]|
|2017-01-01 00:00:06.0|[5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0]|
|2017-01-01 00:00:09.0|[8.0, 8.0, 8.0, 9.0, 9.0, 9.0] |
+---------------------+---------------------------------------------+
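The question also needs the time range of each window so that clusters can later be mapped back to intervals. That isn't shown above, but one possible extension (a sketch; window_start and window_end are names introduced here for illustration) is to keep the first and last timestamp of every group with min/max aggregates:

from pyspark.sql.functions import min as min_, max as max_

# same aggregation, but also keeping each window's start and end timestamps
df.select('*', rank().over(window).alias('rank')) \
  .groupBy("interval") \
  .agg(min_("time").alias("window_start"),
       max_("time").alias("window_end"),
       collect_list(array("rank", "a", "b", "c")).alias("vals")) \
  .withColumn("vals", unnest_udf("vals")) \
  .sort("interval") \
  .show(truncate=False)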
> Spark 1.6
We cannot use array as an argument inside collect_list(), so we'll just wrap the collect_list() calls inside array, instead of the other way around. We'll also slightly modify our udf, because we won't explicitly need the rank column using this approach.
unpack_udf = udf(
    lambda l: [item for sublist in l for item in sublist],
    ArrayType(DoubleType())  # flatten the per-column lists into one array of doubles
)
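A plain-Python illustration of the lambda with toy values, where each inner list is one column's collect_list result:

cols = [[2.0, 3.0, 4.0],   # collect_list("a")
        [2.0, 3.0, 4.0],   # collect_list("b")
        [2.0, 3.0, 4.0]]   # collect_list("c")
print([item for sublist in cols for item in sublist])
# [2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0]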
df.select('*', rank().over(window).alias('rank')) \
  .groupBy("interval") \
  .agg(array(collect_list("a"),
             collect_list("b"),
             collect_list("c")).alias("vals")) \
  .withColumn("vals", unpack_udf("vals")) \
  .sort("interval") \
  .show(truncate=False)
+---------------------+---------------------------------------------+
|interval |vals |
+---------------------+---------------------------------------------+
|2017-01-01 00:00:03.0|[2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0]|
|2017-01-01 00:00:06.0|[5.0, 6.0, 7.0, 5.0, 6.0, 7.0, 5.0, 6.0, 7.0]|
|2017-01-01 00:00:09.0|[8.0, 9.0, 8.0, 9.0, 8.0, 9.0] |
+---------------------+---------------------------------------------+
Note that the vals column is now ordered in a different way, yet consistently, thanks to the window function we defined earlier.
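Finally, since the stated goal is clustering on these rows, the flat vals array still has to become a fixed-length feature vector. Below is a minimal sketch under some assumptions: it uses the Spark 2.x pyspark.ml API, grouped is a hypothetical name for the aggregated dataframe produced above, and incomplete windows are dropped first because KMeans needs vectors of equal length (9 = 3 rows x 3 columns here):

from pyspark.sql.functions import size, udf
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.clustering import KMeans

# keep only complete windows, then turn the flat array of doubles into an ML vector
to_vector = udf(lambda vals: Vectors.dense(vals), VectorUDT())
features = grouped.filter(size("vals") == 9).withColumn("features", to_vector("vals"))

model = KMeans(k=2, featuresCol="features").fit(features)
model.transform(features).select("interval", "prediction").show()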