Split time-series dataframe
I have a dataframe with different parameters as columns and a timestamp for each row of parameters.
What I want to do is split the dataframe into windows, where the column values from each row within a window all get appended into a single row. This will enable me to run clustering using these as features.
For example, I want to transform a dataframe like this (window size 3):
2017-01-01 00:00:01, a1, b1, c1
2017-01-01 00:00:02, a2, b2, c2
2017-01-01 00:00:03, a3, b3, c3
2017-01-01 00:00:04, a4, b4, c4
2017-01-01 00:00:05, a5, b5, c5
2017-01-01 00:00:06, a6, b6, c6
2017-01-01 00:00:07, a7, b7, c7
Into something like this:
2017-01-01 00:00:01, 2017-01-01 00:00:03, a1, a2, a3, b1, b2, b3, c1, c2, c3
2017-01-01 00:00:04, 2017-01-01 00:00:06, a4, a5, a6, b4, b5, b6, c4, c5, c6
After clustering, I need to preserve the information about which time interval belongs to which cluster, which is why I also need to keep the time ranges. The last instant in the example was excluded because there are not enough measurements to create another window.
How can I do this using Spark?
Let's start with some data, according to your description:
from pyspark.sql.functions import unix_timestamp
df = sc.parallelize([("2017-01-01 00:00:01", 2.0, 2.0, 2.0),
                     ("2017-01-01 00:00:08", 9.0, 9.0, 9.0),
                     ("2017-01-01 00:00:02", 3.0, 3.0, 3.0),
                     ("2017-01-01 00:00:03", 4.0, 4.0, 4.0),
                     ("2017-01-01 00:00:04", 5.0, 5.0, 5.0),
                     ("2017-01-01 00:00:05", 6.0, 6.0, 6.0),
                     ("2017-01-01 00:00:06", 7.0, 7.0, 7.0),
                     ("2017-01-01 00:00:07", 8.0, 8.0, 8.0)]).toDF(["time", "a", "b", "c"])
df = df.withColumn("time", unix_timestamp("time", "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
> Spark 2.0
We could generate a new interval column using the ceil() function, by which we can then group your data and collect all the other variables into one flat list.

To guarantee correct ordering inside the resulting lists, irrespective of the initial order, we'll use Window functions to partition your data by date, creating a rank column ordered by time.
from pyspark.sql.window import Window
from pyspark.sql.functions import ceil
df = df.withColumn("date", df["time"].cast("date")) \
       .withColumn("interval", ((ceil(df["time"].cast("long") / 3)) * 3.0).cast("timestamp"))
window = Window.partitionBy(df['date']).orderBy(df['time'])
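To see why this bucketing works, here is a minimal plain-Python sketch of the same arithmetic, with epoch seconds reduced to small offsets for readability (the real column holds full unix timestamps, but the grouping behaviour is identical):

import math

# ceil(epoch_seconds / 3) * 3 assigns every row to the end of its 3-second bucket:
# seconds 1, 2, 3 map to bucket 3; seconds 4, 5, 6 map to bucket 6; second 7 maps to 9.
for t in range(1, 8):
    print(t, int(math.ceil(t / 3.0)) * 3)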
Because we'll collect the rank column into the nested list for correct ordering, we'll define a udf that eventually unpacks all values in the nested lists except the first one, which is the rank:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

def unnest(col):
    # drop the leading rank from each inner list, then flatten the rest
    l = [item[1:] for item in col]
    res = [item for sublist in l for item in sublist]
    return res

unnest_udf = udf(unnest, ArrayType(DoubleType()))  # array return type instead of the default string
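As a quick sanity check of the unpacking logic in plain Python (toy values matching the example data, with the rank in position 0 of each inner list):

rows = [[1, 2.0, 2.0, 2.0],
        [2, 3.0, 3.0, 3.0],
        [3, 4.0, 4.0, 4.0]]
print(unnest(rows))  # [2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0]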
Now we put everything together:
from pyspark.sql.functions import rank
from pyspark.sql.functions import collect_list, array
df.select('*', rank().over(window).alias('rank')) \
  .groupBy("interval") \
  .agg(collect_list(array("rank", "a", "b", "c")).alias("vals")) \
  .withColumn("vals", unnest_udf("vals")) \
  .sort("interval") \
  .show(truncate=False)
+---------------------+---------------------------------------------+
|interval |vals |
+---------------------+---------------------------------------------+
|2017-01-01 00:00:03.0|[2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0]|
|2017-01-01 00:00:06.0|[5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0]|
|2017-01-01 00:00:09.0|[8.0, 8.0, 8.0, 9.0, 9.0, 9.0] |
+---------------------+---------------------------------------------+
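The question also needs the time range of each window so that clusters can later be mapped back to intervals. That isn't shown above, but one possible extension (a sketch; window_start and window_end are names introduced here for illustration) is to keep the first and last timestamp of every group with min/max aggregates:

from pyspark.sql.functions import min as min_, max as max_

# same aggregation, but also keeping each window's start and end timestamps
df.select('*', rank().over(window).alias('rank')) \
  .groupBy("interval") \
  .agg(min_("time").alias("window_start"),
       max_("time").alias("window_end"),
       collect_list(array("rank", "a", "b", "c")).alias("vals")) \
  .withColumn("vals", unnest_udf("vals")) \
  .sort("interval") \
  .show(truncate=False)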
> Spark 1.6
We cannot use array as an argument inside collect_list(), so we'll just wrap the collect_list() calls inside array, instead of the other way around. We'll also slightly modify our udf, because we won't explicitly need the rank column using this approach.
unpack_udf = udf(
    lambda l: [item for sublist in l for item in sublist],
    ArrayType(DoubleType())  # flatten the per-column lists into one array of doubles
)
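A plain-Python illustration of the lambda with toy values, where each inner list is one column's collect_list result:

cols = [[2.0, 3.0, 4.0],   # collect_list("a")
        [2.0, 3.0, 4.0],   # collect_list("b")
        [2.0, 3.0, 4.0]]   # collect_list("c")
print([item for sublist in cols for item in sublist])
# [2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0]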
df.select('*', rank().over(window).alias('rank')) \
  .groupBy("interval") \
  .agg(array(collect_list("a"),
             collect_list("b"),
             collect_list("c")).alias("vals")) \
  .withColumn("vals", unpack_udf("vals")) \
  .sort("interval") \
  .show(truncate=False)
+---------------------+---------------------------------------------+
|interval |vals |
+---------------------+---------------------------------------------+
|2017-01-01 00:00:03.0|[2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0]|
|2017-01-01 00:00:06.0|[5.0, 6.0, 7.0, 5.0, 6.0, 7.0, 5.0, 6.0, 7.0]|
|2017-01-01 00:00:09.0|[8.0, 9.0, 8.0, 9.0, 8.0, 9.0] |
+---------------------+---------------------------------------------+
Note that the vals column is now ordered in a different way, yet consistently, thanks to the window function we defined earlier.
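Finally, since the stated goal is clustering on these rows, the flat vals array still has to become a fixed-length feature vector. Below is a minimal sketch under some assumptions: it uses the Spark 2.x pyspark.ml API, grouped is a hypothetical name for the aggregated dataframe produced above, and incomplete windows are dropped first because KMeans needs vectors of equal length (9 = 3 rows x 3 columns here):

from pyspark.sql.functions import size, udf
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.clustering import KMeans

# keep only complete windows, then turn the flat array of doubles into an ML vector
to_vector = udf(lambda vals: Vectors.dense(vals), VectorUDT())
features = grouped.filter(size("vals") == 9).withColumn("features", to_vector("vals"))

model = KMeans(k=2, featuresCol="features").fit(features)
model.transform(features).select("interval", "prediction").show()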