
How to transform data with sliding window over time series data in Pyspark

I am trying to extract features based on a sliding window over time series data. In Scala, it seems there is a sliding function, based on this post and the documentation:

import org.apache.spark.mllib.rdd.RDDFunctions._

sc.parallelize(1 to 100, 10)
  .sliding(3)
  .map(curSlice => (curSlice.sum / curSlice.size))
  .collect()

My question is: is there a similar function in PySpark? Or, if there is no such function yet, how can we achieve a similar sliding window transformation?

As far as I can tell, the sliding function is not available from Python, and SlidingRDD is a private class that cannot be accessed outside MLlib.

If you want to use sliding on an existing RDD, you can create a poor man's sliding window like this:

def sliding(rdd, n):
    assert n > 0
    def gen_window(xi, n):
        x, i = xi
        return [(i - offset, (i, x)) for offset in range(n)]

    return (
        rdd.
        zipWithIndex(). # Add index
        flatMap(lambda xi: gen_window(xi, n)). # Generate pairs with offset
        groupByKey(). # Group to create windows
        # Sort values to ensure order inside window and drop indices
        mapValues(lambda vals: [x for (i, x) in sorted(vals)]).
        sortByKey(). # Sort to make sure we keep the original order
        values(). # Get values
        filter(lambda x: len(x) == n)) # Drop incomplete windows at the edges
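
For example, here is a quick sanity check that mirrors the Scala moving-average snippet above (this assumes a SparkContext named sc, as in the question):

rdd = sc.parallelize(range(1, 101), 10)

# Average each window of 3 consecutive values
moving_averages = (
    sliding(rdd, 3)
    .map(lambda window: sum(window) / len(window))
    .collect()
)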

Alternatively, you can try something like this (with a little help from toolz):

from toolz.itertoolz import sliding_window, concat

def sliding2(rdd, n):
    assert n > 1

    def get_last_el(i, it):
        """Return the last n - 1 elements from the partition"""
        return [(i, [x for x in it][(-n + 1):])]

    def slide(i, it):
        """Prepend previous items and return sliding windows"""
        return sliding_window(n, concat([last_items.value[i - 1], it]))

    def clean_last_items(last_items):
        """Adjust for empty or too small partitions"""
        clean = {-1: [None] * (n - 1)}  # Placeholder "previous partition" for partition 0
        for i in range(rdd.getNumPartitions()):
            clean[i] = (clean[i - 1] + list(last_items[i]))[(-n + 1):]
        return {k: tuple(v) for k, v in clean.items()}

    last_items = sc.broadcast(clean_last_items(
        rdd.mapPartitionsWithIndex(get_last_el).collectAsMap()))

    return rdd.mapPartitionsWithIndex(slide)
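
A similar usage sketch (again assuming a SparkContext named sc). Note that, as written above, the windows produced for the very first partition are padded with None at the start, so you may want to filter those out:

rdd = sc.parallelize(range(1, 101), 10)

windows = sliding2(rdd, 3)

# Drop the leading windows that contain the None padding
complete = windows.filter(lambda w: None not in w)

moving_averages = complete.map(lambda w: sum(w) / len(w)).collect()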

To add to venuktan's answer, here is how to create a time-based sliding window using Spark SQL and retain the full contents of the window, rather than taking an aggregate of it. I needed this in my use case of preprocessing time series data into sliding windows for input into Spark ML.

One limitation of this approach is that it assumes you want to take sliding windows over time.

First, you can create your Spark DataFrame, for example by reading in a CSV file:

df = spark.read.csv('foo.csv')

We assume that your CSV file has two columns: one is a Unix timestamp and the other is the column you want to extract sliding windows from.

from pyspark.sql import functions as f

window_duration = '1000 millisecond'
slide_duration = '500 millisecond'

df.withColumn("_c0", f.from_unixtime(f.col("_c0"))) \
    .groupBy(f.window("_c0", window_duration, slide_duration)) \
    .agg(f.collect_list(f.array('_c1'))) \
    .withColumnRenamed('collect_list(array(_c1))', 'sliding_window')
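
To keep working with the result, you would typically bind it to a name and inspect it. Here is a hypothetical continuation (the name windowed is just for illustration and is reused in the sketches below):

windowed = (
    df.withColumn("_c0", f.from_unixtime(f.col("_c0")))
      .groupBy(f.window("_c0", window_duration, slide_duration))
      .agg(f.collect_list(f.array('_c1')))
      .withColumnRenamed('collect_list(array(_c1))', 'sliding_window')
)

# Each row now holds the window struct and the list of values falling inside it
windowed.show(truncate=False)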

Bonus: to convert this array column to the DenseVector format required for Spark ML, see the UDF approach here.
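
As a rough idea of what such a UDF can look like (this is an assumption on my part, not necessarily the approach in the linked answer; it reuses the hypothetical windowed DataFrame from above):

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# Sketch only: flatten the array-of-arrays column and build a DenseVector.
# CSV values are read as strings, hence the float() cast.
to_vector = udf(
    lambda rows: Vectors.dense([float(x) for row in rows for x in row]),
    VectorUDT()
)

vectorized = windowed.withColumn('features', to_vector('sliding_window'))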

Extra bonus: to un-nest the resulting column, so that each element of your sliding window has its own column, try this approach here.
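
Again, a rough hypothetical sketch (it assumes every window holds exactly window_size values, which is only true for the complete, non-edge windows):

# Sketch only: pull each window element into its own column.
# The second [0] unwraps the single-element array produced by f.array('_c1') above.
window_size = 2
unnested = windowed.select(
    'window',
    *[f.col('sliding_window')[i][0].alias('element_{}'.format(i))
      for i in range(window_size)]
)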

I hope this helps; please let me know if I can clarify anything.

Spark 1.4 has window functions, as described here: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
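
For example, here is a minimal sketch of a row-based moving average using those window functions, for a DataFrame df (the column names ts and value are made up here):

from pyspark.sql import Window
from pyspark.sql import functions as f

# Moving average over the current row and its immediate neighbours, ordered by ts
w = Window.orderBy('ts').rowsBetween(-1, 1)

df.withColumn('moving_avg', f.avg('value').over(w))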

Hope that helps; please let me know.
