
How to aggregate timeseries data in Apache Spark

I have a dataset that contains a list of records holding a time period (represented in nanoseconds: two Longs, one for start, one for end) and a measured value. I need to create a new, aggregated dataset that holds just the periods where the value changes. For example:

    input dataset:
    +-----+-----+-----+
    |start|end  |value|
    +-----+-----+-----+
    |123  |124  |1    |
    |124  |128  |1    |
    |128  |300  |2    |
    |300  |400  |2    |
    |400  |500  |3    |

    result dataset:
    +-----+-----+-----+
    |start|end  |value|
    +-----+-----+-----+
    |123  |128  |1    |
    |128  |400  |2    |
    |400  |500  |3    |

I know how to do this on small datasets, but have no idea how to approach it with the MapReduce paradigm and Apache Spark.

Can you please give me a hint on how to achieve this in Apache Spark with Java?

It seems quite simple this way: find the min start and max end for each value with groupBy, then join the two resulting datasets back together.

// df is the original dataset
Dataset<Row> df_start = df.groupBy("value").min("start")
        .withColumnRenamed("min(start)", "start")
        .withColumnRenamed("value", "value_start");
Dataset<Row> df_end = df.groupBy("value").max("end")
        .withColumnRenamed("max(end)", "end")
        .withColumnRenamed("value", "value_end");

Dataset<Row> df_combined = df_start
        .join(df_end, df_start.col("value_start").equalTo(df_end.col("value_end")))
        .drop("value_end")
        .withColumnRenamed("value_start", "value")
        .orderBy("value");

df_combined.show(false);
+-----+-----+---+
|value|start|end|
+-----+-----+---+
|1    |123  |128|
|2    |128  |400|
|3    |400  |700|
+-----+-----+---+
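
For reference, here is a minimal sketch (my addition, not from the answer above) of how the df referenced in that snippet could be built, assuming a local SparkSession and the extended sample rows used later on this page (which match the output shown above):

import static java.util.Arrays.asList;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MinMaxJoinExample {
    public static void main(String[] args) {
        // Local session for the sketch; adjust master/appName as needed
        SparkSession session = SparkSession.builder()
                .appName("min-max-join-example")
                .master("local[*]")
                .getOrCreate();

        // Schema matching the question: start/end as Long (nanoseconds), value as Integer
        StructType schema = DataTypes.createStructType(asList(
                DataTypes.createStructField("start", DataTypes.LongType, false),
                DataTypes.createStructField("end", DataTypes.LongType, false),
                DataTypes.createStructField("value", DataTypes.IntegerType, false)));

        // Sample rows (question's rows plus the extra value-3 periods used in the answers)
        Dataset<Row> df = session.createDataFrame(asList(
                RowFactory.create(123L, 124L, 1),
                RowFactory.create(124L, 128L, 1),
                RowFactory.create(128L, 300L, 2),
                RowFactory.create(300L, 400L, 2),
                RowFactory.create(400L, 500L, 3),
                RowFactory.create(500L, 600L, 3),
                RowFactory.create(600L, 700L, 3)
        ), schema);

        // The groupBy/min/max/join snippet above can then be applied to df
        df.show(false);
    }
}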

One approach to this is to phrase your problem as "for each distinct value, find all the adjacent time ranges for that value and coalesce them". With that understanding you can use groupBy on the value to create a list of (start, end) pairs for each value. Then you can use a custom function to collapse these down into contiguous time ranges.

At the extreme end, if you use the disk-only persistence level on the dataset, the only requirement is that a single row of start_end pairs fits into memory. For most clusters, that puts the practical upper limit of this approach at gigabytes of start_end pairs per value.
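
As a small illustrative sketch (my addition, not part of the original answer), disk-only persistence could be requested on the grouped dataset before the collapse step; startEndsByValue here refers to the grouped Dataset<Row> built in the example below:

// Sketch: spill the collected start_end lists to disk instead of caching them in executor memory
startEndsByValue.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY());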

Here's an example implementation (using the Java API as requested - Scala would be quite a bit less verbose):

import static java.util.Arrays.asList;
import static org.apache.spark.sql.functions.collect_list;
import static org.apache.spark.sql.types.DataTypes.createStructField;
import static org.apache.spark.sql.types.DataTypes.createStructType;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.collection.JavaConversions;
import scala.collection.mutable.WrappedArray;

public class JavaSparkTest {

    public static void main(String[] args){
        SparkSession session = SparkSession.builder()
                .appName("test-changes-in-time")
                .master("local[*]")
                .getOrCreate();
        StructField start = createStructField("start", DataTypes.IntegerType, false);
        StructField end = createStructField("end", DataTypes.IntegerType, false);
        StructField value = createStructField("value", DataTypes.IntegerType, false);
        StructType inputSchema = createStructType(asList(start,end,value));
        StructType startEndSchema = createStructType(asList(start, end));
        // UDF that sorts the collected (start, end) structs by start and merges adjacent ranges
        session.udf().register("collapse_timespans", (WrappedArray<Row> startEnds) ->
                JavaConversions.asJavaCollection(startEnds).stream()
                    .sorted((a,b)->((Comparable)a.getAs("start")).compareTo(b.getAs("start")))
                    .collect(new StartEndRowCollapsingCollector()),
                DataTypes.createArrayType(startEndSchema)
        );
        Dataset<Row> input = session.createDataFrame(asList(
                RowFactory.create(123, 124, 1),
                RowFactory.create(124, 128, 1),
                RowFactory.create(128, 300, 2),
                RowFactory.create(300, 400, 2),
                RowFactory.create(400, 500, 3),
                RowFactory.create(500, 600, 3),
                RowFactory.create(600, 700, 3)
        ), inputSchema);
        Dataset<Row> startEndByValue = input.selectExpr("(start start, end end) start_end", "value");
        Dataset<Row> startEndsByValue = startEndByValue.groupBy("value").agg(collect_list("start_end").as("start_ends"));
        Dataset<Row> startEndsCollapsed = startEndsByValue.selectExpr("value", "explode(collapse_timespans(start_ends)) as start_end");
        Dataset<Row> startEndsInColumns = startEndsCollapsed.select("value", "start_end.start", "start_end.end");
        startEndsInColumns.show();
    }

    public static class StartEndRowCollapsingCollector implements Collector<Row, List<Row>, List<Row>>{

        @Override
        public Supplier<List<Row>> supplier() {
            return ()-> new ArrayList<Row>();
        }

        @Override
        public BiConsumer<List<Row>, Row> accumulator() {
            return (rowList, row) -> {
                // if there are no rows in the list, or the previous row's end doesn't match this row's start
                if(rowList.size()==0 ||
                        !rowList.get(rowList.size()-1).getAs(1).equals(row.getAs(0))){
                    rowList.add(row);
                } else {
                    Row lastRow = rowList.remove(rowList.size()-1);
                    rowList.add(RowFactory.create(lastRow.getAs(0), row.getAs(1)));
                }
            };
        }

        @Override
        public BinaryOperator<List<Row>> combiner() {
            return (a,b)->{ throw new UnsupportedOperationException();};
        }

        @Override
        public Function<List<Row>, List<Row>> finisher() {
            return i->i;
        }

        @Override
        public Set<Characteristics> characteristics() {
            return Collections.EMPTY_SET;
        }
    }
}

And the program output:

+-----+-----+---+
|value|start|end|
+-----+-----+---+
|    1|  123|128|
|    3|  400|700|
|    2|  128|400|
+-----+-----+---+

Notice the values are not in order. This is because Spark has partitioned the dataset and processed the value rows without being told to assign any significance to row ordering. Should you require time- or value-sorted output, you can of course just sort it in the usual way.
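
For example, a one-line sketch (my addition) reusing startEndsInColumns from the code above:

// Sort by value (or by start for time order) before displaying
startEndsInColumns.orderBy("value").show();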
