
How to define the start position of a dataset in Apache Flink?

I am trying to implement a kind of window function in Apache Flink. For example, I want to take elements 1-5 and do something with them, then take elements 6-10, and so on.

Currently I have a dataset whose data is read from a CSV file:

DataSet<Tuple2<Double, Double>> csvInput = env
        .readCsvFile(csvpath)
        .includeFields(usedFields)
        .types(Double.class, Double.class);

Now I want a subset with the first 5 elements of this dataset. I might be able to do this with the first function:

DataSet<Tuple2<Double, Double>> subset1 = csvInput.first(5);

But how do I get the next 5 elements? Is there something like a startAt function that I can use? For example, something like this:

DataSet<Tuple2<Double, Double>> subset2 = csvInput.first(5).startAt(6);

I haven't found anything in the Apache Flink Java API. What is the best way to achieve this?

Matthias Sax has given good pointers to the streaming API for windowing. If the application follows the model of streaming analytics, the streaming API is definitely the right way to go.

Here are some more resources on stream windowing: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#window-operators

Windows in the Batch API

It is also possible to apply some form of windowing manually in the Batch API. When applying windows, keep the following in mind:

  • Most operations are parallel. When windowing n elements together, this usually happens independently in each parallel partition.

  • There is no implicit order of elements. Even when reading from a single file in parallel, later sections of the file may be read by a faster parallel reader thread, so records from those later sections arrive earlier. Windowing n elements in arrival order thus simply gives you some arbitrary n elements.
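The two points above can be illustrated with a toy model in plain Java, without Flink. The round-robin split below is only an assumption standing in for "the input ends up in several partitions"; how a real parallel source distributes records is not specified here, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionedWindows {

    // Toy stand-in for a parallel source: deals elements round-robin
    // onto `parallelism` partitions (an assumption for illustration only).
    static List<List<Integer>> partition(List<Integer> elements, int parallelism) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < parallelism; p++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < elements.size(); i++) {
            partitions.get(i % parallelism).add(elements.get(i));
        }
        return partitions;
    }

    // Takes the first n elements of every partition independently,
    // which is what "windowing per parallel partition" amounts to.
    static List<List<Integer>> firstNPerPartition(List<List<Integer>> partitions, int n) {
        List<List<Integer>> windows = new ArrayList<>();
        for (List<Integer> part : partitions) {
            windows.add(new ArrayList<>(part.subList(0, Math.min(n, part.size()))));
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // With two partitions, neither window is "elements 1-5".
        System.out.println(firstNPerPartition(partition(input, 2), 5));
        // prints [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
    }
}
```

With a parallelism of 1 the same code yields the single window [1, 2, 3, 4, 5], which is why the ordered approach restricts the source to one partition.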

Window by Order in the File (non-parallel)

To window by order in a file, you can make the input non-parallel (use setParallelism(1) on the source) and then use mapPartition() to slide the window over the elements.
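The logic that would sit inside such a mapPartition() function can be sketched in plain Java. This is a minimal sketch, not Flink code: the method windowPartition is hypothetical, its shape only mirrors MapPartitionFunction.mapPartition(Iterable<T>, Collector<O>), a java.util.function.Consumer stands in for Flink's Collector, and it produces the non-overlapping windows of 5 from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class OrderedWindows {

    // Reads the partition once, in iteration order, and emits one list
    // per window of `size` consecutive elements.
    static <T> void windowPartition(Iterable<T> values, int size, Consumer<List<T>> out) {
        List<T> window = new ArrayList<>(size);
        for (T value : values) {
            window.add(value);
            if (window.size() == size) {
                out.accept(window);
                window = new ArrayList<>(size);
            }
        }
        // Emit the trailing partial window; remove this if only full windows are wanted.
        if (!window.isEmpty()) {
            out.accept(window);
        }
    }

    public static void main(String[] args) {
        List<Integer> partition = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12);
        List<List<Integer>> windows = new ArrayList<>();
        windowPartition(partition, 5, windows::add);
        System.out.println(windows);
        // prints [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12]]
    }
}
```

The result matches file order only if the partition really contains the records in file order, which is what the non-parallel source is meant to ensure.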

Ordered Window by some value (e.g., a timestamp)

You can window ungrouped data (no key) by sorting a partition (sortPartition().mapPartition()), or window over groups using groupBy(...).sortGroup(...).reduceGroup(...). These operations bring the elements into order with respect to the value you want to window on; your function then slides over the ordered data to form the windows.
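As a rough plain-Java stand-in for the groupBy(...).sortGroup(...).reduceGroup(...) pattern, the sketch below groups records by key, sorts each group by timestamp, and sums consecutive windows of a fixed size. The Event type, the windowSums method, and the choice of a sum as the window function are all assumptions made for illustration; in Flink the grouping and sorting would be done by the framework and only the inner loop would live in the reduce function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortedGroupWindows {

    // Hypothetical record type: a grouping key, a timestamp to order by, and a payload.
    static final class Event {
        final String key;
        final long timestamp;
        final double value;

        Event(String key, long timestamp, double value) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Per key: sort by timestamp, then sum each run of `size` consecutive events.
    static Map<String, List<Double>> windowSums(List<Event> events, int size) {
        // "groupBy": collect events per key.
        Map<String, List<Event>> groups = new TreeMap<>();
        for (Event e : events) {
            groups.computeIfAbsent(e.key, k -> new ArrayList<>()).add(e);
        }
        Map<String, List<Double>> result = new TreeMap<>();
        for (Map.Entry<String, List<Event>> entry : groups.entrySet()) {
            List<Event> group = entry.getValue();
            // "sortGroup": order the group by the value to window on.
            group.sort(Comparator.comparingLong((Event e) -> e.timestamp));
            // "reduceGroup": walk the ordered group window by window.
            List<Double> sums = new ArrayList<>();
            for (int start = 0; start < group.size(); start += size) {
                int end = Math.min(start + size, group.size());
                double sum = 0.0;
                for (Event e : group.subList(start, end)) {
                    sum += e.value;
                }
                sums.add(sum);
            }
            result.put(entry.getKey(), sums);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("a", 3L, 3.0), new Event("b", 2L, 20.0),
                new Event("a", 1L, 1.0), new Event("a", 2L, 2.0),
                new Event("b", 1L, 10.0));
        System.out.println(windowSums(events, 2));
        // prints {a=[3.0, 3.0], b=[30.0]}
    }
}
```

Because each group is sorted before it is windowed, the result does not depend on the order in which the records arrived.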

Some parallel windows (no good semantics)

You can always read in parallel and slide a window over the data using mapPartition(). However, as described above, the parallel execution and the undefined order of elements mean you get some windowed result rather than a predictable one.
