How can I order elements in a window in Python Apache Beam?
I noticed that Java Apache Beam has a class groupby.sortbytimestamp. Does Python have that feature implemented yet? If not, what would be the way to sort elements in a window?
I figure I could sort the entire window in a DoFn, but I would like to know if there is a better way.
There is currently no built-in value sorting in Beam (in either Python or Java). Right now, the best option is to sort the values yourself in a DoFn, as you mentioned.
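For the Python side, here is a minimal sketch of that "sort it yourself" approach, in case it helps. It assumes each element is a hypothetical `(event_time_seconds, payload)` tuple keyed by a string, and that one key's values for one window fit in memory. The names `Event`, `sort_group_by_time` and `run` are made up for illustration. `beam.Map` with a plain function is used here as shorthand for a simple DoFn.

```python
from typing import Iterable, List, Tuple

# Hypothetical element shape (an assumption): (event_time_seconds, payload).
Event = Tuple[float, str]


def sort_group_by_time(
    keyed_group: Tuple[str, Iterable[Event]]
) -> Tuple[str, List[Event]]:
    """Sort one key's values for one window by event time.

    sorted() materializes the whole group, so it must fit in memory
    on a single worker.
    """
    key, events = keyed_group
    return key, sorted(events, key=lambda event: event[0])


def run() -> None:
    # Imported here so the helper above stays usable without Beam installed.
    import apache_beam as beam
    from apache_beam.transforms import window

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create([
                ("k", (30.0, "c")),
                ("k", (10.0, "a")),
                ("k", (70.0, "d")),
                ("k", (20.0, "b")),
            ])
            # Use each element's own event time as its Beam timestamp.
            | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1][0]))
            # 60-second fixed windows; the window size is an arbitrary choice.
            | beam.WindowInto(window.FixedWindows(60))
            # After GroupByKey, each output holds one key's values per window.
            | beam.GroupByKey()
            | beam.Map(sort_group_by_time)
            | beam.Map(print)
        )
```

Calling `run()` executes the pipeline on the default local runner. The sort happens after `GroupByKey` because that is the point where all of a key's values for a window are available together.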
Here's a solution (in Java) using a CombineFn. It has the added bonus of deduplicating data, since the accumulator is a TreeSet. You should also make sure the data for a window is small enough to fit in memory on a single worker.
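    // Imports for the enclosing file; DedupAndSortByTime is a nested class.
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.TreeSet;
    import org.apache.beam.sdk.transforms.Combine;

    public static class DedupAndSortByTime
        extends Combine.CombineFn<MarketData, TreeSet<MarketData>, List<MarketData>> {

      // Order by event time, then by order book type. Elements that compare
      // equal on both fields are treated as duplicates by the TreeSet.
      @Override
      public TreeSet<MarketData> createAccumulator() {
        return new TreeSet<>(Comparator
            .comparingLong(MarketData::getEventTime)
            .thenComparing(MarketData::getOrderbookType));
      }

      @Override
      public TreeSet<MarketData> addInput(TreeSet<MarketData> accum, MarketData input) {
        accum.add(input);
        return accum;
      }

      // Merge partial accumulators into a fresh set that uses the same ordering.
      @Override
      public TreeSet<MarketData> mergeAccumulators(Iterable<TreeSet<MarketData>> accums) {
        TreeSet<MarketData> merged = createAccumulator();
        for (TreeSet<MarketData> accum : accums) {
          merged.addAll(accum);
        }
        return merged;
      }

      // The TreeSet iterates in sorted order, so copying it yields a sorted list.
      @Override
      public List<MarketData> extractOutput(TreeSet<MarketData> accum) {
        return new ArrayList<>(accum);
      }
    }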