Can I sort the items in an Apache beam PCollection using python?

I need to perform an operation (a transform) that relies on the items being sorted. But so far, I cannot find any trace of a sorting mechanism in Apache Beam.

My use case is not a live stream. I understand that it is pointless to talk about sorting when the data is live and/or infinite; this is an operation on an offline (bounded) dataset.

Is this possible?

Apparently, this is impossible. At least, so far I could not find any way of doing it. And it follows logically: since Beam supports stream processing and batch processing alike, and globally sorting an unbounded stream is impossible, Beam does not offer a global sort at all.

Still, there may be use cases that seem to rely on sorting but can be implemented without actually sorting the items. Mine was one of those.

To expand on my use case: I wanted to find the nth item in the list in order to implement bucketization. For example, if I want to split my dataset into 4 bins and there are 100 items in total, I need the 1st, 25th, 50th, 75th, and 100th items of the sorted list, so that all bins hold the same number of items.
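The boundary ranks described above can be sketched in plain Python (no Beam). `boundary_ranks` is a hypothetical helper, assuming the item count divides evenly into the bins:

```python
def boundary_ranks(n_items, n_bins):
    """Return the 1-based ranks of the items that delimit equally sized bins."""
    step = n_items // n_bins
    return [1] + [i * step for i in range(1, n_bins + 1)]

print(boundary_ranks(100, 4))  # [1, 25, 50, 75, 100]
```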

Initially, I thought I would need to sort the list and take those items from it, and since Beam does not support sorting, that seemed impossible. But then I found another way of doing the same thing:
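The trick rests on an identity: the k-th smallest of n items is the single smallest element among the (n − k + 1) largest. Here is a minimal plain-Python sketch of that identity using `heapq` (no Beam required), which is what the `Top.Largest` followed by `Top.Smallest` chain below approximates:

```python
import heapq

def kth_smallest(items, k):
    """k-th smallest (1-based) via the 'smallest of the (n-k+1) largest' identity."""
    n = len(items)
    largest = heapq.nlargest(n - k + 1, items)  # keep the n-k+1 largest items
    return min(largest)                          # their minimum is the k-th smallest

data = list(range(100))
print(kth_smallest(data, 1), kth_smallest(data, 25), kth_smallest(data, 100))  # 0 24 99
```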

import apache_beam as beam


with beam.Pipeline() as p:
    all_items = (
        p
        | 'Create dummy data' >> beam.Create([i for i in range(100)])
    )

    item_1st = (
        all_items
        | '1st item' >> beam.combiners.Top.Smallest(1)
        | 'FlatMap_1 for 1st' >> beam.FlatMap(lambda record: record)
    )

    item_25th = (
        all_items
        | '75 largest items' >> beam.combiners.Top.Largest(75)
        | 'FlatMap_1 for 25' >> beam.FlatMap(lambda record: record)
        | '25th item' >> beam.combiners.Top.Smallest(1)
        | 'FlatMap_2 for 25' >> beam.FlatMap(lambda record: record)
    )

    item_50th = (
        all_items
        | '50 largest items' >> beam.combiners.Top.Largest(50)
        | 'FlatMap_1 for 50' >> beam.FlatMap(lambda record: record)
        | '50th item' >> beam.combiners.Top.Smallest(1)
        | 'FlatMap_2 for 50' >> beam.FlatMap(lambda record: record)
    )

    item_75th = (
        all_items
        | '25 largest items' >> beam.combiners.Top.Largest(25)
        | 'FlatMap_1 for 75' >> beam.FlatMap(lambda record: record)
        | '75th item' >> beam.combiners.Top.Smallest(1)
        | 'FlatMap_2 for 75' >> beam.FlatMap(lambda record: record)
    )

    item_100th = (
        all_items
        | '100th item' >> beam.combiners.Top.Largest(1)
        | 'FlatMap_1 for 100th' >> beam.FlatMap(lambda record: record)
    )

    _ = (
        (item_1st, item_25th, item_50th, item_75th, item_100th)
        | beam.Flatten()
        | 'All bins' >> beam.combiners.ToList()
        | beam.io.WriteToText('data/bins.txt')
    )

This code returns something like this:

[99, 0, 50, 75, 25]

There are a couple of notes to make here. First, as you can see, the final output contains the numbers we were expecting, but in the wrong order; that is because Beam does not guarantee the order of items in its output. Secondly, if you run the code yourself, you might get the same answer in yet another order, because the order of items in a PCollection is nondeterministic.
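Given that nondeterminism, any check of the pipeline's output should compare order-insensitively, for example:

```python
result = [99, 0, 50, 75, 25]    # one possible output of the pipeline above
expected = {0, 25, 50, 75, 99}  # the boundary values, as an order-independent set
assert set(result) == expected
```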

In the end, I just want to point out that the code I provided is not an answer to my original question. The answer is that Beam does not support a global sort. But there might be some other way to achieve what you want to do. Still, if you are sure that sorting is truly necessary in your case, then, unfortunately, Beam is not going to be practical for you.
