How to do a cartesian product of two PCollections in Dataflow?

I would like to do a cartesian product of two PCollections. Neither PCollection can fit into memory, so using a side input is not feasible.

My goal is this: I have two datasets. One consists of many small elements; the other of only a few (~10) very large elements. I would like to take the product of these two datasets and then produce key-value objects.

I think CoGroupByKey might work in your situation:

https://cloud.google.com/dataflow/model/group-by-key#join

That's what I did for a similar use case, though mine was probably not constrained by memory (have you tried a larger cluster with bigger machines?):

PCollection<KV<String, TableRow>> inputClassifiedKeyed = inputClassified
        .apply(ParDo.named("Actuals : Keys").of(new ActualsRowToKeyedRow()));

PCollection<KV<String, Iterable<Map<String, String>>>> groupedCategories = p
        [...]
        .apply(GroupByKey.create());

So the collections are keyed by the same key.
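
For reference, a keying DoFn along the lines of ActualsRowToKeyedRow (which is not shown in this answer) might look roughly like the sketch below. The "category_id" field name is purely an assumption for illustration, and the Dataflow 1.x DoFn style of the snippets above is assumed:

// A minimal sketch, assuming the Dataflow 1.x SDK used in the snippets above.
// The real ActualsRowToKeyedRow is not shown; "category_id" is a placeholder
// for whatever field the two datasets actually share.
class ActualsRowToKeyedRow extends DoFn<TableRow, KV<String, TableRow>> {
    @Override
    public void processElement(ProcessContext c) throws Exception {
        TableRow row = c.element();
        // Key each row by the shared field so both collections use the same key space.
        c.output(KV.of((String) row.get("category_id"), row));
    }
}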

Then I declared the tags:

final TupleTag<Iterable<Map<String, String>>> categoryTag = new TupleTag<>();
final TupleTag<TableRow> actualsTag = new TupleTag<>();

Combined them:

PCollection<KV<String, CoGbkResult>> actualCategoriesCombined =
        KeyedPCollectionTuple.of(actualsTag, inputClassifiedKeyed)
                .and(categoryTag, groupedCategories)
                .apply(CoGroupByKey.create());

And in my case the final step: reformatting the results from the tagged groups in the continuous flow:

actualCategoriesCombined.apply(ParDo.named("Actuals : Formatting").of(
    new DoFn<KV<String, CoGbkResult>, TableRow>() {
        @Override
        public void processElement(ProcessContext c) throws Exception {
            KV<String, CoGbkResult> e = c.element();

            Iterable<TableRow> actualTableRows =
                    e.getValue().getAll(actualsTag);
            Iterable<Iterable<Map<String, String>>> categoriesAll =
                    e.getValue().getAll(categoryTag);

            // Some of the actuals do not have categories: take the first
            // category group, if any, and attach it to every row.
            Iterator<Iterable<Map<String, String>>> categoryIterator =
                    categoriesAll.iterator();
            Iterable<Map<String, String>> firstCategory =
                    categoryIterator.hasNext() ? categoryIterator.next() : null;

            for (TableRow row : actualTableRows) {
                if (firstCategory != null) {
                    row.put("advertiser", firstCategory);
                }
                c.output(row);
            }
        }
    }));

Hope this helps. Again, I'm not sure about the in-memory constraints. Please do share the results if you try this approach.

To create the cartesian product, use the Join class from the Apache Beam join-library extension:

import org.apache.beam.sdk.extensions.joinlibrary.Join;

...

// Use Join.fullOuterJoin(final PCollection<KV<K, V1>> leftCollection, final PCollection<KV<K, V2>> rightCollection, final V1 leftNullValue, final V2 rightNullValue)
// with the same key for all rows to create the cartesian product, as shown below:

    public static void process(Pipeline pipeline, DataInputOptions options) {
        PCollection<KV<Integer, CpuItem>> cpuList = pipeline
                .apply("ReadCPUs", TextIO.read().from(options.getInputCpuFile()))
                .apply("Creating Cpu Objects", new CpuItem()).apply("Preprocess Cpu",
                        MapElements
                                .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptor.of(CpuItem.class)))
                                .via((CpuItem e) -> KV.of(0, e)));

        PCollection<KV<Integer, GpuItem>> gpuList = pipeline
                .apply("ReadGPUs", TextIO.read().from(options.getInputGpuFile()))
                .apply("Creating Gpu Objects", new GpuItem()).apply("Preprocess Gpu",
                        MapElements
                                .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptor.of(GpuItem.class)))
                                .via((GpuItem e) -> KV.of(0, e)));

        PCollection<KV<Integer,KV<CpuItem,GpuItem>>>  cartesianProduct = Join.fullOuterJoin(cpuList, gpuList, new CpuItem(), new GpuItem());
        PCollection<String> finalResultCollection = cartesianProduct.apply("Format results", MapElements.into(TypeDescriptors.strings())
                .via((KV<Integer, KV<CpuItem,GpuItem>> e) -> e.getValue().toString()));
        finalResultCollection.apply("Output the results",
                TextIO.write().to("fps.batchproc\\parsed_cpus").withSuffix(".log"));
        pipeline.run();
    }

In the code above, on this line:

...
        .via((CpuItem e) -> KV.of(0, e)));
...

I key every row of the input data with the constant 0. As a result, every row from one collection is matched with every row from the other, which is equivalent to a SQL JOIN without a WHERE clause.
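
For a self-contained illustration of the same constant-key trick, here is a minimal sketch. It uses Join.innerJoin instead of fullOuterJoin, so no null placeholder values are needed; with a single shared key, every element of one collection is paired with every element of the other. The class name, the example element values, and the output path are placeholders, not part of the original pipeline:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.joinlibrary.Join;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class CartesianProductSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Key every element of both collections with the same constant key (0).
        PCollection<KV<Integer, String>> left = p
                .apply("CreateLeft", Create.of("a", "b", "c"))
                .apply("KeyLeft", MapElements
                        .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
                        .via((String s) -> KV.of(0, s)));

        PCollection<KV<Integer, String>> right = p
                .apply("CreateRight", Create.of("x", "y"))
                .apply("KeyRight", MapElements
                        .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
                        .via((String s) -> KV.of(0, s)));

        // With a single shared key, the join yields every pair: 3 x 2 = 6 results.
        PCollection<KV<Integer, KV<String, String>>> product = Join.innerJoin(left, right);

        // Format each pair as "left,right" and write the results out.
        product
                .apply("FormatPairs", MapElements.into(TypeDescriptors.strings())
                        .via((KV<Integer, KV<String, String>> kv) ->
                                kv.getValue().getKey() + "," + kv.getValue().getValue()))
                .apply("WritePairs", TextIO.write().to("cartesian_pairs").withSuffix(".txt"));

        p.run().waitUntilFinish();
    }
}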
