How to diff two PCollections in Apache Beam
I am new to Apache Beam.
Basically, I have two PCollections, each containing a number of DataRecords, where DataRecord is defined as:
class DataRecord {
private String id;
.......
}
Each record has an id and a number of data fields.
I have two PCollections:
PCollection<DataRecord> p1 = pipeline.apply(...);
PCollection<DataRecord> p2 = pipeline.apply(...);
I need to find out the differences between the two collections.
A DataRecord can be distinguished by its id field only.
What I have done so far is convert the two PCollection instances into PCollection<KV<String, DataRecord>>, so I now have:
PCollection<KV<String, DataRecord>> pkv1
PCollection<KV<String, DataRecord>> pkv2
However, because a PCollection does not allow access by key, I don't know how to diff these two collections the way we normally diff two maps in Java.
Can someone point me in the right direction?
There is actually code that does exactly this in the Beam SQL codebase, but you can implement it more simply for your use case, without the layers of indirection present there:
1. CoGroupByKey to gather elements that have the same id
2. ParDo on the results to filter for elements only appearing in pkv1
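The CoGroupByKey approach could look roughly like the sketch below. It is only an illustration under a few assumptions: the DataRecord class here is a minimal hypothetical stand-in (the real class has more fields), the class name CoGroupDiff and the method name onlyInFirst are made up for this example, and DataRecord is made Serializable so that Beam can fall back to a serializable coder.

```java
import java.io.Serializable;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class CoGroupDiff {

  // Hypothetical stand-in for the DataRecord class in the question.
  public static class DataRecord implements Serializable {
    public final String id;

    public DataRecord(String id) {
      this.id = id;
    }
  }

  // Returns the records whose id occurs in pkv1 but not in pkv2.
  public static PCollection<KV<String, DataRecord>> onlyInFirst(
      PCollection<KV<String, DataRecord>> pkv1,
      PCollection<KV<String, DataRecord>> pkv2) {
    // Tags identify which input each grouped value came from.
    final TupleTag<DataRecord> tag1 = new TupleTag<>();
    final TupleTag<DataRecord> tag2 = new TupleTag<>();

    // Step 1: gather the values from both inputs that share an id.
    PCollection<KV<String, CoGbkResult>> grouped =
        KeyedPCollectionTuple.of(tag1, pkv1)
            .and(tag2, pkv2)
            .apply("GroupById", CoGroupByKey.<String>create());

    // Step 2: keep an id only when pkv2 contributed nothing for it.
    return grouped.apply(
        "KeepOnlyInFirst",
        ParDo.of(
            new DoFn<KV<String, CoGbkResult>, KV<String, DataRecord>>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                CoGbkResult result = c.element().getValue();
                if (!result.getAll(tag2).iterator().hasNext()) {
                  for (DataRecord record : result.getAll(tag1)) {
                    c.output(KV.of(c.element().getKey(), record));
                  }
                }
              }
            }));
  }
}
```

Swapping the roles of tag1 and tag2 in the ParDo would give the records only in pkv2, and checking that both iterables are non-empty would give the ids present in both.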
The most efficient implementation will depend on the sizes of your collections and how many elements are likely to have hits. Another algorithm to try is:
1. View.asMap() to produce a lookup table out of pkv2
2. ParDo over pkv1, reading the map as a side input and filtering out elements that show up in the map
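The side-input variant might be sketched as follows. Again this is a sketch rather than a definitive implementation: DataRecord is a minimal hypothetical stand-in, and the names SideInputDiff and onlyInFirst are invented for the example. It also assumes that every id appears at most once in pkv2, which View.asMap() requires, and that pkv2 is small enough to be used comfortably as a side input.

```java
import java.io.Serializable;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputDiff {

  // Hypothetical stand-in for the DataRecord class in the question.
  public static class DataRecord implements Serializable {
    public final String id;

    public DataRecord(String id) {
      this.id = id;
    }
  }

  // Returns the elements of pkv1 whose id does not occur in pkv2.
  public static PCollection<KV<String, DataRecord>> onlyInFirst(
      PCollection<KV<String, DataRecord>> pkv1,
      PCollection<KV<String, DataRecord>> pkv2) {
    // Step 1: turn pkv2 into a lookup table keyed by id.
    final PCollectionView<Map<String, DataRecord>> lookup =
        pkv2.apply("AsLookupTable", View.<String, DataRecord>asMap());

    // Step 2: read the table as a side input and drop ids found in it.
    return pkv1.apply(
        "DropIdsInLookup",
        ParDo.of(
                new DoFn<KV<String, DataRecord>, KV<String, DataRecord>>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    Map<String, DataRecord> seen = c.sideInput(lookup);
                    if (!seen.containsKey(c.element().getKey())) {
                      c.output(c.element());
                    }
                  }
                })
            .withSideInputs(lookup));
  }
}
```

If ids can repeat within pkv2, View.asMultimap() is the variant to look at instead, since View.asMap() fails on duplicate keys.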