How to diff two PCollections in Apache Beam

I am new to Apache Beam.

Basically, I have two PCollections, each containing a number of DataRecords, where DataRecord is defined as:

class DataRecord {
    private String id;
    .......
}

Each record has an id and a number of data fields.

I have two PCollections:

PCollection<DataRecord> p1 = pipeline.apply(...);
PCollection<DataRecord> p2 = pipeline.apply(...);

I need to find out:

  • DataRecords that exist in p1, but not in p2
  • DataRecords that exist in p2, but not in p1

A DataRecord can be distinguished by its id field only.

What I have done so far is convert the two PCollection instances into PCollection<KV<String, DataRecord>>, so I now have:

PCollection<KV<String, DataRecord>> pkv1
PCollection<KV<String, DataRecord>> pkv2

However, because a PCollection does not allow access by key, I don't know how to diff these two collections the way we normally diff two maps in Java.

Can someone point me in the right direction?

You can implement this fairly simply for your use case:

  • use CoGroupByKey to gather elements that have the same id
  • use ParDo on the results to filter for elements only appearing in pkv1

The Beam SQL codebase actually contains code that does exactly this, but you can do it more simply for your use case, without so much indirection.
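The CoGroupByKey steps above can be sketched roughly as follows. This is a minimal sketch, not the Beam SQL implementation: the class name CoGbkDiff, the tag names, the diff method and the small DataRecord stand-in (only the id is modeled) are all assumptions made for illustration. It also assumes DataRecord is Serializable so that Beam's default coder lookup can handle it; otherwise you would need to set a coder yourself. A multi-output ParDo is used so that both directions of the diff come out of a single pass.

```java
import java.io.Serializable;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class CoGbkDiff {

  // Hypothetical stand-in for the question's DataRecord; only the id is modeled.
  public static class DataRecord implements Serializable {
    private final String id;

    public DataRecord(String id) {
      this.id = id;
    }

    public String getId() {
      return id;
    }
  }

  // Input tags: record which collection each grouped value came from.
  static final TupleTag<DataRecord> IN_1 = new TupleTag<>();
  static final TupleTag<DataRecord> IN_2 = new TupleTag<>();

  // Output tags: anonymous subclasses keep the element type for coder inference.
  static final TupleTag<DataRecord> ONLY_IN_1 = new TupleTag<DataRecord>() {};
  static final TupleTag<DataRecord> ONLY_IN_2 = new TupleTag<DataRecord>() {};

  // Routes each id's records to ONLY_IN_1 or ONLY_IN_2; ids seen on both sides are dropped.
  static class SplitBySideFn extends DoFn<KV<String, CoGbkResult>, DataRecord> {
    @ProcessElement
    public void processElement(
        @Element KV<String, CoGbkResult> element, MultiOutputReceiver out) {
      Iterable<DataRecord> from1 = element.getValue().getAll(IN_1);
      Iterable<DataRecord> from2 = element.getValue().getAll(IN_2);
      boolean in1 = from1.iterator().hasNext();
      boolean in2 = from2.iterator().hasNext();
      if (in1 && !in2) {
        for (DataRecord r : from1) {
          out.get(ONLY_IN_1).output(r);
        }
      } else if (in2 && !in1) {
        for (DataRecord r : from2) {
          out.get(ONLY_IN_2).output(r);
        }
      }
    }
  }

  // Groups both keyed collections by id, then splits the groups by side.
  public static PCollectionTuple diff(
      PCollection<KV<String, DataRecord>> pkv1, PCollection<KV<String, DataRecord>> pkv2) {
    return KeyedPCollectionTuple.of(IN_1, pkv1)
        .and(IN_2, pkv2)
        .apply("GroupById", CoGroupByKey.<String>create())
        .apply(
            "SplitBySide",
            ParDo.of(new SplitBySideFn()).withOutputTags(ONLY_IN_1, TupleTagList.of(ONLY_IN_2)));
  }
}
```

With this sketch, `CoGbkDiff.diff(pkv1, pkv2).get(CoGbkDiff.ONLY_IN_1)` would hold the records whose id appears only in pkv1, and `.get(CoGbkDiff.ONLY_IN_2)` those whose id appears only in pkv2.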

The most efficient implementation will depend on the sizes of your collections and how many elements are likely to have matches. Another algorithm to try is:

  • use View.asMap() to produce a lookup table out of pkv2
  • use ParDo over pkv1, reading the map as a side input, and filter out elements that show up in the map
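The side-input variant above might look something like the sketch below. The class name SideInputDiff, the method name onlyInFirst and the small DataRecord stand-in are assumptions for illustration, and DataRecord is again assumed to be Serializable. Note that View.asMap() expects each key in pkv2 to occur at most once; if ids can repeat, View.asMultimap() is the usual alternative. The lookup map also has to be small enough for workers to use as a side input.

```java
import java.io.Serializable;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputDiff {

  // Hypothetical stand-in for the question's DataRecord; only the id is modeled.
  public static class DataRecord implements Serializable {
    private final String id;

    public DataRecord(String id) {
      this.id = id;
    }

    public String getId() {
      return id;
    }
  }

  // Returns the records of pkv1 whose id does not appear as a key in pkv2.
  public static PCollection<DataRecord> onlyInFirst(
      PCollection<KV<String, DataRecord>> pkv1, PCollection<KV<String, DataRecord>> pkv2) {
    // Materialize pkv2 as a map side input (keys must be unique for asMap).
    final PCollectionView<Map<String, DataRecord>> lookup =
        pkv2.apply("AsLookupMap", View.<String, DataRecord>asMap());

    return pkv1.apply(
        "DropMatches",
        ParDo.of(
                new DoFn<KV<String, DataRecord>, DataRecord>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    Map<String, DataRecord> other = c.sideInput(lookup);
                    // Keep only elements whose id is absent from the other collection.
                    if (!other.containsKey(c.element().getKey())) {
                      c.output(c.element().getValue());
                    }
                  }
                })
            .withSideInputs(lookup));
  }
}
```

Calling `SideInputDiff.onlyInFirst(pkv1, pkv2)` gives the records only in pkv1; swapping the arguments gives the records only in pkv2.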

