How to diff two PCollections in Apache Beam

I am new to Apache Beam.

Basically, I have two PCollections, each containing a number of DataRecords. DataRecord is defined as:

class DataRecord {
    private String id;
    .......
}

Each record has an id and a number of data fields.

I have two PCollections:

PCollection<DataRecord> p1 = pipeline.apply(...);
PCollection<DataRecord> p2 = pipeline.apply(...);

I need to find out:

  • DataRecords that exist in p1, but not in p2
  • DataRecords that exist in p2, but not in p1

A DataRecord is identified by its id field alone.

What I have done so far is convert the two PCollection instances into PCollection<KV<String, DataRecord>>, keyed by id. I now have:

PCollection<KV<String, DataRecord>> pkv1;
PCollection<KV<String, DataRecord>> pkv2;

However, because PCollection does not allow access by key, I don't know how to diff these two maps like we normally do in Java.

Can someone point me in the right direction?

One approach is to group both collections by id:

  • use CoGroupByKey to gather elements that have the same id
  • use ParDo on the results to keep the elements that appear only in pkv1 (and, with the roles swapped, only in pkv2)

There is actually code that does exactly this in the Beam SQL codebase, but you can do it more simply for your use case, without so much indirection.
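
The two steps above can be sketched roughly as follows. This is a minimal sketch, not the Beam SQL implementation: the class name DiffByCoGroupByKey, the method onlyInLeft, and its name parameter are hypothetical, and the method is generic over the value type so it does not depend on the elided fields of DataRecord.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Hypothetical helper class, introduced only for this illustration.
public class DiffByCoGroupByKey {

  /** Returns the elements of {@code left} whose key has no match in {@code right}. */
  public static <V> PCollection<KV<String, V>> onlyInLeft(
      String name, PCollection<KV<String, V>> left, PCollection<KV<String, V>> right) {
    // Tags let us tell the two inputs apart inside the grouped result.
    final TupleTag<V> leftTag = new TupleTag<>();
    final TupleTag<V> rightTag = new TupleTag<>();

    // Step 1: gather all values that share an id, from both collections.
    PCollection<KV<String, CoGbkResult>> grouped =
        KeyedPCollectionTuple.of(leftTag, left)
            .and(rightTag, right)
            .apply(name + "/CoGroupByKey", CoGroupByKey.<String>create());

    // Step 2: keep only ids that have no value on the right-hand side.
    return grouped
        .apply(
            name + "/KeepLeftOnly",
            ParDo.of(
                new DoFn<KV<String, CoGbkResult>, KV<String, V>>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    CoGbkResult result = c.element().getValue();
                    boolean inRight = result.getAll(rightTag).iterator().hasNext();
                    if (!inRight) {
                      for (V value : result.getAll(leftTag)) {
                        c.output(KV.of(c.element().getKey(), value));
                      }
                    }
                  }
                }))
        // The output has the same element type as the left input, so reuse its coder.
        .setCoder(left.getCoder());
  }
}
```

Under these assumptions, onlyInLeft("OnlyInP1", pkv1, pkv2) would give the records present only in p1, and onlyInLeft("OnlyInP2", pkv2, pkv1) the records present only in p2. Calling it twice groups the data twice; a single ParDo with two tagged outputs over one CoGroupByKey result would avoid that, at the cost of slightly more code.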

The most efficient implementation will depend on the sizes of your collections and on how many ids are likely to match. Another algorithm to try is:

  • use View.asMap() to produce a lookup table out of pkv2
  • use ParDo over pkv1, reading the map as a side input, and filter out elements whose id shows up in the map
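
A rough sketch of this side-input variant is shown below. The names DiffBySideInput and onlyInLeft are hypothetical. It assumes that the right-hand collection is small enough to be held as a map on each worker and that its ids are unique; View.asMap() fails at runtime on duplicate keys, and View.asMultimap() is the usual alternative if ids may repeat.

```java
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Hypothetical helper class, introduced only for this illustration.
public class DiffBySideInput {

  /** Returns the elements of {@code left} whose key is absent from {@code right}. */
  public static <V> PCollection<KV<String, V>> onlyInLeft(
      String name, PCollection<KV<String, V>> left, PCollection<KV<String, V>> right) {
    // Step 1: turn the right-hand collection into a lookup table keyed by id.
    // Assumption: keys in `right` are unique and the map fits in worker memory.
    final PCollectionView<Map<String, V>> rightView =
        right.apply(name + "/AsMap", View.<String, V>asMap());

    // Step 2: scan the left-hand collection and drop every element whose id is in the map.
    return left
        .apply(
            name + "/FilterMissing",
            ParDo.of(
                    new DoFn<KV<String, V>, KV<String, V>>() {
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        Map<String, V> rightById = c.sideInput(rightView);
                        if (!rightById.containsKey(c.element().getKey())) {
                          c.output(c.element());
                        }
                      }
                    })
                .withSideInputs(rightView))
        // Input and output element types are identical, so reuse the input coder.
        .setCoder(left.getCoder());
  }
}
```

As with the grouping approach, swapping the arguments (pkv2 as left, pkv1 as right) would yield the records that exist only in p2.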
