I am new to Apache Beam.
Basically, I have two PCollections, each of which contains a number of DataRecords, defined as:
class DataRecord {
private String id;
.......
}
Each record has an id and a number of data fields.
I have two PCollections:
PCollection<DataRecord> p1 = pipeline.apply(...);
PCollection<DataRecord> p2 = pipeline.apply(...);
I need to find out which records differ between the two collections. A DataRecord can be distinguished by its id field only.
What I have done so far is to convert the two PCollection instances into PCollection<KV<String, DataRecord>>, keyed by id. I now have:
PCollection<KV<String, DataRecord>> pkv1
PCollection<KV<String, DataRecord>> pkv2
However, because a PCollection does not allow access by key, I don't know how to diff these two collections the way we normally diff two maps in Java.
Can someone point me in the right direction?
There is actually code that does exactly this in the Beam SQL codebase, but you can implement it more simply for your use case, without the layers of indirection present there:
1. CoGroupByKey to gather elements that have the same id
2. ParDo on the results to filter for elements only appearing in pkv1
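The CoGroupByKey approach could be sketched roughly as follows. This is only an illustration, not the Beam SQL implementation: the class name OnlyInFirst and the method onlyInFirst are hypothetical, and the value type is left generic so that it also works for KV<String, DataRecord>. The output coder is copied from pkv1 because Beam cannot infer a coder for a generic type variable.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Hypothetical helper class, not part of Beam.
public class OnlyInFirst {

  /** Returns the elements of pkv1 whose key does not occur in pkv2. */
  public static <V> PCollection<KV<String, V>> onlyInFirst(
      PCollection<KV<String, V>> pkv1, PCollection<KV<String, V>> pkv2) {
    // Tags identify which input each grouped value came from.
    final TupleTag<V> tag1 = new TupleTag<>();
    final TupleTag<V> tag2 = new TupleTag<>();

    // Step 1: gather all values that share the same id.
    PCollection<KV<String, CoGbkResult>> grouped =
        KeyedPCollectionTuple.of(tag1, pkv1)
            .and(tag2, pkv2)
            .apply(CoGroupByKey.create());

    // Step 2: keep only the ids that have no value on the pkv2 side.
    return grouped
        .apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, KV<String, V>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            CoGbkResult result = c.element().getValue();
            if (!result.getAll(tag2).iterator().hasNext()) {
              for (V value : result.getAll(tag1)) {
                c.output(KV.of(c.element().getKey(), value));
              }
            }
          }
        }))
        .setCoder(pkv1.getCoder());
  }
}
```

Calling OnlyInFirst.onlyInFirst(pkv2, pkv1) with the arguments swapped would give the records that appear only in pkv2. Ids present on both sides could be handled in the same DoFn by comparing the values from the two tags.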
The most efficient implementation will depend on the sizes of your collections and how many elements are likely to have hits. Another algorithm to try is:
1. View.asMap() to produce a lookup table out of pkv2
2. ParDo over pkv1, reading the map as a side input and filtering out elements that show up in the map
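A minimal sketch of the side-input variant, under a few assumptions: the class name OnlyInFirstSideInput and the method onlyInFirst are hypothetical, every id occurs at most once in pkv2 (View.asMap() fails at runtime on duplicate keys, in which case View.asMultimap() is the alternative), and pkv2 is small enough to be used as a side input.

```java
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Hypothetical helper class, not part of Beam.
public class OnlyInFirstSideInput {

  /** Returns the elements of pkv1 whose key does not occur in pkv2. */
  public static <V> PCollection<KV<String, V>> onlyInFirst(
      PCollection<KV<String, V>> pkv1, PCollection<KV<String, V>> pkv2) {
    // Step 1: turn pkv2 into a lookup table (assumes unique ids in pkv2).
    final PCollectionView<Map<String, V>> lookup =
        pkv2.apply(View.<String, V>asMap());

    // Step 2: walk over pkv1 and drop every element whose id is in the table.
    return pkv1
        .apply(ParDo.of(new DoFn<KV<String, V>, KV<String, V>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            Map<String, V> seen = c.sideInput(lookup);
            if (!seen.containsKey(c.element().getKey())) {
              c.output(c.element());
            }
          }
        }).withSideInputs(lookup))
        .setCoder(pkv1.getCoder());
  }
}
```

This version avoids shuffling pkv1, which may help when pkv2 is much smaller than pkv1. When both collections are large, the CoGroupByKey version is likely the safer choice.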