
Apache Beam count of unique elements

I have an Apache Beam job that ingests data from PubSub and loads it into BigQuery. I transform each PubSub message into a POJO with the fields

id, name, count

Here count means the number of identical (non-unique) elements within a single ingest.

If I read 3 elements from PubSub, two of which are the same, then I need to load 2 elements into BigQuery, and one of them will have count 2.

I wonder how to do this easily in Apache Beam. I tried to do it via DoFn or MapElements, but there I can only process a single element at a time. I also tried to convert each element to a KV and then count, but my coder is non-deterministic.

In a usual Java app I could simply use equals or a Map, but in Apache Beam everything is different.

The simple and right approach would be to use Count.<T>perElement(), like this:

Pipeline p = ...;
PCollection<T> elements = p.apply(...); // read elements
PCollection<KV<T, Long>> elementsCounts =
    elements.apply(Count.<T>perElement());
PCollection<TableRow> results = elementsCounts.apply(ParDo.of(
    new FormatOutputFn()));

Though, right, you need a deterministic coder for the elements to do that. If that's not the case (as I understand from what you said above), you need to add a step before Count that transforms each element into a different representation for which a deterministic coder is available (AvroCoder, for example).

If that's not possible for some reason, another workaround is to calculate a unique hash for every element (the hash value must be deterministic as well), create a KV for every element with the hash as the key and the element as the value, and then use GroupByKey downstream to get the equal values grouped together.
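The hash workaround can be sketched in plain Java as follows. This is only a minimal sketch under assumptions: the class name ElementKeys and the method hashKey are hypothetical, and id and name are assumed to be Strings (the question does not give the field types). The count field is deliberately left out of the hash, since it is the value being computed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: builds a deterministic key from the fields that define equality.
final class ElementKeys {
    private ElementKeys() {}

    /** Returns a hex SHA-256 digest over id and name; equal fields always give the same key. */
    static String hashKey(String id, String name) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            update(digest, id);
            update(digest, name);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Length-prefix each field so that ("ab", "c") and ("a", "bc") hash differently.
    private static void update(MessageDigest digest, String field) {
        byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
        digest.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
        digest.update(bytes);
    }
}
```

Because the key is a plain String, it can be encoded with Beam's StringUtf8Coder, which is deterministic, so a KV keyed by this value is usable with GroupByKey.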

Also, please note that since PubSub is an unbounded source, you need to "window" your input with some windowing strategy other than the default global window (or set a trigger on the global window), because all your group/combine operations are performed per window. Take a look at WindowedWordCount as an example of a solution to a similar problem.
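To see what windowing does to the counts, here is a plain-Java simulation of fixed windows. It is not Beam code, just a sketch of the idea: the class FixedWindowCounts and its methods are hypothetical, elements are represented by String keys, and timestamps are assumed to be non-negative milliseconds. In a real pipeline the equivalent step would be applying Window.into(FixedWindows.of(...)) before Count.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of per-window counting; not part of the Beam API.
final class FixedWindowCounts {
    private FixedWindowCounts() {}

    /** Start of the fixed window containing the timestamp (non-negative timestamps assumed). */
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    /** Counts occurrences per (window start, key); keys[i] arrived at timestampsMillis[i]. */
    static Map<Long, Map<String, Long>> countPerWindow(
            String[] keys, long[] timestampsMillis, long windowSizeMillis) {
        Map<Long, Map<String, Long>> counts = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            long start = windowStart(timestampsMillis[i], windowSizeMillis);
            counts.computeIfAbsent(start, s -> new TreeMap<>())
                  .merge(keys[i], 1L, Long::sum);
        }
        return counts;
    }
}
```

With one-minute windows, three elements arriving in the same minute, two of them equal, yield two entries for that window, one of them with count 2. An equal element arriving a minute later is counted separately in the next window.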
