How to save a HyperLogLog field to BigQuery from Apache Beam with the Dataflow runner

I need to save HLL sketches into BigQuery from Apache Beam.

I found an extension library for Apache Beam that computes them:

But I can't find a way to save the sketch itself to BigQuery, so that I can later use it with the merge function and other functions over a sliding time range (see this link).

My code:

.apply("hll-count", Combine.perKey(
        ApproximateDistinct.ApproximateDistinctFn.create(StringUtf8Coder.of())))
.apply("reify-windows", Reify.windows())
.apply("to-table-row", ParDo.of(
        new DoFn<ValueInSingleWindow<KV<GroupByData, HyperLogLogPlus>>, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext processContext) {
                ValueInSingleWindow<KV<GroupByData, HyperLogLogPlus>> windowed = processContext.element();
                KV<GroupByData, HyperLogLogPlus> keyData = windowed.getValue();
                GroupByData key = keyData.getKey();
                HyperLogLogPlus hyperLogLogPlus = keyData.getValue();

                if (key != null) {
                    TableRow tableRow = new TableRow();
                    tableRow.set("country_code", key.countryCode);
                    tableRow.set("event", key.event);
                    tableRow.set("profile", key.profile);
                    tableRow.set("occurrences", hyperLogLogPlus.cardinality());

I only found how to call hyperLogLogPlus.cardinality(), but how can I write the buffer itself to BigQuery so that I can later run the merge function on it?

Using hyperLogLogPlus.getBytes() also didn't work for merging.

Currently this functionality is not supported by Apache Beam, but there are people working on it.

To be specific: the extension library in Apache Beam you mentioned depends on this HyperLogLog implementation. The sketches produced by this library are not compatible with the sketches computed by Google Cloud BigQuery, so it wouldn't make sense to merge them in BigQuery.
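As background on why the sketch, rather than the value of `cardinality()`, has to be stored for later rollups: distinct counts from different windows cannot simply be added, because an element seen in two windows would be counted twice, while mergeable summaries can be combined first and counted afterwards. The minimal sketch below uses exact `HashSet`s as a stand-in for HLL sketches, which is an assumption made purely for illustration (real sketches are compact and approximate); the class name and the user ids are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MergeVsSum {
    /**
     * Unions two summaries. Exact sets are used here only as a stand-in;
     * a merge function on real HLL sketches plays the same role approximately.
     */
    static Set<String> merge(Set<String> a, Set<String> b) {
        Set<String> merged = new HashSet<>(a);
        merged.addAll(b);
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical user ids seen in two adjacent time windows.
        Set<String> window1 = new HashSet<>(Arrays.asList("u1", "u2", "u3"));
        Set<String> window2 = new HashSet<>(Arrays.asList("u2", "u3", "u4"));

        // Adding per-window counts double-counts u2 and u3.
        int summedCounts = window1.size() + window2.size();
        // Merging first counts every user exactly once.
        int mergedCount = merge(window1, window2).size();

        System.out.println("sum of per-window counts: " + summedCounts);
        System.out.println("count after merging:      " + mergedCount);
    }
}
```

Storing only the per-window cardinality corresponds to the first number; storing the sketch keeps the second one recoverable for any combination of windows.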

Since this question was first asked in April 2019, a BigQuery-compatible implementation of HLL sketches has been released, as noted in this GCP blog post, Using HLL++ to speed up count-distinct in massive datasets.

The post has illustrative code snippets showing how to save the HLL sketches to BigQuery as well as to GCS files.

Quoting the relevant parts of the post:

[The Google implementation of HyperLogLog] was added to BigQuery in 2017 and has recently been open sourced and made directly available in Apache Beam as of version 2.16. That means it's available for use in Cloud Dataflow ...

Note: As of version 2.16, there are several implementations of approximate count algorithms. We recommend the use of HllCount.java, especially if you need sketches and/or need compatibility with Google Cloud BigQuery.

From section 3 of the post, "Storing the sketches in BigQuery":

BigQuery supports HLL++ via the HLL_COUNT functions, and BigQuery's sketches are fully compatible with Beam's, so it's easy to interoperate with sketch objects across both systems.

In the example below we will:

1. Pre-aggregate data into sketches in Beam;
2. Store the sketches in BigQuery as byte[] columns along with some metadata about the time interval;
3. Run a rollup query in BigQuery, which can extract the results at interactive speed, thanks to the sketches that were pre-computed in Beam.
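Step 3 above could look roughly like the following. This is a minimal sketch and not the code from the post: it only builds the rollup query text in Java, and the table name, the `sketch` and `window_start` columns, and the helper `buildRollupQuery` are assumptions for illustration (the grouping columns are taken from the question). `HLL_COUNT.MERGE` is the BigQuery function that merges the stored sketches and returns the estimated distinct count.

```java
class HllRollupQuery {
    // Hypothetical table name; replace with your own project, dataset and table.
    static final String TABLE = "my_project.my_dataset.hll_sketches";

    /**
     * Builds a rollup query that merges the stored sketches per group
     * for all windows starting in [fromTs, toTs).
     * Values are concatenated for readability only; real code should
     * prefer query parameters over string concatenation.
     */
    static String buildRollupQuery(String table, String fromTs, String toTs) {
        return String.join("\n",
            "SELECT country_code, event, profile,",
            "       HLL_COUNT.MERGE(sketch) AS approx_distinct",
            "FROM `" + table + "`",
            "WHERE window_start >= TIMESTAMP('" + fromTs + "')",
            "  AND window_start <  TIMESTAMP('" + toTs + "')",
            "GROUP BY country_code, event, profile");
    }

    public static void main(String[] args) {
        // Roll up one week of pre-computed sketches (hypothetical dates).
        System.out.println(
            buildRollupQuery(TABLE, "2019-04-01 00:00:00", "2019-04-08 00:00:00"));
    }
}
```

If a merged sketch is needed instead of a number, for example to store a coarser rollup, `HLL_COUNT.MERGE_PARTIAL` returns the merged sketch and `HLL_COUNT.EXTRACT` reads the estimate from a single sketch.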
