
How to perform a fast join in BigQuery with Apache Beam

According to Beam's programming guide, and to many threads, a join can be achieved with CoGroupByKey or KeyedPCollectionTuple (cookbook).

No one discusses the performance of these kinds of transformations.

My flow should be very simple: read a batch of rows from a BQ table (TableRow) and join (or "enrich") them with values from another BQ table on the same key. The final output should also be of type TableRow.
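For intuition, CoGroupByKey groups both keyed collections by key and hands each key the list of matching values from each side; a join then pairs those lists. The following is a pure-Python sketch of that semantics (illustrative only, not the Beam API; the field names and sample rows are assumptions):

```python
from collections import defaultdict

def co_group_by_key(left, right):
    """Group two keyed datasets the way Beam's CoGroupByKey does:
    for each key, collect all values seen on each input side."""
    grouped = defaultdict(lambda: {"left": [], "right": []})
    for key, value in left:
        grouped[key]["left"].append(value)
    for key, value in right:
        grouped[key]["right"].append(value)
    return dict(grouped)

def inner_join(left, right):
    """Emit one merged record per matching (left, right) pair."""
    for key, sides in co_group_by_key(left, right).items():
        for l in sides["left"]:
            for r in sides["right"]:
                yield key, {**l, **r}

# Hypothetical keyed rows standing in for two BQ tables:
orders = [("u1", {"order_id": 1}), ("u2", {"order_id": 2})]
users = [("u1", {"name": "Ann"}), ("u2", {"name": "Bob"})]
joined = list(inner_join(orders, users))
```

In a real pipeline the two keyed PCollections would come from two BigQueryIO reads, and the nested loop above is what your DoFn does over each CoGbkResult.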

I want to understand the best practice for joining two BQ tables in Beam.

For example, I could create a view in BQ that replaces this whole pipeline and performs a more efficient join, but I want to handle all the logic in code.

What happens under the hood when a join operation is executed?

Would DirectRunner perform n queries against the second BQ table to join the pipeline batch item by item, or is Beam smart enough to aggregate them into one query for the whole batch?

Does the Google DataflowRunner work differently?

How can I measure the performance of this pipeline other than by checking the running time?

TTBOMK, you don't want to write full SQL logic (for instance, WHERE clauses) in code. Beam, or really any code-based join, will struggle on substantial data kept in BQ. Any such "enrichment" should really be done by the underlying data-crunching engine, be it SQL on top of BQ or Spark on top of RDDs/DataFrames/etc.
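That said, when the enrichment table is small enough to fit in memory, a common code-level middle ground is to load it once and broadcast it as a lookup map (Beam's side-input pattern), avoiding both a shuffle and per-element queries. A pure-Python sketch of that pattern (illustrative; the `user_id` key and sample data are assumptions, not from the question):

```python
def enrich(rows, lookup):
    """Enrich each row with fields from a pre-loaded lookup table,
    mimicking a side input: the lookup is read once up front,
    never queried per element. Rows with no match pass through."""
    for row in rows:
        extra = lookup.get(row["user_id"], {})
        yield {**row, **extra}

# Hypothetical small dimension table, loaded once into memory:
lookup = {"u1": {"country": "US"}}
rows = [
    {"user_id": "u1", "order_id": 1},
    {"user_id": "u3", "order_id": 3},
]
enriched = list(enrich(rows, lookup))
```

In Beam this corresponds to reading the small table into a map-valued side input and doing the `dict.get` inside a DoFn over the main collection.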

Note that this is less suitable for streaming and more for batch flows. If you want to go the pure streaming route, you should use fast DBs suited to your domain and avoid OLAP-style (true columnar) DBs: BQ has a substantial per-query delay.

Tell us how it goes :)
