
How to perform a fast join in BigQuery with Apache Beam

According to Beam's programming guide, and to many threads, a join can be achieved with CoGroupByKey or KeyedPCollectionTuple (cookbook).

No one discusses the performance of these kinds of transformations.

My flow should be very simple: read a batch of rows from a BQ table (TableRow) and join (or "enrich") them with values from another BQ table on the same key. The final output should also be of type TableRow.
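For intuition, CoGroupByKey groups both keyed collections by key and hands each key the list of matching values from each side; a join then pairs those lists. The following is a pure-Python sketch of that semantics (illustrative only, not the Beam API; the field names and sample rows are assumptions):

```python
from collections import defaultdict

def co_group_by_key(left, right):
    """Group two keyed datasets the way Beam's CoGroupByKey does:
    for each key, collect all values seen on each input side."""
    grouped = defaultdict(lambda: {"left": [], "right": []})
    for key, value in left:
        grouped[key]["left"].append(value)
    for key, value in right:
        grouped[key]["right"].append(value)
    return dict(grouped)

def inner_join(left, right):
    """Emit one merged record per matching (left, right) pair."""
    for key, sides in co_group_by_key(left, right).items():
        for l in sides["left"]:
            for r in sides["right"]:
                yield key, {**l, **r}

# Hypothetical keyed rows standing in for two BQ tables:
orders = [("u1", {"order_id": 1}), ("u2", {"order_id": 2})]
users = [("u1", {"name": "Ann"}), ("u2", {"name": "Bob"})]
joined = list(inner_join(orders, users))
```

In a real pipeline the two keyed PCollections would come from two BigQueryIO reads, and the nested loop above is what your DoFn does over each CoGbkResult.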

I want to understand the best practice for joining two BQ tables in Beam.

For example, I could create a view in BQ that replaces this whole pipeline and performs a more efficient join, but I want to handle all the logic in code.

What happens under the hood when a join operation is executed?

Would DirectRunner perform n queries against the second BQ table to join the pipeline batch item by item, or is Beam smart enough to aggregate them into one query for the whole batch?

Does the Google DataflowRunner work differently?

How can I measure the performance of this pipeline other than by checking the running time?

TTBOMK, you don't want to write full SQL logic (for instance, WHERE clauses) in code. Beam, or really any code-based join, will struggle on substantial data kept in BQ. Any such "enrichment" should really be done by the underlying data-crunching engine, be it SQL on top of BQ or Spark on top of RDDs/DataFrames/etc.
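That said, when the enrichment table is small enough to fit in memory, a common code-level middle ground is to load it once and broadcast it as a lookup map (Beam's side-input pattern), avoiding both a shuffle and per-element queries. A pure-Python sketch of that pattern (illustrative; the `user_id` key and sample data are assumptions, not from the question):

```python
def enrich(rows, lookup):
    """Enrich each row with fields from a pre-loaded lookup table,
    mimicking a side input: the lookup is read once up front,
    never queried per element. Rows with no match pass through."""
    for row in rows:
        extra = lookup.get(row["user_id"], {})
        yield {**row, **extra}

# Hypothetical small dimension table, loaded once into memory:
lookup = {"u1": {"country": "US"}}
rows = [
    {"user_id": "u1", "order_id": 1},
    {"user_id": "u3", "order_id": 3},
]
enriched = list(enrich(rows, lookup))
```

In Beam this corresponds to reading the small table into a map-valued side input and doing the `dict.get` inside a DoFn over the main collection.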

Note that this is less suitable for streaming and more for batch flows. If you want to go the pure streaming route, you should use fast DBs suited to your domain and avoid OLAP-style (true columnar) DBs: BQ has a substantial per-query delay.

Tell us how it goes :)
