
How to perform a fast join in BigQuery with Apache Beam

Original post 2019-06-19 10:03:47 9 1 java/ google-bigquery/ google-cloud-dataflow/ apache-beam

According to Beam's programming guide, and according to many threads, a join can be achieved with CoGroupByKey or KeyedPCollectionTuple (cookbook).

No one talks about the performance of these kinds of transformations.

My flow should be very simple: read a batch of rows from a BQ table (TableRow) and join them with (or "enrich" them by) values from another BQ table on the same key. The final output should also be of type TableRow.
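The flow described above can be modelled in a few lines of plain Java. This is only a sketch of the join semantics (group one side by key once, then merge matching rows), not Beam code and not a statement about how any runner executes a join. `Map<String, Object>` stands in for `TableRow`, and `JoinSketch`, `enrich`, and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a key-based enrichment join. Not Beam code:
// Map<String, Object> is a stand-in for TableRow, and all names are hypothetical.
final class JoinSketch {

    // Left-joins `rows` against `lookup` on `keyField`, copying lookup fields into each row.
    static List<Map<String, Object>> enrich(
            List<Map<String, Object>> rows,
            List<Map<String, Object>> lookup,
            String keyField) {
        // Group the lookup side by key once, instead of scanning it for every row.
        Map<Object, List<Map<String, Object>>> byKey = new HashMap<>();
        for (Map<String, Object> l : lookup) {
            byKey.computeIfAbsent(l.get(keyField), k -> new ArrayList<>()).add(l);
        }

        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            List<Map<String, Object>> matches = byKey.get(row.get(keyField));
            if (matches == null) {
                out.add(new LinkedHashMap<>(row)); // no match: keep the row unchanged
                continue;
            }
            for (Map<String, Object> m : matches) {
                Map<String, Object> merged = new LinkedHashMap<>(row);
                m.forEach(merged::putIfAbsent); // the main row wins on field-name clashes
                out.add(merged);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> orders = List.of(
                Map.of("user_id", 1, "amount", 30),
                Map.of("user_id", 2, "amount", 15));
        List<Map<String, Object>> users = List.of(Map.of("user_id", 1, "country", "IL"));
        System.out.println(enrich(orders, users, "user_id"));
    }
}
```

The point of the model is that the second table is touched once to build the grouping, rather than once per input row. Whether a given pipeline behaves this way depends on how the second table is read and which join transform is used.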

I want to understand the best practice for joining two BQ tables in Beam.

For example, I could create a view in BQ that replaces this whole pipeline and performs the join more efficiently, but I want to handle all the logic in code.

What happens under the hood when a join operation is executed?

Would DirectRunner perform n queries against the second BQ table in order to join the whole pipeline batch (item by item)? Or is Beam smart enough to aggregate them and perform one query for the whole batch?

Does Google's DataflowRunner work in a different way?

How can I check the performance of this pipeline, other than by checking the running time?

1 answer

To the best of my knowledge, you don't want to write full SQL in code, for instance WHERE clauses. Joins in Beam, or really any code-based join, will fail on the substantial amounts of data kept in BQ. So any such "enrichment" should really be done by the underlying data-crunching engine, be it SQL on top of BQ or Spark on top of RDDs/DataFrames/etc.

Please note that this approach is less suitable for streaming and more suitable for batch flows. If you want to go the pure streaming way, you ought to use a fast DB that fits your domain and avoid OLAP-style (true columnar) DBs. BQ has substantial latency per query.

Tell us how it goes :)