
How can I retrieve data from multiple BigQuery tables using Apache Beam and BigQueryIO in the best way?

I understand from this thread that using ".fromQuery" is more expensive and slower than ".from", but what can I do if I need to retrieve data from multiple tables?

Currently I'm using an "INNER JOIN" query to do that, but how can I achieve the same result using ".from" (or similar)?

Since you want to combine data from multiple BigQuery tables using Apache Beam's BigQueryIO.Read.from() method, you could read each BigQuery table into its own PCollection and then apply join logic to those PCollections, matching rows on the relevant table columns.

Take a look at this example, which joins two BigQuery tables held in separate PCollections by first transforming the input data into collections of tuples.

The above approach is very similar to the CoGroupByKey transform in the Apache Beam SDK, which is the main mechanism for performing relational joins between PCollections.
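To make the idea concrete, here is a minimal plain-Java sketch of what a CoGroupByKey-style inner join does. It does not use the Beam SDK at all: the class name, the `Grouped` helper, and the sample rows are hypothetical, and in-memory lists stand in for the keyed PCollections you would get from each table. In a real pipeline the grouping step would be done by CoGroupByKey, and the nested loops would live in a DoFn that reads each per-key result.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CoGroupJoinSketch {

    /** Values seen for one key, kept separate per input (hypothetical helper). */
    static final class Grouped {
        final List<String> left = new ArrayList<>();
        final List<String> right = new ArrayList<>();
    }

    /** Groups two keyed inputs by key, keeping the values of each side apart. */
    static Map<String, Grouped> coGroup(List<Map.Entry<String, String>> left,
                                        List<Map.Entry<String, String>> right) {
        Map<String, Grouped> byKey = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : left) {
            byKey.computeIfAbsent(e.getKey(), k -> new Grouped()).left.add(e.getValue());
        }
        for (Map.Entry<String, String> e : right) {
            byKey.computeIfAbsent(e.getKey(), k -> new Grouped()).right.add(e.getValue());
        }
        return byKey;
    }

    /** Inner join: emits one row per (left, right) pair that shares a key. */
    static List<String> innerJoin(List<Map.Entry<String, String>> left,
                                  List<Map.Entry<String, String>> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Grouped> e : coGroup(left, right).entrySet()) {
            for (String l : e.getValue().left) {
                for (String r : e.getValue().right) {
                    out.add(e.getKey() + "," + l + "," + r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows standing in for two tables keyed by a user id.
        List<Map.Entry<String, String>> orders = List.of(
                Map.entry("u1", "order-1"),
                Map.entry("u1", "order-2"),
                Map.entry("u3", "order-3"));
        List<Map.Entry<String, String>> users = List.of(
                Map.entry("u1", "alice"),
                Map.entry("u2", "bob"));
        innerJoin(orders, users).forEach(System.out::println);
    }
}
```

Keys that appear on only one side (u2 and u3 here) produce no output, which is the same behaviour as the INNER JOIN query in the question.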

Read more in this thread about implementing a left join transformation.
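For illustration only, a left join differs from an inner join in one place: a left-side row with no match is still emitted, with a placeholder for the missing right side. The sketch below is plain Java rather than Beam code, and the class name, the `MISSING` placeholder, and the sample rows are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeftJoinSketch {

    // Placeholder written when a left row has no matching right row (assumption).
    static final String MISSING = "null";

    /** Left join: every left row is kept, and unmatched rows get MISSING. */
    static List<String> leftJoin(List<Map.Entry<String, String>> left,
                                 List<Map.Entry<String, String>> right) {
        // Index the right side by key so each left row can look up its matches.
        Map<String, List<String>> rightByKey = new HashMap<>();
        for (Map.Entry<String, String> e : right) {
            rightByKey.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : left) {
            List<String> matches = rightByKey.get(e.getKey());
            if (matches == null) {
                // No match: keep the left row anyway.
                out.add(e.getKey() + "," + e.getValue() + "," + MISSING);
            } else {
                for (String r : matches) {
                    out.add(e.getKey() + "," + e.getValue() + "," + r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows standing in for two tables keyed by a user id.
        List<Map.Entry<String, String>> orders = List.of(
                Map.entry("u1", "order-1"),
                Map.entry("u3", "order-3"));
        List<Map.Entry<String, String>> users = List.of(
                Map.entry("u1", "alice"),
                Map.entry("u2", "bob"));
        leftJoin(orders, users).forEach(System.out::println);
    }
}
```

With a CoGroupByKey result, the equivalent check would be whether the right-side values for a key are empty before emitting the output rows.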
