
Reading from an External Table vs Loading Data and Reading from It in BigQuery

I need to get data (CSV format) from GCS into BigQuery and then perform ETL on it to produce results. The format of the incoming CSV is not fixed and could change subtly with every file. Would it be better to create temporary external tables that read the data directly from GCS and process them, or to load the data into a staging table in BigQuery and process it from there? I am trying to understand which design is better in terms of execution efficiency. Does either approach have drawbacks?
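To make the two options concrete, here is a minimal sketch of the GoogleSQL statements each approach would use. The bucket, dataset, and table names are placeholders, not real resources; an external table re-parses the GCS file on every query, while a one-time load puts the data into native columnar storage, which is usually faster for repeated ETL queries.

```python
# Sketch: the two ways to expose a GCS CSV to BigQuery.
# All resource names below are invented for illustration.

def external_table_ddl(dataset: str, table: str, gcs_uri: str) -> str:
    """DDL for a table that reads the CSV directly from GCS on every query.
    No load step, but each query pays the cost of parsing the file."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{dataset}.{table}`\n"
        "OPTIONS (\n"
        "  format = 'CSV',\n"
        f"  uris = ['{gcs_uri}'],\n"
        "  skip_leading_rows = 1\n"
        ")"
    )

def load_statement(dataset: str, table: str, gcs_uri: str) -> str:
    """Statement that copies the CSV into native BigQuery storage once;
    later queries run against columnar storage instead of raw GCS files."""
    return (
        f"LOAD DATA OVERWRITE `{dataset}.{table}`\n"
        f"FROM FILES (format = 'CSV', uris = ['{gcs_uri}'], "
        "skip_leading_rows = 1)"
    )

print(external_table_ddl("staging", "events_ext", "gs://my-bucket/in/*.csv"))
print(load_statement("staging", "events_stage", "gs://my-bucket/in/*.csv"))
```

Because the CSV layout may drift between files, either statement could also rely on schema autodetection rather than a fixed column list; the staging-table route makes it easier to validate the detected schema before downstream ETL runs.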

Google Cloud Platform has a service called Composer. This is GCP's version of Apache Airflow, which is software for managing data pipelines and workflows. Being a GCP product, Composer has built-in functions for working with GCS and BigQuery. I would recommend building your pipeline in Composer.

https://cloud.google.com/composer/

We use Composer with GCS and BigQuery to manage the entire ETL process.

Composer >> Extract raw file from service >> Store raw file to GCS

Composer >> Extract raw file from GCS >> Transform raw file >> Store transformed file to GCS >> Load transformed file into BigQuery
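The chain above can be sketched as plain functions; in Composer each step would become an Airflow task (for example a GCS-to-BigQuery transfer operator). The column names and sample data here are invented, and the load step is a stand-in for the real BigQuery call.

```python
# Framework-free sketch of the extract -> transform -> load steps above.
import csv
import io

def extract(raw_csv: str) -> list[dict]:
    """Parse the raw CSV pulled from the source service / GCS."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Example transform: normalise column names, drop empty rows."""
    out = []
    for row in rows:
        cleaned = {k.strip().lower(): (v or "").strip()
                   for k, v in row.items()}
        if any(cleaned.values()):
            out.append(cleaned)
    return out

def load(rows: list[dict]) -> int:
    """Stand-in for the BigQuery load step; returns rows 'loaded'."""
    return len(rows)

raw = "Name , Amount\nalice,10\n,\nbob,20\n"
loaded = load(transform(extract(raw)))
print(loaded)  # 2
```

Splitting the work into small, independent tasks like this is what lets Composer retry or backfill a single failed step without re-running the whole pipeline.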

Composer has many additional pipeline-management features that you can take advantage of as your ETLs get more complex.

If I understood correctly, you want to handle exceptions caused by bad entries without interrupting the process.

If that's the case, you want to use Cloud Dataflow and a ParDo to catch the bad entries and route them into Cloud Pub/Sub (or an equivalent) to be dealt with by a separate system.
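The idea is the "dead-letter" pattern: instead of letting one malformed row fail the whole job, the ParDo emits bad rows to a tagged side output that is forwarded to a Pub/Sub topic. The sketch below is a pure-Python illustration of that pattern, not actual Beam code; the row format is invented.

```python
# Illustrative dead-letter pattern: split input into good rows and
# bad rows instead of failing the pipeline. In Beam, the except
# branch would emit a TaggedOutput routed to a Pub/Sub topic.

def parse_row(line: str) -> dict:
    """Raise on malformed input, as a strict parser would."""
    name, amount = line.split(",")
    return {"name": name.strip(), "amount": int(amount)}

def process(lines: list[str]) -> tuple[list[dict], list[str]]:
    """Return (good, dead_letter) so bad rows never stop the job."""
    good, dead = [], []
    for line in lines:
        try:
            good.append(parse_row(line))
        except (ValueError, TypeError):
            dead.append(line)  # in Beam: side output -> Pub/Sub
    return good, dead

good, dead = process(["alice,10", "not-a-row", "bob,20"])
print(len(good), len(dead))  # 2 1
```

A separate consumer can then inspect the dead-letter queue, fix or discard the rows, and replay them, which fits the question's concern about CSV formats drifting from file to file.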

See the following URL for further information.

https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow

Hope this helps.

