Reading from an External Table vs. Loading Data and Reading from It in BigQuery
I need to get data (CSV format) from GCS into BigQuery and then run ETL on it to produce results. The format of the incoming CSV is not fixed and may change subtly from file to file. Would it be better to create temporary external tables that read data directly from GCS and process those, or to load the data into a staging table in BigQuery and process from there? I am trying to understand which design is better in terms of execution efficiency. Are there drawbacks to either approach?
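For context, the two designs boil down to two SQL statements. This is a minimal Python sketch that just builds them (the dataset, table, and bucket names are hypothetical); the comments summarize the usual efficiency trade-off:

```python
def external_table_ddl(table, uris):
    """DDL for querying the CSVs in place from GCS.

    Nothing is stored in BigQuery, so every query re-reads the files
    from GCS; repeated ETL passes pay the full scan each time.
    """
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE {table} "
        f"OPTIONS (format = 'CSV', uris = [{uri_list}])"
    )


def load_statement(table, uris):
    """LOAD DATA pulls the CSVs into native BigQuery storage once.

    Subsequent queries run against columnar storage and benefit from
    caching, partitioning, and clustering, at the cost of the one-time
    load and the extra storage.
    """
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"LOAD DATA INTO {table} "
        f"FROM FILES (format = 'CSV', uris = [{uri_list}])"
    )


print(external_table_ddl("my_dataset.ext_src", ["gs://my-bucket/data/*.csv"]))
print(load_statement("my_dataset.staging", ["gs://my-bucket/data/*.csv"]))
```

If the same files are queried more than once during the ETL, the staging-table route usually wins; an external table avoids the load step when each file is read exactly once.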
Google Cloud Platform has a service called Composer. This is GCP's managed version of Apache Airflow, which is software for managing data pipelines and workflows. Because Composer is a GCP product, it has built-in operators for working with GCS and BigQuery. I would recommend you build your pipeline in Composer.
https://cloud.google.com/composer/
We use Composer with GCS and BigQuery to manage our entire ETL process.
Composer >> Extract raw file from service >> Store raw file to GCS
Composer >> Extract raw file from GCS >> Transform raw file >> Store transformed file to GCS >> Store transformed file to BigQuery
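The second flow above can be expressed as an Airflow DAG. This is a minimal sketch, assuming hypothetical bucket, dataset, and table names; the operators come from Airflow's Google provider package, and the transform SQL is a placeholder for your real ETL:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="gcs_to_bq_etl",           # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load the raw CSVs from GCS into a BigQuery staging table.
    # autodetect=True lets BigQuery infer a schema per run, which
    # helps when the CSV layout drifts slightly between files.
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw_csv",
        bucket="my-bucket",            # hypothetical bucket
        source_objects=["incoming/*.csv"],
        destination_project_dataset_table="my_dataset.staging_raw",
        source_format="CSV",
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )

    # Transform the staged data into the results table with SQL.
    transform = BigQueryInsertJobOperator(
        task_id="transform_staging",
        configuration={
            "query": {
                "query": (
                    "CREATE OR REPLACE TABLE my_dataset.results AS "
                    "SELECT * FROM my_dataset.staging_raw"
                ),
                "useLegacySql": False,
            }
        },
    )

    load_raw >> transform
```

Composer runs this on a schedule, retries failed tasks, and records each run, which covers most of the pipeline-management concerns mentioned here.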
Composer has many additional pipeline-management features that you can take advantage of as your ETLs get more complex.
If I understood correctly, you want to handle exceptions caused by bad entries without interrupting the process.
If that's the case, you want to use Cloud Dataflow with a ParDo that separates out the bad entries and pushes them into Cloud Pub/Sub (or an equivalent) to be dealt with by a separate system.
See the following URL for further information.
https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow
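The routing logic that would live inside such a ParDo can be sketched in plain Python. The three-column schema here is made up for illustration; in a real Beam pipeline this function would sit in a `DoFn` and emit the "bad" records through a tagged output to a dead-letter Pub/Sub topic:

```python
import csv
import io
import json

EXPECTED_COLUMNS = 3  # hypothetical schema: name, qty, price


def parse_row(line):
    """Parse one CSV line; return ("good", record) or ("bad", error_payload).

    In Dataflow, the "good" records would flow on to BigQuery while the
    "bad" payloads would be published to Pub/Sub for separate handling,
    so one malformed row never interrupts the whole job.
    """
    try:
        fields = next(csv.reader(io.StringIO(line)))
        if len(fields) != EXPECTED_COLUMNS:
            raise ValueError(
                f"expected {EXPECTED_COLUMNS} fields, got {len(fields)}"
            )
        record = {
            "name": fields[0],
            "qty": int(fields[1]),
            "price": float(fields[2]),
        }
        return ("good", record)
    except (ValueError, StopIteration) as exc:
        # Keep the raw line alongside the error so the downstream
        # system can inspect or replay it.
        return ("bad", json.dumps({"raw": line, "error": str(exc)}))


print(parse_row("widget,3,9.99"))
print(parse_row("widget,three,9.99"))
```

The key point is that parsing failures become data (routed to a dead-letter stream) rather than exceptions that crash the pipeline.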
Hope this helps.