Pull JSON from Google Cloud Storage, convert to a pandas DataFrame, and write to Google BigQuery

Question

Summary: differing types when appending a pandas dataframe to BigQuery are causing problems in a daily ETL process.

I am working on a straightforward ETL with Airflow: pull data from an API daily, back that raw data up in JSON files in Google Cloud Storage (GCS), and then append the data from GCS into a BigQuery database. I am doing okay with the extract part of the ETL, calling the API and saving the results of each API call (which will be a row in the database table) as its own JSON object in GCS. For a table in BigQuery with 1K rows, I first create and save 1K separate objects in a GCS bucket, each one the result of a single API call.

I am now struggling with the load part of the ETL. So far, I have written the following script to do the transfer from GCS to BQ:

# load libraries, connect to google
from google.cloud import storage
import os
import gcsfs
import json
import pandas as pd  # needed for the pd.DataFrame calls below
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/my/credentials'

# transfer data
def load_directory_to_bq():

    # get list of filenames from GCS directory
    client = storage.Client()
    files = []
    blobs = client.list_blobs('my-gcs-bucket', prefix='gcs-path-to-files')
    for blob in blobs:
        files.append(f'my-gcs-bucket/{blob.name}')
    

    # approach A: This loop pulls json, converts into df, writes to BigQuery, each 1 file at a time
    fs = gcsfs.GCSFileSystem() # GCP's Google Cloud Storage (GCS) File System (FS)
    for file in files:
        with fs.open(file, 'r') as f:
            gcs_data = json.loads(f.read())
            data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
            this_df = pd.DataFrame(data)
            pd.DataFrame.to_gbq(this_df, 'my-bq-tablename', project_id='my-gcp-project-id', if_exists='append')


    # approach B: This loop reads all the files, creates 1 large dataframe, and does 1 large insert into BigQuery
    frames = []
    fs = gcsfs.GCSFileSystem() # GCP's Google Cloud Storage (GCS) File System (FS)
    for file in files:
        with fs.open(file, 'r') as f:
            gcs_data = json.loads(f.read())
            data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
            frames.append(pd.DataFrame(data))

    # DataFrame.append was removed in pandas 2.0; pd.concat combines all frames in one step
    output_df = pd.concat(frames, ignore_index=True)
    pd.DataFrame.to_gbq(output_df, 'my-bq-tablename', project_id='my-gcp-project-id', if_exists='append')

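The 1K objects in GCS are all similar, but they do not always have exactly the same structure:

nearly all the same keys
nearly always the same "type" for each key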

However, for some JSON objects, the "type" of the same key can differ between objects. When loaded into python as a 1-row pandas dataframe, the same key key1 may be a float or an integer depending on the value. Also, sometimes a key is missing from an object, or its value is null, which can confuse the "type" and cause problems when using the to_gbq function.
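To make the type drift concrete, here is a minimal sketch using invented one-row payloads. Only the key name key1 comes from the description above; the values are made up for illustration.

```python
import pandas as pd

# Three invented payloads for the same key, each loaded as a 1-row DataFrame.
int_row = pd.DataFrame([{"key1": 3}])
float_row = pd.DataFrame([{"key1": 3.5}])
null_row = pd.DataFrame([{"key1": None}])

# pandas infers a different dtype for each of them.
dtypes = {
    "int_row": str(int_row["key1"].dtype),
    "float_row": str(float_row["key1"].dtype),
    "null_row": str(null_row["key1"].dtype),
}
print(dtypes)

# Combining the integer and float rows upcasts the column to one float dtype.
combined = pd.concat([int_row, float_row], ignore_index=True)
print(combined["key1"].dtype)
```

Each one-row frame carries its own inferred dtype, while the combined frame settles on a single dtype for whatever rows happen to be in that batch.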

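Using approach A above, the first time an object / pandas DF has a different type, the following error is thrown: Please verify that the structure and data types in the DataFrame match the schema of the destination table. Approach A also seems inefficient, because it calls to_gbq for each of the 1K rows, and each call takes 2-3 seconds.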

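Using approach B seems to solve the differing "types" problem, because pandas reconciles the different "types" when it combines the dataframes. As a result, I get 1 dataframe and can append it to BigQuery. However, I am still concerned that new data I need to append in the future may not match the types already in the existing table. After all, I am not querying the old table in BigQuery, appending the new data to it, and then recreating the table. I am simply appending new rows, and I worry that a batch in which one of the keys has a different "type" will cause an error and break my pipeline.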

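In theory, approach A is nice, because an approach that can append any individual row to the table with to_gbq without error is a good one. But it requires ensuring that every row has the same keys / types. With approach B, I don't think it is good that python automatically merges differing types into one type for the table, as this seems likely to cause problems with new data.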

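I am weighing what the best approach would be here. Since both are Google products, going from GCS to BQ should be straightforward, but imperfect data makes it a bit harder. In particular, should I define an explicit table schema somewhere for each different BQ table, and write a python function that ensures the right types / converts wrong types to the right types? Should I recreate the table in BQ each time? Should I avoid Python altogether and transfer from GCS to BQ another way?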

Answer 1

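About your approaches A and B, I have the following considerations:

If the requests are slow and you have a large number of rows, approach B will definitely be faster.
I don't know your data volume, but keep in mind that with a large amount of data you must watch your machine's capacity, to avoid poor performance and errors.
If your process runs only once a day, the time spent inserting all the data into the table may not be a problem at all.
As you said, approach B can avoid schema problems, but there is no guarantee.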

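Given that, I would suggest the following actions.

For keys that may be missing from a file (or may be NULL), set the corresponding field in your BigQuery table to NULLABLE.
With either approach A or B, make sure the Dataframe has the right types by using some function to convert your Dataframe columns. You can change the type of Dataframe columns with, for example, df.astype({"key1": float, "key2": int, [...]}), as described in the pandas reference for DataFrame.astype.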
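The two suggestions above could be combined roughly as follows. This is only a sketch under assumptions: the column names, the TARGET_DTYPES mapping, and the coerce_to_schema helper are hypothetical, and the nullable pandas dtypes ("Int64", "string") are one possible way to keep missing values intact rather than something the answer prescribes.

```python
import pandas as pd

# Hypothetical target schema: column name -> pandas dtype.
# Nullable extension dtypes ("Int64", "string") keep missing values as <NA>.
TARGET_DTYPES = {"key1": "float64", "key2": "Int64", "key3": "string"}


def coerce_to_schema(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Return a copy of df with exactly the columns in dtypes, cast to those dtypes."""
    # reindex adds any missing column (filled with NaN) and drops unexpected ones.
    out = df.reindex(columns=list(dtypes))
    return out.astype(dtypes)


rows = [
    {"key1": 1, "key2": 10, "key3": "a"},  # key1 arrives as an integer
    {"key1": 2.5, "key3": None},           # key2 is missing, key3 is null
]

# Every 1-row frame ends up with the same columns and dtypes,
# so it no longer matters whether rows are sent one at a time or in a batch.
frames = [coerce_to_schema(pd.DataFrame([row]), TARGET_DTYPES) for row in rows]
for frame in frames:
    print(frame.dtypes.to_dict())
```

A value that cannot be cast (for example a non-numeric string in key1) would make astype raise, which surfaces the bad row before anything is sent to BigQuery.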

Answer 2

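Well, you are actually asking about the transformation stage of your ETL, because the load is evidently done by the pandas.DataFrame.to_gbq() method that you already use.

Let's look at the ETL flow as a whole, as you describe it:

Source: API -> GCS -> Pandas DataFrame -> Destination: GBQ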

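Note:

Which data transformations do you perform between the API and GCS?

In reality, however, there are 2 ETL flows here:

Source: API -> ?? -> Destination: GCS (JSON objects)

Source: GCS (JSON objects) -> Pandas DataFrame -> Destination: GBQ (table)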

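The root cause of the varying data format is your API, because it returns JSON as its response, and JSON is a schemaless object. Naturally, that variation propagates into your GCS objects. On the other side, as the destination, you have a GBQ table, which has a strict schema from the moment it is created; afterwards that schema can only be changed in limited ways (for example by adding new columns or relaxing a REQUIRED column to NULLABLE).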

So, to load data from a REST API into GBQ effectively, you can follow these ideas:

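JSON is a nested data structure, and a table is a flat one. So the task is to transform the first into the second.
Solve this by examining your API response object and defining:
- the widest possible set of fields that can be normalized into a flat table schema, that is, as if all optional fields were present at once.
- the arrays in the JSON that are complex objects in their own right and that you really need to extract and load. Apply the same flattening to them.
Once you understand these flat schemas, plan to create GBQ tables with all fields NULLABLE (a separate table for each object you will actually extract).
If you use a Pandas DataFrame for the transformation, then:
- define the dtypes of your columns explicitly. This avoids the problems that arise when pandas infers dtypes from whatever data happens to arrive. Pay attention to the pandas-gbq documentation on this point.
- arrays are transformed naturally into a DataFrame, and you then load all records in one GBQ API call.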
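One way to keep the flat schema in a single place is to derive the BigQuery column list from the same dtype mapping that is used on the pandas side. The sketch below is an illustration only: the column names, the COLUMN_DTYPES and BQ_TYPES mappings, and the to_table_schema helper are hypothetical. It relies on the table_schema argument of pandas-gbq's to_gbq, which takes a list of dicts with name and type keys; the actual upload is left commented out because it needs credentials and a real destination table.

```python
# Hypothetical flat schema: the widest set of columns the API can return,
# mapped to the pandas dtype each column should have.
COLUMN_DTYPES = {
    "id": "Int64",
    "amount": "float64",
    "note": "string",
}

# Assumed mapping from those pandas dtypes to BigQuery column types.
BQ_TYPES = {
    "Int64": "INTEGER",
    "float64": "FLOAT",
    "string": "STRING",
}


def to_table_schema(column_dtypes):
    """Build the list-of-dicts form that pandas-gbq accepts as table_schema."""
    return [
        {"name": name, "type": BQ_TYPES[dtype]}
        for name, dtype in column_dtypes.items()
    ]


table_schema = to_table_schema(COLUMN_DTYPES)
print(table_schema)

# Not executed here because it needs credentials and a real destination:
# import pandas_gbq
# pandas_gbq.to_gbq(df, "my_dataset.my_table", project_id="my-gcp-project-id",
#                   if_exists="append", table_schema=table_schema)
```

BigQuery treats a field with no explicit mode as NULLABLE, which matches the all-NULLABLE tables suggested above.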

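Also, you can reconsider the ETL flow.

Currently, you say, GCS serves to:

(a) keep a backup of the raw data
(b) act as the source of truth for the raw data if something goes wrong in BQ or elsewhere
(c) avoid having to make the same API call twice if something goes wrong before the upload to BQ

All of this can still be achieved when you load the data into GCS and GBQ in parallel, and you can do it through one common transformation stage.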

Source: API -> Pandas DataFrame
    1. |-> Destination: GBQ (table)
    2. |-> Destination: GCS (objects)

With a Pandas DataFrame you can perform the following transformation stages:

Flatten the nested JSON object into a flat table (DataFrame):

 df = pd.json_normalize(api_response_json_object, 'api_response_nested_json_object', sep='_')
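As a rough illustration of this flattening step, the sketch below runs pd.json_normalize on an invented API response. The payload, the transactions key, and the field names are all made up for the example; only the shape of the call (a record path plus sep='_') mirrors the line above.

```python
import pandas as pd

# Invented API response: a list of nested transaction objects under one key.
api_response = {
    "account": {"id": 7},
    "transactions": [
        {"id": 1, "amount": {"value": 9.5, "currency": "USD"}},
        {"id": 2, "amount": {"value": 3, "currency": "EUR"}},
    ],
}

# record_path selects the array to turn into rows;
# nested dicts inside each record become columns joined with sep.
df = pd.json_normalize(api_response, record_path="transactions", sep="_")
print(df)
```

Each element of the array becomes one row, and the nested amount object is spread into the amount_value and amount_currency columns.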

Force the field data types:

import pandas as pd

def force_df_schema(df, columns_list, columns_dtypes):
    # reindex keeps exactly the expected columns; missing ones are added as NaN
    df = df.reindex(columns_list, axis="columns")
    df = df.astype(columns_dtypes)
    return df

API_TRANSACTION_OBJECT_COLUMNS = ['c1', 'c2', 'c3', 'c4']
API_TRANSACTION_OBJECT_COLUMNS_DTYPES = {
    'c1': 'object',
    'c2': 'datetime64[ns]',
    'c3': 'float64',
    'c4': 'Int64',  # nullable integer; a plain 'int' cast fails when a value is missing
}

# Let's say this call returns JSON with, for example, a nested
# {transaction} structure, which we need to extract, transform and load
api_response_json_object = api.call()  # placeholder for your own API client
df = pd.json_normalize(api_response_json_object, 'api_response_nested_json_object', sep='_')
df = force_df_schema(df, API_TRANSACTION_OBJECT_COLUMNS, API_TRANSACTION_OBJECT_COLUMNS_DTYPES)

Load into the destination storage:

To GBQ, just as you already do:

 ```
 pd.DataFrame.to_gbq(df, 'bq-tablename', project_id='gcp-project-id', if_exists='append') 
 #also this can create the initial GBQ table,
 #types will be inferred as mentioned in the pandas-gbq docs above.
 ```

And to GCS, also just as you already do.

Answer 1
1, accepted, 2020-07-21 07:13:33

Answer 2
0, 2020-09-19 14:13:19
