
Load into BigQuery - one column to handle an arbitrary JSON field (empty array, dict with different fields, etc.)

We have the following three JSONs with data that should be loaded in the same table:

{ "name": "tom", "customValues": [] }
{ "name": "joe", "customValues": { "member": "1" } }
{ "name": "joe", "customValues": { "year": "2020", "number": "3" } }

We configure the load with the Python client's bigquery.LoadJobConfig:

job_config = bigquery.LoadJobConfig(
    schema=SCHEMA_MAP.get(bq_table) if not autodetect else None,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE if remove_old else bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=autodetect,
)
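
For reference, here is a minimal sketch of how this config might be submitted; the client, file name, and table ID below are assumptions, not part of the original code:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "project-id.dataset-id.bq_table"  # hypothetical target table

# Submit the newline-delimited JSON file using the job_config above
with open("data.ndjson", "rb") as f:
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete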

SCHEMA_MAP is a dictionary of lists, where each list is the schema for one of our tables. We define our BigQuery schemas in Python using bigquery.SchemaField. If each of the 3 JSONs above were going into 3 different tables, I would define their table schemas as (SCHEMA_MAP is shown in the sketch after these definitions):

SCHEMA_T1 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "STRING", mode="REPEATED")
]

SCHEMA_T2 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "RECORD", mode="REPEATED", fields=[
        bigquery.SchemaField("member", "STRING")
    ])
]

SCHEMA_T3 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "RECORD", mode="REPEATED", fields=[
        bigquery.SchemaField("year", "STRING"),
        bigquery.SchemaField("number", "STRING")
    ])
]
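
For context, SCHEMA_MAP from the load code above might then look like this (the table names are hypothetical):

# Maps a table name (the bq_table argument) to its schema definition
SCHEMA_MAP = {
    "table_1": SCHEMA_T1,
    "table_2": SCHEMA_T2,
    "table_3": SCHEMA_T3,
}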

Is it possible to define the customValues column to handle all 3 of these different data shapes in one single table? How would the schema be defined for this? Currently, if SCHEMA_T1 is used and data in the form of T2 or T3 is uploaded, the upload fails with the error: "Error while reading data, error message: JSON parsing error in row starting at position 0: JSON object specified for non-record field: customValues". The other schemas fail with similar errors. Is there a generic "any JSON" field in BigQuery that can be used for this?

The JSON feature is still in Preview for BigQuery (see launch stages). As a workaround, you can use load_table_from_dataframe from the BigQuery client, which lets you refine the data before pushing it into your working table.

Let's look at your scenario. Say we have a data.json file with the raw data:

data.json

[
    {
        "name": "tom",
        "customValues": []
    },
    {
        "name": "joe",
        "customValues": {
            "member": "1"
        }
    },
    {
        "name": "joe",
        "customValues": {
            "year": "2020",
            "number": "3"
        }
    }
]

And we have a single table on BigQuery that we need to populate:

create or replace table `my-project.my-dataset.a-table` (
   name STRING,
   customValues STRING
)

load.py

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()

table_id = "project-id.dataset-id.a-table"
df = pd.read_json('data.json')
# Stringify the arbitrary customValues structures so they fit the STRING column
df["customValues"] = df["customValues"].apply(str)

print(df.shape)
print(df.head())

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET, autodetect=True)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result() 

table = client.get_table(table_id)  
print("Loaded {} rows and {} columns to {}".format(table.num_rows, len(table.schema), table_id))

output

| **name** | **customValues**                |
|----------|---------------------------------|
| tom      | []                              |
| joe      | {'member': '1'}                 |
| joe      | {'year': '2020', 'number': '3'} |

As you can see, regardless of the structure of customValues, we are able to insert it into our working table (which only has 2 columns). We load the JSON data into a dataframe and then convert the column with apply so it fits the column type. For more information about apply, see the pandas documentation.
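
One caveat: apply(str) stores the Python repr (single quotes), which BigQuery JSON functions such as JSON_EXTRACT_SCALAR will not parse. If you intend to query customValues as JSON later, a small variation using json.dumps keeps the stored strings valid JSON:

import json

# json.dumps emits double-quoted, valid JSON, so queries like
# JSON_EXTRACT_SCALAR(customValues, '$.member') work on the STRING column
df["customValues"] = df["customValues"].apply(json.dumps)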

Update: BigQuery now supports JSON as a data type (since January 2022, so you are lucky with the timing); see the BigQuery release notes.

Therefore you should be able to go with:

SCHEMA_T = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "JSON")
]
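
A minimal sketch of a load job using this schema (the table ID and file name are assumptions); the newline-delimited JSON from the question then loads as-is, since a JSON column accepts objects and arrays alike:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "project-id.dataset-id.a-table"  # hypothetical

job_config = bigquery.LoadJobConfig(
    schema=SCHEMA_T,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# data.ndjson holds one JSON object per line, as in the question
with open("data.ndjson", "rb") as f:
    job = client.load_table_from_file(f, table_id, job_config=job_config)
job.result()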
