
Transforming JSONs with Google Cloud Platform before BigQuery, best practice?

I have a deeply nested JSON document of variable length, with arrays that vary from document to document. I am looking to unnest certain sections and write them to BigQuery, and disregard the others.

I was excited about Dataprep by Trifacta, but since Trifacta would be accessing the data, it will not work for my company. We work with healthcare data and have only authorized Google.

Has anyone worked with other solutions in GCP to transform JSON? The documents are so long and deeply nested that writing a custom regex and running it on a pod before ingestion takes significant compute.

You can try this:

[1] Flatten the JSON document using jq:

cat source.json | jq -c '.[]' > target.json

[2] Load the transformed JSON file (using --autodetect):

bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable target.json

Result:

BigQuery will automatically create the RECORD (STRUCT) data type for the nested data.
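If you also need to keep only certain nested sections and disregard the rest (as the question asks), the same flatten step can be done with a small Python script before bq load. This is only a sketch under assumed names: the file names and the "patient"/"encounters" fields are placeholders for your own document structure.

import json

# Minimal sketch: flatten a top-level JSON array into newline-delimited JSON
# and keep only the sections you want, before running `bq load`.
# "source.json", "target.json", "patient" and "encounters" are placeholders.
with open("source.json") as src, open("target.json", "w") as dst:
    records = json.load(src)  # assumes the source file is one JSON array
    for record in records:
        row = {
            "patient": record.get("patient"),        # nested section to keep
            "encounters": record.get("encounters"),  # repeated section to keep
            # anything not copied here is disregarded
        }
        dst.write(json.dumps(row) + "\n")  # one JSON object per line (NDJSON)

The resulting target.json can then be loaded with the same bq load --autodetect command shown above.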

Dataflow can also be useful for this purpose:

  • With Dataflow you can create Apache Beam preprocessing pipelines that run entirely on Google's infrastructure.
  • With Beam's ParDo transform you can apply any function written in Java, Python, or Go to your nested data (see the sketch after this list).
  • Here is an example of how to do it efficiently in Python.
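A minimal sketch of such a pipeline, assuming newline-delimited JSON in a Cloud Storage bucket and a nested "encounters" array to unnest; the bucket path, table name, and field names ("patient_id", "id", "date") are placeholders, not part of the original answer.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class UnnestEncounters(beam.DoFn):
    """Emit one flat row per element of the nested 'encounters' array."""

    def process(self, line):
        doc = json.loads(line)
        for enc in doc.get("encounters", []):
            yield {
                "patient_id": doc.get("patient_id"),
                "encounter_id": enc.get("id"),
                "date": enc.get("date"),
            }


def run():
    # Pass --runner=DataflowRunner, --project, --region, etc. on the command line.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/source.json")
            | "Unnest" >> beam.ParDo(UnnestEncounters())
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:mydataset.mytable",
                schema="patient_id:STRING,encounter_id:STRING,date:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()

Because the ParDo runs inside your own GCP project, no third party touches the data, which matters for the healthcare constraint in the question.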
