
Transforming JSONs with Google Cloud Platform before BigQuery, best practice?

I have a deeply nested JSON document of variable length, with arrays that vary from document to document. I am looking to unnest certain sections and write them to BigQuery, and disregard the others.

I was excited about Dataprep by Trifacta, but since Trifacta would be accessing the data, it will not work for my company. We work with healthcare data and have only authorized Google.

Has anyone worked with other solutions in GCP to transform JSON? The documents are so long and deeply nested that writing a custom regex and running it on a pod before ingestion takes significant compute.

You can try this:

[1] Flatten the JSON document using jq:

cat source.json | jq -c '.[]' > target.json

[2] Load the transformed JSON file (using --autodetect):

bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable target.json

Result:

BigQuery will automatically create the RECORD (STRUCT) data type for the nested data.
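If you also need to keep only certain nested sections and disregard the rest (as the question asks), the same flatten step can be done with a small Python script before bq load. This is only a sketch under assumed names: the file names and the "patient"/"encounters" fields are placeholders for your own document structure.

import json

# Minimal sketch: flatten a top-level JSON array into newline-delimited JSON
# and keep only the sections you want, before running `bq load`.
# "source.json", "target.json", "patient" and "encounters" are placeholders.
with open("source.json") as src, open("target.json", "w") as dst:
    records = json.load(src)  # assumes the source file is one JSON array
    for record in records:
        row = {
            "patient": record.get("patient"),        # nested section to keep
            "encounters": record.get("encounters"),  # repeated section to keep
            # anything not copied here is disregarded
        }
        dst.write(json.dumps(row) + "\n")  # one JSON object per line (NDJSON)

The resulting target.json can then be loaded with the same bq load --autodetect command shown above.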

Dataflow can also be useful for this purpose:

  • With Dataflow you can create Apache Beam preprocessing pipelines that run entirely on Google's infrastructure.
  • With Beam's ParDo transform you can apply any function written in Java, Python, or Go to your nested data (see the sketch after this list).
  • Here is an example of how to do it efficiently in Python.
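A minimal sketch of such a pipeline, assuming newline-delimited JSON in a Cloud Storage bucket and a nested "encounters" array to unnest; the bucket path, table name, and field names ("patient_id", "id", "date") are placeholders, not part of the original answer.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class UnnestEncounters(beam.DoFn):
    """Emit one flat row per element of the nested 'encounters' array."""

    def process(self, line):
        doc = json.loads(line)
        for enc in doc.get("encounters", []):
            yield {
                "patient_id": doc.get("patient_id"),
                "encounter_id": enc.get("id"),
                "date": enc.get("date"),
            }


def run():
    # Pass --runner=DataflowRunner, --project, --region, etc. on the command line.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/source.json")
            | "Unnest" >> beam.ParDo(UnnestEncounters())
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:mydataset.mytable",
                schema="patient_id:STRING,encounter_id:STRING,date:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()

Because the ParDo runs inside your own GCP project, no third party touches the data, which matters for the healthcare constraint in the question.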
