I have a 3rd party generated CSV file that I wish to upload to Google BigQuery using dbt seed.
I managed to upload it manually to BigQuery, but I needed to enable "Quoted newlines", which is off by default.
When I run dbt seed, I get the following error:
16:34:43 Runtime Error in seed clickup_task (data/clickup_task.csv)
16:34:43 Error while reading data, error message: CSV table references column position 31, but line starting at position:304 contains only 4 columns.
There are 32 columns in the CSV. The file contains column values with newlines. I guess that's where the dbt parser fails. I checked the dbt seed configuration options, but I haven't found anything relevant.
Any ideas?
As far as I know, the seed feature is very limited by what is built into dbt-core, so seeds are not the way I would go here. You can see the history of requests for expanding the seed options on the dbt-core issues repo (including my own request for similar optionality, #3990), but I have yet to see any real traction on this.
That said, what has worked very well for me is to store flat files within the GCP project in a GCS bucket and then use the dbt-external-tables package for very similar but much more robust file structuring. Managing this can be a lot of overhead, I know, but it becomes well worth it if your seed files keep expanding in a way that can take advantage of partitioning, for instance.
And more importantly, as mentioned in this answer from Jeremy on Stack Overflow:
"The dbt-external-tables package supports passing a dictionary of options for BigQuery external tables, which maps to the options documented here."
For your case, that should be either the quote or the allowQuotedNewlines option (spelled allow_quoted_newlines in the DDL-style options). If you choose to use dbt-external-tables, your source.yml for this would look something like:
gcs.yml
version: 2

sources:
  - name: clickup
    database: external_tables
    loader: gcloud storage
    tables:
      - name: task
        description: "External table of Snowplow events, stored as CSV files in Cloud Storage"
        external:
          location: 'gs://bucket/clickup/task/*'
          options:
            format: csv
            skip_leading_rows: 1
            quote: "\""
            allow_quoted_newlines: true
Or something very similar. And if you end up taking this path and storing task data in daily files such as tasks_2022_04_16.csv, you can access the file name and other metadata through the provided pseudocolumns, which Jeremy also shared with me here:
Retrieve "filename" from gcp storage during dbt-external-tables sideload?
I find it to be a very powerful set of tools for files specifically with BigQuery.
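For completeness, here is a minimal sketch of how the package is usually wired into a project. The version pin is an assumption on my part (use whichever release matches your dbt version); the package name and the stage_external_sources operation are the ones described in the dbt-external-tables README.

```yaml
# packages.yml (sketch; the version below is an assumed example, not a recommendation)
packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0

# Then, from the project root:
#   dbt deps                                    # install the package
#   dbt run-operation stage_external_sources    # create the external tables defined under `external:`
#   dbt run-operation stage_external_sources --args "select: clickup"   # stage only the clickup source
```

As far as I understand, the external tables are created by that run-operation step rather than by dbt run or dbt seed, so it has to be run before any model selects from the source.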
I am trying to use the stage_external_sources option. I have installed the package. As a very basic test, I am trying to run the example models that come with the dbt core install, with the aim of creating an external Hive table with the data being stored in a GCS bucket.
Here is what my package yml file looks like.
And when I run dbt:
The dbt run command runs successfully, but the data is not getting written to the GCS bucket. Any help on this would be greatly appreciated.