
How to upload CSV data that contains newlines with dbt

I have a third-party-generated CSV file that I wish to upload to Google BigQuery using dbt seed.

I managed to upload it manually to BigQuery, but I had to enable "Quoted newlines", which is off by default.

When I run dbt seed , I get the following error:

16:34:43  Runtime Error in seed clickup_task (data/clickup_task.csv)
16:34:43    Error while reading data, error message: CSV table references column position 31, but line starting at position:304 contains only 4 columns.

There are 32 columns in the CSV, and the file contains column values with newlines; I guess that's where the parsing fails. I checked the dbt seed configuration options, but I haven't found anything relevant.
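For reference, the seed-level configs I could set in dbt_project.yml look roughly like this (a sketch; my_project and the column names are just placeholders), and none of them seem to control how newlines inside quoted fields are handled:

dbt_project.yml

seeds:
  my_project:
    clickup_task:
      +quote_columns: true        # whether column names are quoted in the created table
      +column_types:              # override inferred data types per column
        task_id: string
        date_created: timestamp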

Any ideas?

As far as I know, the seed feature is limited to what is built into dbt-core, so seeds are not the way I would go here. You can see the history of requests for expanded seed options on the dbt-core issues repo (including my own request for similar optionality, #3990), but I have yet to see any real traction on this.


That said, what has worked very well for me is to store flat files within the GCP project in a GCS bucket and then use the dbt-external-tables package for very similar, but much more robust, file handling. Managing this can be a bit of overhead, I know, but it becomes very much worth it if your seed files keep expanding in a way that can take advantage of partitioning, for instance.
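Getting set up is just a packages.yml entry plus a dbt deps run, something like this (the version range is illustrative, so pin whatever the current release on the dbt package hub is):

packages.yml

packages:
  - package: dbt-labs/dbt_external_tables
    version: [">=0.8.0", "<0.9.0"]  # illustrative range; check the hub for the latest release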

And more importantly, as mentioned in this answer from Jeremy on Stack Overflow:

The dbt-external-tables package supports passing a dictionary of options for BigQuery external tables, which maps to the options documented here.

Which, for your case, points to the quote and allow_quoted_newlines options. If you did choose to use dbt-external-tables, your source.yml for this would look something like:

gcs.yml

version: 2

sources:
  - name: clickup
    database: external_tables
    loader: gcloud storage
  
    tables:
      - name: task
        description: "External table of ClickUp tasks, stored as CSV files in Cloud Storage"
        external:
          location: 'gs://bucket/clickup/task/*'
          options:
            format: csv
            skip_leading_rows: 1
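            # the next two options map to BigQuery's quote character and allowQuotedNewlines settings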
            quote: "\""
            allow_quoted_newlines: true

Or something very similar. And if you end up taking this path and storing task data in daily partitioned files like tasks_2022_04_16.csv, you can access that file name and other metadata via the provided pseudocolumns, which Jeremy also shared with me here:

Retrieve "filename" from gcp storage during dbt-external-tables sideload?

I find it to be a very powerful set of tools, specifically for working with files in BigQuery.
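One practical note if you go this route: the package creates and refreshes the external table definitions when you run dbt run-operation stage_external_sources (with --vars "ext_full_refresh: true" to force a rebuild, if I remember the flag correctly); dbt run and dbt seed by themselves will not create them.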

I am trying to use the stage_external_sources option, and I have installed the package. To keep things basic, I am trying to run the example models that come with the dbt-core install, with the aim of creating an external Hive table whose data is stored in a GCS bucket.

Here is what my packages.yml file looks like:

[screenshot of packages.yml]

And when I run dbt:

[screenshot of the dbt run output]

The dbt run command runs successfully but the data is not getting written to the GCS bucket. Any help on this would be greatly appreciated.
