
How to upload csv data that contains newline with dbt

I have a third-party-generated CSV file that I wish to upload to Google BigQuery using dbt seed.

I managed to upload it manually to BigQuery, but I needed to enable "Quoted newlines", which is off by default.

When I run dbt seed, I get the following error:

16:34:43  Runtime Error in seed clickup_task (data/clickup_task.csv)
16:34:43    Error while reading data, error message: CSV table references column position 31, but line starting at position:304 contains only 4 columns.

There are 32 columns in the CSV. The file contains column values with newlines. I guess that's where the dbt parser fails. I checked the dbt seed configuration options, but I haven't found anything relevant.
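For context, a newline inside a quoted field is perfectly valid CSV (RFC 4180 allows it); a compliant parser treats it as one record, which is why the file loads fine once BigQuery's quoted-newlines option is enabled. A minimal sketch with Python's standard csv module illustrates this:

```python
import csv
import io

# A two-column CSV where the second field contains a quoted newline.
# A compliant CSV parser treats this as ONE record, not two.
raw = 'id,notes\n1,"first line\nsecond line"\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows)       # [['id', 'notes'], ['1', 'first line\nsecond line']]
print(len(rows))  # 2 -> header plus a single data record
```

A naive line-by-line reader, by contrast, would see three lines and misalign the columns, which matches the "contains only 4 columns" symptom in the error above.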

Any ideas?

As far as I know, the seed feature is limited to what is built into dbt-core, so seeds are not the way I'd go here. You can see the history of requests for expanded seed options on the dbt-core issues repo (including my own request for similar optionality, #3990), but I have yet to see any real traction on this.


That said, what has worked very well for me is to store flat files within the GCP project in a GCS bucket and then use the dbt-external-tables package for a very similar but much more robust file-structuring workflow. Managing this can add a lot of overhead, I know, but it becomes very much worth it if your seed files keep expanding in a way that can take advantage of partitioning, for instance.

And more importantly, as mentioned in this answer from Jeremy on Stack Overflow,

The dbt-external-tables package supports passing a dictionary of options for BigQuery external tables, which maps to the options documented here.

Which for your case should be either the quote or allowQuotedNewlines option. If you did choose to use dbt-external-tables, your source.yml for this would look something like:

gcs.yml

version: 2

sources:
  - name: clickup
    database: external_tables
    loader: gcloud storage
  
    tables:
      - name: task
        description: "External table of Snowplow events, stored as CSV files in Cloud Storage"
        external:
          location: 'gs://bucket/clickup/task/*'
          options:
            format: csv
            skip_leading_rows: 1
            quote: "\""
            allow_quoted_newlines: true
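Under the hood, options declared this way end up in the OPTIONS(...) clause of the CREATE EXTERNAL TABLE DDL that the package generates. A rough sketch of that key-to-DDL mapping (the render_option helper is mine for illustration, not part of the package):

```python
def render_option(key, value):
    """Render one external-table option as a BigQuery DDL fragment (illustrative only)."""
    if isinstance(value, bool):
        return f"{key} = {'true' if value else 'false'}"
    if isinstance(value, int):
        return f"{key} = {value}"
    # String values become quoted SQL literals; escape embedded single quotes.
    escaped = str(value).replace("'", "\\'")
    return f"{key} = '{escaped}'"

# The options dict from the gcs.yml source definition above.
options = {
    "format": "csv",
    "skip_leading_rows": 1,
    "quote": '"',
    "allow_quoted_newlines": True,
}

clause = "OPTIONS (\n  " + ",\n  ".join(
    render_option(k, v) for k, v in options.items()
) + "\n)"
print(clause)
```

The key point is that allow_quoted_newlines lands as a boolean in the DDL, which is exactly the setting the manual BigQuery upload needed.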

Or something very similar. And if you end up taking this path and storing task data in a daily partition like tasks_2022_04_16.csv, you can access that file name and other metadata via the provided pseudocolumns, which Jeremy also shared with me here:

Retrieve "filename" from gcp storage during dbt-external-tables sideload?
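With daily files like tasks_2022_04_16.csv, you would typically pull the partition date out of the file URI exposed by the pseudocolumn (e.g. with REGEXP_EXTRACT in SQL). The equivalent logic, sketched in Python with an assumed file-name pattern:

```python
import re
from datetime import date

def partition_date(uri: str) -> date:
    """Extract the YYYY_MM_DD date embedded in a tasks_*.csv file name.

    Mirrors what a REGEXP_EXTRACT over the file-name pseudocolumn would do
    in BigQuery SQL. The naming pattern is an assumption for illustration.
    """
    m = re.search(r"tasks_(\d{4})_(\d{2})_(\d{2})\.csv$", uri)
    if not m:
        raise ValueError(f"unexpected file name: {uri}")
    return date(*map(int, m.groups()))

print(partition_date("gs://bucket/clickup/task/tasks_2022_04_16.csv"))
# 2022-04-16
```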

I find it to be a very powerful set of tools for working with files in BigQuery specifically.

I am trying to use the stage_external_sources option... I have installed the package. Very basic: I am trying to run the example models that come with the dbt core install, with the aim of creating an external Hive table whose data is stored in a GCS bucket.

Here is what my package yml file looks like:

[screenshot of the packages.yml file]

And when I run dbt:

[screenshot of the dbt run output]

The dbt run command runs successfully, but the data is not getting written to the GCS bucket. Any help on this would be greatly appreciated.
