
Loading csv.gz from url to bigquery

I am trying to load all the csv.gz files from this url into Google BigQuery. What is the best way to do this?

I tried using PySpark to read the csv.gz files (as I need to perform some data cleaning on these files), but I realized that PySpark doesn't support reading files directly from a URL. Would it make sense to load the cleaned versions of the csv.gz files into BigQuery, or should I dump the raw, original csv.gz files into BigQuery and perform my cleaning process in BigQuery itself?
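For reference, a minimal sketch of the download-first workaround I had in mind, where the URL, local path, and read options are only placeholders:

```python
# Minimal sketch: download a csv.gz file first, then read it with PySpark.
# The URL, local path, and read options below are placeholders.
import urllib.request
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-gz-cleaning").getOrCreate()

# PySpark cannot read straight from an HTTP URL, so download the file first.
url = "https://example.com/data/file1.csv.gz"   # placeholder URL
local_path = "/tmp/file1.csv.gz"
urllib.request.urlretrieve(url, local_path)

# Spark decompresses gzip-compressed CSV transparently based on the .gz extension.
df = spark.read.csv(local_path, header=True, inferSchema=True)
df.show(5)
```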

I was reading the "Google BigQuery: The Definitive Guide" book, and it suggests loading the data onto Google Cloud Storage. Do I have to load each csv.gz file into Google Cloud Storage, or is there an easier way to do this?

Thanks for your help!

As @Samuel mentioned, you can use the curl command to download the files from the URL and then copy the files to a GCS bucket. If you have heavy transformations to be done on the data, I would recommend using Cloud Dataflow; otherwise you can go with a Cloud Dataprep workflow, and finally export your clean data to a BigQuery table.
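For example, once a file has been downloaded locally (e.g. with curl), something along these lines could stage it in GCS and load it into BigQuery using the official Python client libraries; the bucket, dataset, and table names here are placeholders:

```python
# Rough sketch, assuming placeholder bucket/dataset/table names and that the
# csv.gz file has already been downloaded locally (e.g. with curl).
from google.cloud import storage, bigquery

local_path = "/tmp/file1.csv.gz"          # hypothetical local file
bucket_name = "my-staging-bucket"         # placeholder GCS bucket
gcs_uri = f"gs://{bucket_name}/raw/file1.csv.gz"

# Copy the file to Cloud Storage.
storage.Client().bucket(bucket_name).blob("raw/file1.csv.gz").upload_from_filename(local_path)

# Load the gzipped CSV from GCS into a BigQuery table.
# BigQuery decompresses .gz CSV files automatically during the load.
bq = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assumes a header row
    autodetect=True,       # infer the schema; use an explicit schema if known
)
load_job = bq.load_table_from_uri(gcs_uri, "my_dataset.my_table", job_config=job_config)
load_job.result()  # wait for the load job to finish
print(f"Loaded {bq.get_table('my_dataset.my_table').num_rows} rows")
```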

Choosing BigQuery for transformations depends entirely on your use case, data size, and budget; that is, if you have a high volume of data, direct transformations could be costly.
