Google Cloud Storage: Joining multiple CSV files

I exported a dataset from Google BigQuery to Google Cloud Storage; given the size of the dataset, BigQuery exported it as 99 CSV files.

However, now I want to connect to my GCP bucket and perform some analysis with Spark, so I need to join all 99 files into a single large CSV file to run my analysis.

How can this be achieved?

BigQuery splits the exported data into several files if it is larger than 1 GB. However, you can merge these files with the gsutil tool; check the official documentation to learn how to perform object composition with gsutil.

Since BigQuery exports the files with the same prefix, you can use a wildcard * to merge them into one composite object:

gsutil compose gs://example-bucket/component-obj-* gs://example-bucket/composite-object

Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
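Because there are 99 exported files and compose accepts at most 32 source objects per call, the composition has to be done in batches, with the intermediates composed again at the end. Below is a minimal sketch of that pattern using the google-cloud-storage Python client; the bucket name, prefix, and output names are assumed placeholders, not taken from the question:

from google.cloud import storage

# Assumed placeholder names -- substitute your own bucket and export prefix.
BUCKET_NAME = 'example-bucket'
PREFIX = 'component-obj-'
BATCH_SIZE = 32  # current compose limit per operation

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# List the exported shards and merge them 32 at a time, server-side.
blobs = sorted(client.list_blobs(BUCKET_NAME, prefix=PREFIX),
               key=lambda b: b.name)

intermediates = []
for i in range(0, len(blobs), BATCH_SIZE):
    dest = bucket.blob('intermediate-{}.csv'.format(i // BATCH_SIZE))
    dest.compose(blobs[i:i + BATCH_SIZE])  # no download/upload of data needed
    intermediates.append(dest)

# With 99 shards there are at most 4 intermediates, so one final compose suffices.
final = bucket.blob('composite-object.csv')
final.compose(intermediates)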

The downside of this option is that the header row of each .csv file will be included in the composite object. You can avoid this by modifying the job configuration to set the print_header parameter to False.

Here is a Python sample, but you can use any other BigQuery client library:

from google.cloud import bigquery
client = bigquery.Client()
bucket_name = 'yourBucket'

project = 'bigquery-public-data'
dataset_id = 'libraries_io'
table_id = 'dependencies'

destination_uri = 'gs://{}/{}'.format(bucket_name, 'file-*.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

job_config = bigquery.job.ExtractJobConfig(print_header=False)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US',
    job_config=job_config)  # API request

extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))

Finally, remember to compose an empty .csv containing just the header row on top of the merged data.
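For that last step, one option (a sketch assuming the data was exported with print_header=False and already composed into a single header-less object; all names and columns below are placeholders) is to upload a header-only object and compose it in front of the data:

from google.cloud import storage

# Assumed placeholder bucket, object names and column list.
BUCKET_NAME = 'yourBucket'
HEADER_ROW = 'col_a,col_b,col_c\n'

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Upload a one-line object containing only the header row.
header_blob = bucket.blob('header.csv')
header_blob.upload_from_string(HEADER_ROW, content_type='text/csv')

# Prepend it to the previously merged, header-less data object.
data_blob = bucket.blob('composite-object.csv')
bucket.blob('final-with-header.csv').compose([header_blob, data_blob])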

I got tired of doing multiple recursive compose operations, stripping headers, etc., especially when dealing with 3,500 split gzipped CSV files.

So I wrote a CSV merge tool (sorry, Windows only) to solve exactly this problem.

https://github.com/tcwicks/DataUtilities

Download the latest release, unzip, and use.

I also wrote an article with a use case and usage example:

https://medium.com/@TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826

Hope it is of use to someone.

P.S. I recommend tab-delimited over CSV, as it tends to have fewer data issues.
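If you do want tab-delimited output straight from BigQuery, the extract job configuration accepts a field_delimiter. A minimal sketch reusing the client-library approach from the first answer (bucket, dataset, and table names are assumed placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Assumed placeholder names; reuse your own bucket, dataset and table.
destination_uri = 'gs://yourBucket/file-*.tsv'
table_ref = client.dataset('yourDataset').table('yourTable')

# Export tab-delimited instead of comma-separated, without header rows.
job_config = bigquery.job.ExtractJobConfig(
    field_delimiter='\t',
    print_header=False,
)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    location='US',  # must match the source table's location
    job_config=job_config)
extract_job.result()  # waits for the export to finish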
