
How to combine .csv files in several shards in Google Cloud bucket?

I have moved a public dataset, available as a public Google Storage bucket, into my own bucket. The file size is about 10 GB. When the data moved, the file was split into about 47 shards, all compressed. I am unable to combine them into one file. How can I combine them?

The information given on the following link does not help much:

https://cloud.google.com/storage/docs/gsutil/commands/compose
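
For what it's worth on the compose route: a single gsutil compose call accepts at most 32 source objects, so 47 shards would need an intermediate pass, and composing compressed shards only concatenates the compressed bytes rather than producing a plain .csv. A rough sketch with hypothetical object names (my-bucket and data-00000 … data-00046 are placeholders, not the real names; bash brace expansion enumerates them):

# Hypothetical shard names; adjust the patterns to the real objects in the bucket.
# gsutil compose accepts at most 32 components per call, so merge in two passes.
gsutil compose gs://my-bucket/data-000{00..31} gs://my-bucket/part-a
gsutil compose gs://my-bucket/data-000{32..46} gs://my-bucket/part-b
gsutil compose gs://my-bucket/part-a gs://my-bucket/part-b gs://my-bucket/merged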

My bucket looks like this:

[screenshot of the bucket's object listing]

Any help will be appreciated.

I propose you use Cloud Build. It's not the most obvious solution, but it's serverless and cheap, which is perfect for your one-time use case. Here is what I propose:

steps:
- name: 'gcr.io/cloud-builders/gsutil'
  entrypoint: "bash"
  args:
    - -c
    - |
       # Copy all the shards locally
       gsutil -m cp gs://311_nyc/311* .

       # Uncompress the files here if needed
       # (the compression method isn't stated; gunzip if they are gzip files)

       # Append each file to a single merged file, deleting it after the merge
       for file in $(ls -1 311*); do cat $file >> merged; rm $file; done

       # Copy the merged file to the destination bucket
       gsutil cp merged gs://myDestinationBucket/myName.csv

options:
  # Use 1 TB of disk so all the files fit on the same worker at the same time.
  # It isn't clear whether the 10 GB is per uncompressed file or the total size;
  # if it's the total size, this option is probably unnecessary.
  diskSizeGb: 1000

# Optionally extend the default 10-minute timeout if the build takes longer.
timeout: 660s
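
To run this, you could save the config as, say, cloudbuild.yaml and submit it with gcloud (the file name is my choice; also, the Cloud Build service account needs read access to the source bucket and write access to the destination bucket):

# Submit the build without uploading any local source
gcloud builds submit --no-source --config=cloudbuild.yaml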

Combine using Node.js

const { Storage } = require('@google-cloud/storage');
const storage = new Storage();
// must run inside an async function (or with top-level await)
await storage.bucket(bucketName).combine(sourceFilenameList, destFilename);
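
Note that combine() uses the same compose API under the hood, which accepts at most 32 source objects per call, so with ~47 shards sourceFilenameList would have to be split into batches (or merged through an intermediate composite object, as in the gsutil sketch above).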

In case anyone else needs something like this:

This should do what you need:

Command line utility: https://github.com/tcwicks/DataUtilities

Download the latest release, unzip, and use it.

Explanation / guide: https://medium.com/@TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826

Hope someone finds it useful.
