How to combine .csv files in several shards in Google Cloud bucket?
I copied a public dataset, available as a public Google Cloud Storage bucket, into my own bucket. The file is about 10 GB. When the data was moved, the file was split into about 47 shards, all compressed. I am unable to combine them into one file. How can I combine them?
The information at the following link does not help much:
https://cloud.google.com/storage/docs/gsutil/commands/compose
My bucket looks like this:
Any help will be appreciated.
I suggest using Cloud Build. It's not the most obvious solution, but it's serverless and cheap, which makes it a good fit for a one-time use case. Here is the build configuration I propose:
steps:
  - name: 'gcr.io/cloud-builders/gsutil'
    entrypoint: "bash"
    args:
      - -c
      - |
        # Copy all your files locally
        gsutil -m cp gs://311_nyc/311* .
        # Uncompress your files here.
        # I don't know your compression method. gunzip?
        # Append each file to a merged file, deleting it after the merge.
        # "$$" escapes "$" so that Cloud Build does not treat it as a substitution.
        for file in 311*; do cat "$$file" >> merged; rm "$$file"; done
        # Copy the merged file to the destination bucket
        gsutil cp merged gs://myDestinationBucket/myName.csv
options:
  # Use 1 TB of disk so that all the files fit on the same server at the same time.
  # I didn't understand whether the 10 GB is per uncompressed file or the total size.
  # If it's the total size, this option is probably unnecessary.
  diskSizeGb: 1000
# Optionally extend the default 10-minute timeout if the build takes too long.
timeout: 660s
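One thing the cat loop in the configuration above does not handle: if every shard starts with its own CSV header row, a plain concatenation repeats that header in the middle of the merged file. The sketch below shows one way to drop the repeated headers. It assumes the shards are gzip-compressed (the question does not say which compression is used) and that each shard has exactly one header line; mergeGzippedCsvShards is a hypothetical helper name, and it works on in-memory buffers purely to illustrate the logic. For the full 10 GB dataset you would want to stream instead.

```javascript
const zlib = require('zlib');

// Hypothetical helper: merge gzip-compressed CSV shards held in memory,
// keeping the header row of the first shard only.
// Assumptions: gzip compression, one header line per shard,
// and no quoted newlines inside the header.
function mergeGzippedCsvShards(shardBuffers, hasHeader = true) {
  const parts = [];
  shardBuffers.forEach((buf, index) => {
    let text = zlib.gunzipSync(buf).toString('utf8');
    if (hasHeader && index > 0) {
      // Drop everything up to and including the first newline (the header).
      const firstNewline = text.indexOf('\n');
      text = firstNewline === -1 ? '' : text.slice(firstNewline + 1);
    }
    // Make sure each shard ends with a newline so rows do not run together.
    if (text.length > 0 && !text.endsWith('\n')) text += '\n';
    parts.push(text);
  });
  return parts.join('');
}
```

If the shards turn out to have no header row, passing false as the second argument reduces this to a plain concatenation, which is what the cat loop does.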
Combine using Node.js:
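const { Storage } = require('@google-cloud/storage');
const storage = new Storage();
// Run inside an async function. A single compose request accepts at most 32 source objects.
await storage.bucket(bucketName).combine(sourceFilenameList, destFilename);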
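Because a compose request is limited to 32 source objects, the 47 shards from the question cannot be combined in a single call. A minimal sketch of one way around this is to compose the shards in batches into intermediate objects and then compose those. The helper names planComposeSteps and runComposePlan and the ".part-" naming of the intermediate objects are hypothetical, chosen only for illustration; the only library call assumed is bucket.combine(sources, destination) as used above.

```javascript
const COMPOSE_LIMIT = 32; // maximum number of source objects per compose request

// Hypothetical helper: return the ordered list of compose calls needed to
// merge `sources` into `dest` without exceeding `batchSize` sources per call.
function planComposeSteps(sources, dest, batchSize = COMPOSE_LIMIT, level = 0) {
  if (batchSize < 2) throw new Error('batchSize must be at least 2');
  if (sources.length === 0) throw new Error('no source objects given');
  if (sources.length <= batchSize) return [{ sources, dest }];

  const steps = [];
  const partials = [];
  for (let i = 0; i < sources.length; i += batchSize) {
    // Intermediate object name (hypothetical naming scheme).
    const partial = `${dest}.part-${level}-${partials.length}`;
    steps.push({ sources: sources.slice(i, i + batchSize), dest: partial });
    partials.push(partial);
  }
  // The intermediate objects may themselves need batching.
  return steps.concat(planComposeSteps(partials, dest, batchSize, level + 1));
}

// Hypothetical helper: run the plan sequentially against a Bucket object
// from @google-cloud/storage (or anything exposing combine(sources, dest)).
async function runComposePlan(bucket, steps) {
  for (const step of steps) {
    await bucket.combine(step.sources, step.dest);
  }
}

// Usage sketch (not executed here):
//   const { Storage } = require('@google-cloud/storage');
//   const bucket = new Storage().bucket(bucketName);
//   await runComposePlan(bucket, planComposeSteps(sourceFilenameList, destFilename));
```

Compose concatenates the objects byte for byte in the order given, so the source list should be sorted the way the rows are meant to appear, and the intermediate ".part-" objects are left in the bucket and need to be deleted afterwards. The shards are not decompressed, so the result is still compressed data.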
In case anyone else needs something like this, this should do what you need:
Command-line utility: https://github.com/tcwicks/DataUtilities
Download the latest release, unzip it, and use it.
Explanation / guide: https://medium.com/@TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826
Hope someone finds it useful.