
Handling big files with Google Cloud Storage API

What I need to achieve is to concatenate a list of files into a single file using the cloudstorage library. This needs to happen inside a MapReduce shard, which has a 512MB upper limit on memory, but the concatenated file could be larger than 512MB.

The following code segment breaks when the file size hits the memory limit.

list_of_files = [...]
with cloudstorage.open(filename...) as file_handler:
    for a in list_of_files:
        with cloudstorage.open(a) as f:
            file_handler.write(f.read())

Is there a way to work around this issue? Maybe open or append the files in chunks? And how would I do that? Thanks!

== EDIT ==

After some more testing, it seems that the memory limit only applies to f.read(), while writing to a large file is fine. Reading the files in chunks solved my issue, but I really like the compose() function that @Ian-Lewis pointed out. Thanks!

For large files you will want to break the file up into smaller files, upload each of those, and then merge them together as composite objects. You will want to use the compose() function from the library. It seems there are no docs on it yet.

After you've uploaded all the parts, something like the following should work. One thing to make sure of is that the paths of the files to be composed don't contain the bucket name or a leading slash.

stat = cloudstorage.compose(
    [
        "path/to/part1",
        "path/to/part2",
        "path/to/part3",
        # ...
    ],
    "/my_bucket/path/to/output"
)
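To illustrate that path requirement, a small helper could normalize object names before they are passed to compose(). strip_bucket is a hypothetical function (not part of the library), and "my_bucket" is just the example bucket name from above.

```python
def strip_bucket(path, bucket="my_bucket"):
    """Turn an absolute name like '/my_bucket/path/to/part1' into the
    bucket-relative form 'path/to/part1' that compose() expects."""
    prefix = "/" + bucket + "/"
    if path.startswith(prefix):
        path = path[len(prefix):]
    # Also drop any remaining leading slash on already-relative paths.
    return path.lstrip("/")
```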

You may also want to check out the gsutil tool if possible. It can do automatic splitting, parallel uploading, and compositing of large files for you.
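For example, gsutil's parallel_composite_upload_threshold option makes it split and compose any file above a given size automatically (the 150M threshold and the paths here are arbitrary examples):

```shell
# Upload a large file: gsutil splits it into parts, uploads them in
# parallel, and composes them into a single object in the bucket.
gsutil -o "GSUtil:parallel_composite_upload_threshold=150M" \
    cp ./big_file gs://my_bucket/path/to/output
```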
