
Read zip files from amazon s3 using boto3 and python

I have an S3 bucket which has a large number of zip files, each several GB in size. I need to calculate the data length of all the zip files. I looked through boto3 but couldn't figure it out. I am not sure whether it can read a zip file directly, but the process I have in mind is:

  1. Connect to the bucket.
  2. Read the zip files from a folder in the bucket (say the folder is Mydata).
  3. Extract the zip files to another folder named Extracteddata.
  4. Read the Extracteddata folder and act on the files.

Note: Nothing should be downloaded to local storage; the whole process should stay S3 to S3. Any suggestions are appreciated.

What you want to do is impossible, as explained by John Rotenstein's answer. You have to download the zipfile, not necessarily to local storage, but at least into local memory, using up your local bandwidth. There's no way to run any code on S3.

However, there may be a way to get what you're really after here anyway.

If you could just download, say, 8KB worth of the file, instead of the whole 5GB, would that be good enough? If so, and if you're willing to do a bit of work, then you're in luck. And what if you had to download, say, 1MB, but could do a lot less work?


If 1MB doesn't sound too bad, and you're willing to get a little hacky:

The only thing you want is a count of how many files are in the zipfile. For a zipfile, all of that information is available in the central directory, a very small chunk of data at the very end of the file.

And if you have the entire central directory, even if you're missing the rest of the file, the zipfile module in the stdlib will handle it just fine. It isn't documented to do so, but, at least in the versions included in recent CPython and PyPy 3.x, it definitely will.

So, what you can do is this:

  • Make a HEAD request to get just the headers. (In boto, you do this with head_object.)
  • Extract the file size from the Content-Length header.
  • Make a GET request with a Range header to only download from, say, size-1048576 to the end. (In boto, I believe you may have to call get_object instead of one of the download* convenience methods, and you have to format the Range header value yourself.) A rough boto3 sketch of these three steps follows this list.
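Here is a minimal sketch of those three steps, assuming boto3 is configured and using hypothetical bucket/key names; head_object, get_object and the bytes=start-end Range format are standard S3 API usage:

import io
import zipfile
import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'Mydata/archive.zip'  # hypothetical names

# HEAD request: get the object size without downloading the body.
size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']

# Ranged GET: fetch (at most) the last 1MB of the object.
start = max(0, size - 1048576)
resp = s3.get_object(Bucket=bucket, Key=key, Range=f'bytes={start}-{size - 1}')
buf = resp['Body'].read()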

Now, assuming you've got that last 1MB in a buffer buf:

z = zipfile.ZipFile(io.BytesIO(buf))   # works with just the trailing central directory
count = len(z.filelist)                # number of entries in the archive

Usually, 1MB is more than enough. But what about when it isn't? Well, here's where things get a little hacky. The zipfile module knows how many more bytes you need, but the only place it gives you that information is in the text of the exception description. So:

import re

try:
    z = zipfile.ZipFile(io.BytesIO(buf))
except ValueError as e:
    # zipfile only reports how many bytes are missing in the exception text
    m = re.match(r'negative seek value -(\d+)', e.args[0])
    if not m:
        raise
    extra = int(m.group(1))
    # now go read from size-1048576-extra to size-1048576, prepend to buf, try again
count = len(z.filelist)
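One way to fill in that last "try again" step, reusing the s3, bucket, key, size and buf names assumed in the sketch above, is a small retry loop; note that it still relies on the undocumented exception text:

while True:
    try:
        z = zipfile.ZipFile(io.BytesIO(buf))
        break
    except ValueError as e:
        m = re.match(r'negative seek value -(\d+)', e.args[0])
        if not m:
            raise
        extra = int(m.group(1))
        # fetch the `extra` bytes immediately before what we already have and prepend them
        new_start = max(0, size - len(buf) - extra)
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range=f'bytes={new_start}-{size - len(buf) - 1}')
        buf = resp['Body'].read() + buf

count = len(z.filelist)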

If 1MB already sounds like too much bandwidth, or you don't want to rely on undocumented behavior of the zipfile module, you just need to do a bit more work.

In almost every case, you don't even need the whole central directory, just the total number of entries field within the end of central directory record, an even smaller chunk of data at the very end of the file.

So, do the same as above, but only read the last 8KB instead of the last 1MB.

And then, based on the zip format spec, write your own parser.

Of course you don't need to write a complete parser, or even close to it. You just need enough to deal with the fields from total number of entries to the end, all of which are fixed-size fields except for the zip64 extensible data sector and/or the .ZIP file comment.

Occasionally (e.g., for zipfiles with huge comments), you will need to read more data to get the count. This should be pretty rare, but if, for some reason, it turns out to be more common with your zipfiles, you can just change that 8192 guess to something larger.
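As a rough illustration of such a parser, assuming the last 8KB of the object are in buf, one could search backwards for the end-of-central-directory signature PK\x05\x06 and unpack the total-entries field at its documented offset (this sketch ignores zip64 archives):

import struct

def count_entries(buf):
    # The end of central directory record begins with the signature PK\x05\x06 and
    # sits at the very end of the file, followed only by its variable-length comment,
    # so search backwards from the end of the buffer.
    pos = buf.rfind(b'PK\x05\x06')
    if pos == -1:
        raise ValueError('end of central directory record not found; read more bytes')
    # Offset 10 within the record holds the 2-byte
    # "total number of entries in the central directory" field.
    return struct.unpack_from('<H', buf, pos + 10)[0]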

This is not possible.

You can upload files to Amazon S3 and you can download files. You can query the list of objects and obtain metadata about the objects. However, Amazon S3 does not provide compute, such as zip compression/decompression.

You would need to write a program that:

  • Downloads the zip file
  • Extracts the files
  • Does actions on the files (a rough sketch of such a program follows this list)
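A minimal sketch of such a program, assuming boto3 and hypothetical bucket and prefix names, and that each archive fits in memory (otherwise download to the instance's local disk instead):

import io
import zipfile
import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'  # hypothetical name

# Walk the Mydata/ prefix, download each zip, extract it in memory, act on the members.
for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket, Prefix='Mydata/'):
    for item in page.get('Contents', []):
        if not item['Key'].endswith('.zip'):
            continue
        body = s3.get_object(Bucket=bucket, Key=item['Key'])['Body'].read()
        with zipfile.ZipFile(io.BytesIO(body)) as zf:
            for name in zf.namelist():
                data = zf.read(name)
                # "do action on the files" goes here; as an example, re-upload
                # the extracted member under the Extracteddata/ prefix
                s3.put_object(Bucket=bucket, Key=f'Extracteddata/{name}', Body=data)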

This is probably best done on an Amazon EC2 instance, which would have low-latency access to Amazon S3. You could do it with an AWS Lambda function, but it has a limit of 500MB disk storage and 5 minutes of execution, which doesn't seem applicable to your situation.

If you are particularly clever, you might be able to download part of each zip file (a 'ranged get') and interpret the zipfile header to obtain a listing of the files and their sizes, thus avoiding having to download the whole file.
