
Python S3 download zip file

I have zip files uploaded to S3. I'd like to download them for processing. I don't need to store them permanently, only process them temporarily. How would I go about doing this?

Because working software > comprehensive documentation:

Boto2

import io
import zipfile

import boto
from boto.s3.key import Key

# Connect to S3.
# This needs your S3 credentials to be set up,
# e.g. with `aws configure` using the AWS CLI.
#
# See: https://aws.amazon.com/cli/
conn = boto.connect_s3()

# Get hold of the bucket
bucket = conn.get_bucket("my_bucket_name")

# Get hold of a given file
key = Key(bucket)
key.key = "my_s3_object_key"

# Create an in-memory bytes IO buffer
with io.BytesIO() as b:

    # Read the file into it
    key.get_file(b)

    # Reset the file pointer to the beginning
    b.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(b, mode='r') as zipf:
        for subfile in zipf.namelist():
            do_stuff_with_subfile()  # placeholder for your own processing

Boto3

import zipfile
import boto3
import io

# This is just for demo purposes; real code should use the
# environment variables or a config file for credentials.
#
# See: http://boto3.readthedocs.org/en/latest/guide/configuration.html

session = boto3.session.Session(
    aws_access_key_id="ACCESSKEY", 
    aws_secret_access_key="SECRETKEY"
)

s3 = session.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')
obj = bucket.Object('smsspamcollection.zip')

with io.BytesIO(obj.get()["Body"].read()) as tf:

    # rewind the buffer (a fresh BytesIO already starts at 0, so this is defensive)
    tf.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)

Tested on macOS with Python 3.
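
As a variation, boto3 can also stream the object straight into the buffer with Bucket.download_fileobj, avoiding the intermediate bytes object built by read(). A minimal sketch, assuming the same session/credentials setup and the same bucket and key as above:

import io
import zipfile

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')

with io.BytesIO() as tf:
    # Stream the object directly into the in-memory buffer
    bucket.download_fileobj('smsspamcollection.zip', tf)
    tf.seek(0)  # rewind before handing the buffer to zipfile

    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)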

If speed is a concern, a good approach is to choose an EC2 instance in the same region as your S3 bucket (i.e., fairly close to it) and use that instance to unzip and process your zipped files.

This reduces latency and lets you process the files fairly efficiently. You can remove each extracted file after finishing your work, as in the sketch after the note below.

Note: this will only work if you are fine with using EC2 instances.
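
A minimal sketch of that flow with boto3, assuming the instance gets credentials from an instance role; the bucket name, key, and paths are placeholders:

import os
import zipfile

import boto3

s3 = boto3.client("s3")
local_zip = "/tmp/archive.zip"

# Download the archive from the same-region bucket onto the instance.
s3.download_file("my_bucket_name", "my_s3_object_key", local_zip)

with zipfile.ZipFile(local_zip) as zipf:
    for info in zipf.infolist():
        if info.is_dir():
            continue  # skip directory entries
        extracted = zipf.extract(info, path="/tmp/extracted")
        # ... process the extracted file here ...
        os.remove(extracted)  # remove each file after finishing with it

os.remove(local_zip)  # clean up the downloaded archive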

You may have heard of boto, which is the Python interface to Amazon Web Services.

You can get a key from S3 into a local file.

import os

import boto
from zipfile import ZipFile

s3 = boto.connect_s3()                    # connect
bucket = s3.get_bucket(bucket_name)       # get bucket
key = bucket.get_key(key_name)            # get key (the file in S3)
key.get_contents_to_filename(local_name)  # download to a temporary local file

with ZipFile(local_name, 'r') as myzip:
    pass  # do something with myzip

os.unlink(local_name)  # delete the local file

You can also use tempfile. For more detail, see create & read from tempfile; a sketch of that approach follows.
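
Here is a minimal sketch of the tempfile variant, reusing the placeholder bucket and key names from the boto2 code above. The temporary file is deleted automatically when the with block exits, so no explicit os.unlink() is needed:

import tempfile
import zipfile

import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket("my_bucket_name")
key = bucket.get_key("my_s3_object_key")

# TemporaryFile is removed from disk as soon as the block exits.
with tempfile.TemporaryFile() as fp:
    key.get_file(fp)  # download the S3 object into the temporary file
    fp.seek(0)        # rewind before reading
    with zipfile.ZipFile(fp, mode='r') as myzip:
        print(myzip.namelist())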

Reading a specific file from a zip file in an S3 bucket.

import io
import os
import zipfile

import boto3


'''
When you configure awscli, you'll set up a credentials file located at
~/.aws/credentials. By default, this file will be used by Boto3 to authenticate.
'''
os.environ['AWS_PROFILE'] = "<profile_name>"
os.environ['AWS_DEFAULT_REGION'] = "<region_name>"

# Let's use Amazon S3
s3_name = "<bucket_name>"
zip_file_name = "<zip_file_name>"
file_to_open = "<file_to_open>"
s3 = boto3.resource('s3')
obj = s3.Object(s3_name, zip_file_name)

with io.BytesIO(obj.get()["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        file_contents = zipf.read(file_to_open).decode("utf-8")
        print(file_contents)

Adapted from @brice's answer.

Pandas provides a shortcut for this, which removes most of the code from the top answer and lets you stay agnostic about whether your file path is on S3, GCP, or your local machine. Note that this relies on pd.io.parsers.get_filepath_or_buffer, an internal pandas helper that was removed in later pandas releases, so it only works on older versions.

import io
import zipfile

import pandas as pd

obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]
with io.BytesIO(obj.read()) as byte_stream:
    # Use your byte stream to, for example, print the file names...
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)

Adding on to @brice's answer


Here is the code if you want to read any data inside the zipped file line by line:

with zipfile.ZipFile(tf, mode='r') as zipf:
    for line in zipf.read("xyz.csv").split(b"\n"):
        print(line)
        break # to break off after the first line

Hope this helps!
