
Python S3 download zip file

I have zip files uploaded to S3. I'd like to download them for processing. I don't need to store them permanently, only process them temporarily. How would I go about doing this?

Because working software > comprehensive documentation:

Boto2

import io
import zipfile

import boto
from boto.s3.key import Key

# Connect to S3.
# This needs your S3 credentials to be set up,
# e.g. with `aws configure` using the AWS CLI.
#
# See: https://aws.amazon.com/cli/
conn = boto.connect_s3()

# Get hold of the bucket
bucket = conn.get_bucket("my_bucket_name")

# Get hold of a given file
key = Key(bucket)
key.key = "my_s3_object_key"

# Create an in-memory bytes IO buffer
with io.BytesIO() as b:

    # Read the file into it
    key.get_file(b)

    # Reset the file pointer to the beginning
    b.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(b, mode='r') as zipf:
        for subfile in zipf.namelist():
            do_stuff_with_subfile()  # placeholder for your own processing

Boto3

import zipfile
import boto3
import io

# This is just for demo purposes; real code should use the
# environment variables or a config file for credentials.
#
# See: http://boto3.readthedocs.org/en/latest/guide/configuration.html

session = boto3.session.Session(
    aws_access_key_id="ACCESSKEY", 
    aws_secret_access_key="SECRETKEY"
)

s3 = session.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')
obj = bucket.Object('smsspamcollection.zip')

with io.BytesIO(obj.get()["Body"].read()) as tf:

    # rewind the buffer (a fresh BytesIO already starts at 0, so this is defensive)
    tf.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)

Tested on macOS with Python 3.
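
As a variation, boto3 can also stream the object straight into the buffer with Bucket.download_fileobj, avoiding the intermediate bytes object built by read(). A minimal sketch, assuming the same session/credentials setup and the same bucket and key as above:

import io
import zipfile

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')

with io.BytesIO() as tf:
    # Stream the object directly into the in-memory buffer
    bucket.download_fileobj('smsspamcollection.zip', tf)
    tf.seek(0)  # rewind before handing the buffer to zipfile

    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)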

If speed is a concern, a good approach is to choose an EC2 instance in the same region as your S3 bucket (i.e., fairly close to it) and use that instance to unzip and process your zipped files.

This reduces latency and lets you process the files fairly efficiently. You can remove each extracted file after finishing your work, as in the sketch after the note below.

Note: this will only work if you are fine with using EC2 instances.
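
A minimal sketch of that flow with boto3, assuming the instance gets credentials from an instance role; the bucket name, key, and paths are placeholders:

import os
import zipfile

import boto3

s3 = boto3.client("s3")
local_zip = "/tmp/archive.zip"

# Download the archive from the same-region bucket onto the instance.
s3.download_file("my_bucket_name", "my_s3_object_key", local_zip)

with zipfile.ZipFile(local_zip) as zipf:
    for info in zipf.infolist():
        if info.is_dir():
            continue  # skip directory entries
        extracted = zipf.extract(info, path="/tmp/extracted")
        # ... process the extracted file here ...
        os.remove(extracted)  # remove each file after finishing with it

os.remove(local_zip)  # clean up the downloaded archive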

You may have heard of boto, which is the Python interface to Amazon Web Services.

You can get a key from S3 into a local file.

import os

import boto
from zipfile import ZipFile

s3 = boto.connect_s3()                    # connect
bucket = s3.get_bucket(bucket_name)       # get bucket
key = bucket.get_key(key_name)            # get key (the file in S3)
key.get_contents_to_filename(local_name)  # download to a temporary local file

with ZipFile(local_name, 'r') as myzip:
    pass  # do something with myzip

os.unlink(local_name)  # delete the local file

You can also use tempfile. For more detail, see create & read from tempfile; a sketch of that approach follows.
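
Here is a minimal sketch of the tempfile variant, reusing the placeholder bucket and key names from the boto2 code above. The temporary file is deleted automatically when the with block exits, so no explicit os.unlink() is needed:

import tempfile
import zipfile

import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket("my_bucket_name")
key = bucket.get_key("my_s3_object_key")

# TemporaryFile is removed from disk as soon as the block exits.
with tempfile.TemporaryFile() as fp:
    key.get_file(fp)  # download the S3 object into the temporary file
    fp.seek(0)        # rewind before reading
    with zipfile.ZipFile(fp, mode='r') as myzip:
        print(myzip.namelist())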

Reading a specific file from a zip file in an S3 bucket.

import io
import os
import zipfile

import boto3


'''
When you configure awscli, you'll set up a credentials file located at
~/.aws/credentials. By default, this file will be used by Boto3 to authenticate.
'''
os.environ['AWS_PROFILE'] = "<profile_name>"
os.environ['AWS_DEFAULT_REGION'] = "<region_name>"

# Let's use Amazon S3
s3_name = "<bucket_name>"
zip_file_name = "<zip_file_name>"
file_to_open = "<file_to_open>"
s3 = boto3.resource('s3')
obj = s3.Object(s3_name, zip_file_name)

with io.BytesIO(obj.get()["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        file_contents = zipf.read(file_to_open).decode("utf-8")
        print(file_contents)

Adapted from @brice's answer.

Pandas provides a shortcut for this, which removes most of the code from the top answer and lets you stay agnostic about whether your file path is on S3, GCP, or your local machine. Note that this relies on pd.io.parsers.get_filepath_or_buffer, an internal pandas helper that was removed in later pandas releases, so it only works on older versions.

import io
import zipfile

import pandas as pd

obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]
with io.BytesIO(obj.read()) as byte_stream:
    # Use your byte stream to, for example, print the file names...
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)

Adding on to @brice's answer


Here is the code if you want to read any data inside the zipped file line by line:

with zipfile.ZipFile(tf, mode='r') as zipf:
    for line in zipf.read("xyz.csv").split(b"\n"):
        print(line)
        break # to break off after the first line

Hope this helps!
