Python S3 download zip file
I have zip files uploaded to S3. I'd like to download them for processing. I don't need to permanently store them, but I need to temporarily process them. How would I go about doing this?
Because working software > comprehensive documentation:
Boto2:

import zipfile
import boto
import boto.s3.key
import io

# Connect to s3
# This will need your s3 credentials to be set up
# with `aws configure` using the aws CLI.
#
# See: https://aws.amazon.com/cli/
conn = boto.connect_s3()

# get hold of the bucket
bucket = conn.get_bucket("my_bucket_name")

# Get hold of a given file
key = boto.s3.key.Key(bucket)
key.key = "my_s3_object_key"

# Create an in-memory bytes IO buffer
with io.BytesIO() as b:
    # Read the file into it
    key.get_file(b)

    # Reset the file pointer to the beginning
    b.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(b, mode='r') as zipf:
        for subfile in zipf.namelist():
            do_stuff_with_subfile()
Boto3:

import zipfile
import boto3
import io

# this is just to demo. real use should use the config
# environment variables or config file.
#
# See: http://boto3.readthedocs.org/en/latest/guide/configuration.html
session = boto3.session.Session(
    aws_access_key_id="ACCESSKEY",
    aws_secret_access_key="SECRETKEY"
)
s3 = session.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')
obj = bucket.Object('smsspamcollection.zip')

with io.BytesIO(obj.get()["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)
Tested on MacOSX with Python3.
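If the archive members are large, you can also stream each one out of the in-memory buffer with `zipf.open()` instead of extracting everything, so the whole member is never decompressed into memory at once. A minimal sketch, using a locally built zip as a stand-in for the bytes downloaded from S3 (the member name and contents here are made up for illustration):

```python
import io
import zipfile

# Build a small zip in memory as a stand-in for the S3 download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as zw:
    zw.writestr("hello.txt", "hello\nworld\n")

buf.seek(0)
with zipfile.ZipFile(buf, mode='r') as zipf:
    for name in zipf.namelist():
        # zipf.open() returns a file-like object that decompresses lazily
        with zipf.open(name) as member:
            first_chunk = member.read(5)
            print(name, first_chunk)
```

In real use, `buf` would be the `io.BytesIO` holding the object body fetched from S3, exactly as in the answer above.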
If speed is a concern, a good approach would be to choose an EC2 instance fairly close to your S3 bucket (in the same region) and use that instance to unzip/process your zipped files.
This will reduce latency and allow you to process them fairly efficiently. You can remove each extracted file after finishing your work.
Note: This will only work if you are fine using EC2 instances.
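The download-extract-process-clean-up loop described above can be sketched as follows. The bucket, key, and paths are hypothetical placeholders; on the EC2 instance you would first fetch the archive with boto3 and then hand the local path to the helper:

```python
import os
import shutil
import tempfile
import zipfile

def process_zip(local_zip_path, handle_file):
    """Extract a zip to a scratch directory, process each member, then clean up."""
    workdir = tempfile.mkdtemp()
    try:
        with zipfile.ZipFile(local_zip_path, 'r') as zipf:
            zipf.extractall(workdir)
            for name in zipf.namelist():
                handle_file(os.path.join(workdir, name))
    finally:
        # Remove everything that was extracted once processing is done.
        shutil.rmtree(workdir)

# On the EC2 instance you would first fetch the archive, e.g. with boto3:
#   boto3.client('s3').download_file('my-bucket', 'archive.zip', '/tmp/archive.zip')
# and then:
#   process_zip('/tmp/archive.zip', my_handler)
```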
I believe you have heard of boto, which is the Python interface to Amazon Web Services. You can fetch a key from s3 into a local file.
import os
import boto
from zipfile import ZipFile

s3 = boto.connect_s3()               # connect
bucket = s3.get_bucket(bucket_name)  # get bucket
key = bucket.get_key(key_name)       # get key (the file in s3)
key.get_file(local_name)             # set this to a temporary file

with ZipFile(local_name, 'r') as myzip:
    pass  # do something with myzip

os.unlink(local_name)                # delete it
You can also use tempfile. For more detail, see create & read from tempfile.
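A sketch of the tempfile variant: a `NamedTemporaryFile` holds the archive and is deleted automatically when the context manager exits, so no explicit `os.unlink` is needed. The download step is only indicated in a comment; for illustration a tiny zip is written into the temp file instead:

```python
import tempfile
import zipfile

with tempfile.NamedTemporaryFile(suffix=".zip") as tmp:
    # In real use, download into the temp file here, e.g. (boto2):
    #   key.get_file(tmp)
    # For illustration, write a tiny zip into it instead.
    with zipfile.ZipFile(tmp, mode='w') as zw:
        zw.writestr("example.txt", "contents")

    tmp.seek(0)
    with zipfile.ZipFile(tmp, mode='r') as myzip:
        members = myzip.namelist()
        print(members)
# No cleanup needed: the file is gone once the with-block exits.
```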
import boto3
import os
import zipfile
import io

'''
When you configure awscli, you'll set up a credentials file located at
~/.aws/credentials. By default, this file will be used by Boto3 to authenticate.
'''
os.environ['AWS_PROFILE'] = "<profile_name>"
os.environ['AWS_DEFAULT_REGION'] = "<region_name>"

# Let's use Amazon S3
s3_name = "<bucket_name>"
zip_file_name = "<zip_file_name>"
file_to_open = "<file_to_open>"

s3 = boto3.resource('s3')
obj = s3.Object(s3_name, zip_file_name)

with io.BytesIO(obj.get()["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        file_contents = zipf.read(file_to_open).decode("utf-8")
        print(file_contents)
Reference from @brice's answer.
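Since the member is decoded as UTF-8 text here, the same pattern works for structured formats too; for example, a JSON member can be parsed directly with `json.loads`. A sketch, using an in-memory zip as a stand-in for the S3 object (the member name and payload are made up):

```python
import io
import json
import zipfile

# Stand-in for the S3 object's bytes: an in-memory zip with one JSON member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as zw:
    zw.writestr("config.json", json.dumps({"region": "us-east-1"}))

buf.seek(0)
with zipfile.ZipFile(buf, mode='r') as zipf:
    data = json.loads(zipf.read("config.json").decode("utf-8"))
    print(data["region"])
```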
Pandas provides a shortcut for this, which removes most of the code from the top answer, and allows you to be agnostic about whether your file path is on s3, gcp, or your local machine.
import io
import zipfile
import pandas as pd

obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]

with io.BytesIO(obj.read()) as byte_stream:
    # Use your byte stream, to, for example, print file names...
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)

Note that get_filepath_or_buffer is an internal pandas helper, so it may move or disappear in newer pandas versions.
Adding on to @brice's answer, here is the code if you want to read the data inside the file line by line:
with zipfile.ZipFile(tf, mode='r') as zipf:
    for line in zipf.read("xyz.csv").split(b"\n"):
        print(line)
        break  # to break off after the first line
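An alternative that avoids loading the whole member into memory: wrap `zipf.open()` in `io.TextIOWrapper` and iterate lines lazily. A sketch, with an in-memory zip standing in for the downloaded archive and "xyz.csv" reused as the member name from above:

```python
import io
import zipfile

# Stand-in for the downloaded archive: an in-memory zip with a CSV member.
tf = io.BytesIO()
with zipfile.ZipFile(tf, mode='w') as zw:
    zw.writestr("xyz.csv", "a,b\n1,2\n3,4\n")

tf.seek(0)
with zipfile.ZipFile(tf, mode='r') as zipf:
    # TextIOWrapper decodes and splits lines lazily, so the member is
    # never read into memory all at once.
    with io.TextIOWrapper(zipf.open("xyz.csv"), encoding="utf-8") as lines:
        for line in lines:
            print(line.rstrip("\n"))
```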
Hope this helps!