Reading the contents of a gzip file from AWS S3 in Python

Question

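I am trying to read some logs from a Hadoop process that I ran in AWS. The logs are stored in an S3 folder and have the following path.

bucketname = name key = y/z/stderr.gz Here y is the cluster ID and z is a folder name. Both act as folders (objects) in AWS, so the full path looks like x/y/z/stderr.gz.

Now I want to unzip this .gz file and read its contents. I don't want to download the file to my system; I want to keep the contents in a Python variable.

This is what I have tried so far.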

import boto3

s3 = boto3.resource("s3")
bucket_name = "name"
key = "y/z/stderr.gz"
obj = s3.Object(bucket_name, key)
n = obj.get()['Body'].read()

This gives me output in an unreadable format. I also tried

n = obj.get()['Body'].read().decode('utf-8')

This gives the error 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte.

I also tried

gzip = StringIO(obj)
gzipfile = gzip.GzipFile(fileobj=gzip)
content = gzipfile.read()

This returns the error IOError: Not a gzipped file

I am not sure how to decode this .gz file.

Edit - found a solution. I needed to pass n into it and use BytesIO

gzip = BytesIO(n)
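
For context, here is a minimal sketch of why the earlier attempts failed and why working on the raw bytes helps. Every gzip stream begins with the two bytes 0x1f 0x8b, and that 0x8b at position 1 is the byte the UTF-8 decoder complained about. The helper name gunzip_to_text and the sample data are made up for illustration, and the compressed bytes are built locally as a stand-in for obj.get()['Body'].read(), so no S3 access is needed to try it.

```python
import gzip


def gunzip_to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decompress gzip bytes and decode the result to text."""
    # A gzip stream starts with the magic bytes 0x1f 0x8b. Decoding the
    # compressed bytes directly as UTF-8 fails on that 0x8b.
    if raw[:2] != b"\x1f\x8b":
        raise ValueError("data does not start with the gzip magic bytes")
    return gzip.decompress(raw).decode(encoding)


# Stand-in for the bytes returned by obj.get()['Body'].read()
sample = gzip.compress("line one\nline two\n".encode("utf-8"))
text = gunzip_to_text(sample)
print(text)
```

The order matters: decompress first, then decode. The encoding is assumed to be UTF-8 here; logs written in another encoding would need a different argument.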

Answer 1

This is an old question, but you no longer need the intermediate BytesIO object (at least with my boto3==1.9.223 and python3.7)

import boto3
import gzip

s3 = boto3.resource("s3")
obj = s3.Object("YOUR_BUCKET_NAME", "path/to/your_key.gz")
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    content = gzipfile.read()
print(content)
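
If the log is large, the same idea can be extended to read it line by line instead of loading everything with read(). The following is only a sketch under assumptions: iter_gzip_lines is a hypothetical helper name, the log is assumed to be UTF-8 text, and an io.BytesIO object stands in for obj.get()["Body"] so the snippet runs without AWS credentials.

```python
import gzip
import io


def iter_gzip_lines(fileobj, encoding="utf-8"):
    """Yield decoded lines from a gzip-compressed binary file-like object."""
    with gzip.GzipFile(fileobj=fileobj) as gz:
        # TextIOWrapper decodes incrementally, so the whole file is never
        # held in memory at once.
        with io.TextIOWrapper(gz, encoding=encoding) as text_stream:
            for line in text_stream:
                yield line.rstrip("\n")


# Simulated S3 body; with boto3 this would be obj.get()["Body"]
fake_body = io.BytesIO(gzip.compress(b"INFO start\nERROR boom\nINFO done\n"))
errors = [line for line in iter_gzip_lines(fake_body) if line.startswith("ERROR")]
print(errors)
```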

Answer 2

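@Amit, I was trying to do the same thing to test decoding a file, and got your code running with a few modifications. I only had to remove the function def and the return, and rename the gzip variable, since that name was already in use.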

import gzip
from io import BytesIO

import boto3

try:
    s3 = boto3.resource('s3')
    key = 'YOUR_FILE_NAME.gz'
    obj = s3.Object('YOUR_BUCKET_NAME', key)
    n = obj.get()['Body'].read()               # raw gzip bytes
    buffer = BytesIO(n)                        # file-like wrapper around the bytes
    gzipfile = gzip.GzipFile(fileobj=buffer)
    content = gzipfile.read()                  # decompressed bytes
    print(content)
except Exception as e:
    print(e)
    raise

Answer 3

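You can read gzip content with AWS S3 Select (SelectObjectContent).

S3 Select is an Amazon S3 feature designed to pull only the data you need out of an object, which can significantly improve performance and reduce cost for applications that need to access data in S3.

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format, and supports GZIP or BZIP2 compression for CSV and JSON objects.

Reference: https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html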

from io import StringIO

import boto3
import pandas as pd

bucket = 'my-bucket'
prefix = 'my-prefix'

client = boto3.client('s3')

# .get() avoids a KeyError when no object matches the prefix
for s3_object in client.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', []):
    if s3_object['Size'] <= 0:
        continue

    print(s3_object['Key'])
    response = client.select_object_content(
        Bucket=bucket,
        Key=s3_object['Key'],
        ExpressionType='SQL',
        Expression="select * from s3object",
        InputSerialization={'CompressionType': 'GZIP', 'JSON': {'Type': 'DOCUMENT'}},
        OutputSerialization={'CSV': {'QuoteFields': 'ASNEEDED', 'RecordDelimiter': '\n', 'FieldDelimiter': ',', 'QuoteCharacter': '"', 'QuoteEscapeCharacter': '"'}},
    )

    # Collect all Records chunks before parsing: a chunk may end in the middle of a row
    chunks = [event['Records']['Payload'] for event in response['Payload'] if 'Records' in event]
    payload = b''.join(chunks).decode('utf-8')
    try:
        # on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, removed in pandas 2.0
        select_df = pd.read_csv(StringIO(payload), on_bad_lines='skip')
        for row in select_df.iterrows():
            print(row)
    except Exception as e:
        print(e)

Answer 4

Reading a .bz2 file from AWS S3 in Python

import bz2

import boto3

try:
    s3 = boto3.resource('s3')
    key = 'key_name.bz2'
    obj = s3.Object('bucket_name', key)
    compressed = obj.get()['Body'].read()         # raw bz2 bytes
    content = bz2.decompress(compressed)          # decompressed bytes
    lines = content.decode('utf-8').split('\n')   # assumes the file is UTF-8 text
    print(len(lines))
except Exception as e:
    print(e)
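
When a bucket holds both .gz and .bz2 objects, one possible approach is to pick the decompressor from the leading magic bytes rather than from the key name. This is only a sketch: decompress_auto is a hypothetical helper, the payload is built locally instead of being fetched from S3, and data with neither signature is assumed to be uncompressed already.

```python
import bz2
import gzip


def decompress_auto(raw: bytes) -> bytes:
    """Choose gzip or bz2 from the magic bytes; return other data unchanged."""
    if raw[:2] == b"\x1f\x8b":   # gzip signature
        return gzip.decompress(raw)
    if raw[:3] == b"BZh":        # bz2 signature
        return bz2.decompress(raw)
    return raw                   # assumption: not compressed


payload = b"first\nsecond\n"
from_gzip = decompress_auto(gzip.compress(payload))
from_bz2 = decompress_auto(bz2.compress(payload))
print(from_gzip == from_bz2 == payload)
```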

Answer 5

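Just as we do with variables, data can be kept as bytes in an in-memory buffer when we use the BytesIO class of the io module.

Here is a sample program to demonstrate this:

import io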

stream_str = io.BytesIO(b"JournalDev Python: \x00\x01")
print(stream_str.getvalue())

The getvalue() function returns the contents of the buffer as a bytes object.

So @Jean-FrançoisFabre's answer is correct, and you should use

gzip = BytesIO(n)

For more information, read the following documentation:

https://docs.python.org/3/library/io.html
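
To make the role of BytesIO more concrete, here is a small round-trip sketch that never touches disk or S3. It writes a gzip stream into an in-memory buffer and then reads it back by wrapping the resulting bytes in a new BytesIO, which is the same shape as the fix in the question. The variable names and sample text are illustrative only.

```python
import gzip
import io

# Write a gzip stream into an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(b"hello from memory\n")

compressed = buffer.getvalue()  # bytes, not str

# Read it back: wrap the bytes in BytesIO and hand that to GzipFile.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
    restored = gz.read()

print(restored)
```

A buffer built this way is also handy for testing decompression code without a real bucket.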

Answer 6

At present the file can be read as follows:

import pandas as pd

role = 'role name'
bucket = 'bucket name'
data_key = 'data key'
# Reading s3:// URLs with pandas requires the s3fs package to be installed
data_location = 's3://{}/{}'.format(bucket, data_key)
data = pd.read_csv(data_location, compression='gzip', header=0, sep=',', quotechar='"')

Answer 7

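I was also stuck on reading the contents of a gzipped CSV file from S3 and got the same error, but eventually found a way to read it with gzip.GzipFile and iterate over its rows with csv.reader: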

import csv
import gzip
from io import StringIO

# bucket, folder_prefix and process_line are defined elsewhere
for obj in bucket.objects.filter(Prefix=folder_prefix):
    if obj.key.endswith(".gz"):
        with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipped_csv_file:
            csv_reader = csv.reader(StringIO(gzipped_csv_file.read().decode()))
            for line in csv_reader:
                process_line(line)
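
A possible variation, sketched under assumptions, passes the decompressed stream to csv.reader directly instead of reading and decoding the whole file first. Opening the text wrapper with newline="" lets the csv module handle line endings itself, so quoted fields containing newlines stay intact. iter_gzip_csv_rows is a hypothetical helper, the data is assumed to be UTF-8, and io.BytesIO stands in for obj.get()["Body"].

```python
import csv
import gzip
import io


def iter_gzip_csv_rows(fileobj, encoding="utf-8"):
    """Yield CSV rows from a gzip-compressed binary file-like object."""
    with gzip.GzipFile(fileobj=fileobj) as gz:
        # newline="" disables newline translation so csv.reader sees the
        # original line endings, including those inside quoted fields.
        with io.TextIOWrapper(gz, encoding=encoding, newline="") as text_stream:
            yield from csv.reader(text_stream)


# Simulated S3 body containing one quoted field with an embedded newline
fake_body = io.BytesIO(gzip.compress(b'id,note\n1,"two\nlines"\n2,plain\n'))
rows = list(iter_gzip_csv_rows(fake_body))
print(rows)
```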

7 answers

Answer 1: score 27, 2020-01-07 19:59:03
Answer 2: score 17, 2018-12-04 21:15:44
Answer 3: score 9, 2019-06-03 08:49:57
Answer 4: score 1, 2019-03-04 14:06:24
Answer 5: score 0, 2018-09-28 13:16:38
Answer 6: score 0, 2019-08-15 18:56:25
Answer 7: score 0, 2022-11-17 16:20:01
