
How to read content from an S3 bucket given a URL

I have an S3 URL, shown below:

s3_filename is s3://xx/xx/y/z/ion.csv

If I already have the bucket and key separately, I can read it with the code below:

import boto3
import pandas as pd

def read_s3(bucket, key):
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(obj['Body'])
    return df
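For reference, the call would look like this once the URL is split into bucket and key (hypothetical values, assuming the first path segment is the bucket; the split itself is what this question is asking about):

df = read_s3('xx', 'xx/y/z/ion.csv')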

Since you appear to be using pandas, note that it actually uses s3fs under the hood. So, if your install is relatively recent and standard, you can directly do:

df = pd.read_csv(s3_path)
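Here s3_path is the full URL from the question (s3://xx/xx/y/z/ion.csv). Note that this relies on s3fs being installed; if it is missing, pandas raises an ImportError when handed an s3:// path, so you may need to pip install s3fs first.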

If you have some specific config for your bucket, for example special credentials, KMS encryption, etc., you may use an explicitly configured s3fs filesystem, for example:

import s3fs

fs = s3fs.S3FileSystem(
    key=my_aws_access_key_id,
    secret=my_aws_secret_access_key,
    s3_additional_kwargs={
        'ServerSideEncryption': 'aws:kms',
        'SSEKMSKeyId': my_kms_key,
    },
)
# note: KMS encryption only used when writing; when reading, it is automatic if you have access

with fs.open(s3_path, 'r') as f:
    df = pd.read_csv(f)

# here we write the same df at a different location, making sure
# it is using my_kms_key:
with fs.open(out_s3_path, 'w') as f:
    df.to_csv(f)
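This works because pd.read_csv and df.to_csv accept any file-like object, so the handles returned by fs.open can be passed straight through; the explicitly configured filesystem is what ensures the write above goes out encrypted with my_kms_key.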

That said, if you really want to handle getting the object yourself, and the question is just about how to strip a potential s3:// prefix and then split bucket/key, you could simply use:

import re

bucket, key = re.sub(r'^s3://', '', s3_path).split('/', 1)
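With the path from the question, this plugs straight into the read_s3 helper above (a minimal sketch):

s3_path = 's3://xx/xx/y/z/ion.csv'
bucket, key = re.sub(r'^s3://', '', s3_path).split('/', 1)
# bucket == 'xx', key == 'xx/y/z/ion.csv'
df = read_s3(bucket, key)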

But that may miss more general cases and conventions handled by tools such as awscli, or by s3fs itself as referenced above.

For more generality, you can take a look at how this is handled in awscli. In general, doing so gives a good indication of whether some functionality is already built into boto3 or botocore. In this case, however, it would appear not (looking at a local clone of release-1.18.126): they simply do it from first principles; see awscli.customizations.s3.utils.split_s3_bucket_key as implemented in that module.

From the regex that is eventually used in that code, you can infer that the kinds of paths awscli accepts for s3_path are quite diverse indeed:

import re

_S3_ACCESSPOINT_TO_BUCKET_KEY_REGEX = re.compile(
    r'^(?P<bucket>arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[:/][^/]+)/?'
    r'(?P<key>.*)$'
)
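As a quick illustration (hypothetical access-point ARN, using the regex defined above), this is how such a path splits into bucket and key:

arn_path = 'arn:aws:s3:us-east-1:123456789012:accesspoint/my-ap/object/ion.csv'
m = _S3_ACCESSPOINT_TO_BUCKET_KEY_REGEX.match(arn_path)
# m.group('bucket') -> 'arn:aws:s3:us-east-1:123456789012:accesspoint/my-ap'
# m.group('key')    -> 'object/ion.csv'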
