
Read CSV from Amazon S3 using Python 2.7

I can easily get the bucket name from S3, but when I try to read the CSV file from S3, it gives an error every time.

import boto3
import pandas as pd

s3 = boto3.client('s3',
         aws_access_key_id='yyyyyyyy',
         aws_secret_access_key='xxxxxxxxxxx')
# Call S3 to list current buckets
response = s3.list_buckets()
for bucket in response['Buckets']:
    print bucket['Name']

Output:
s3-bucket-data


import pandas as pd
import StringIO
from boto.s3.connection import S3Connection

AWS_KEY = 'yyyyyyyyyy'
AWS_SECRET = 'xxxxxxxxxx'
aws_connection = S3Connection(AWS_KEY, AWS_SECRET)
bucket = aws_connection.get_bucket('s3-bucket-data')

fileName = "data.csv"

content = bucket.get_key(fileName).get_contents_as_string()
reader = pd.read_csv(StringIO.StringIO(content))

This gives the error:

boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request

How can I read the CSV from S3?

You can use the s3fs package.

s3fs also supports AWS profiles defined in credential files.

Here is an example (you don't have to read the file in chunks, but I had this example handy):

import os
import pandas as pd
import s3fs
import gzip

chunksize = 999999
usecols = ["Col1", "Col2"]

filename = 'some_csv_file.csv.gz'
s3_bucket_name = 'some_bucket_name'

AWS_KEY = 'yyyyyyyyyy'
AWS_SECRET = 'xxxxxxxxxx'
s3f = s3fs.S3FileSystem(
    anon=False,
    key=AWS_KEY,
    secret=AWS_SECRET)

# or if you have a profile defined in credentials file:
#aws_shared_credentials_file = 'path/to/aws/credentials/file/'
#os.environ['AWS_SHARED_CREDENTIALS_FILE'] = aws_shared_credentials_file
#s3f = s3fs.S3FileSystem(
#    anon=False,
#    profile_name=s3_profile)

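# S3 keys always use forward slashes, so build the path explicitly
# instead of relying on os.path.join (which uses "\" on Windows).
filepath = '{}/{}'.format(s3_bucket_name, filename)
with s3f.open(filepath, 'rb') as f:
    gz = gzip.GzipFile(fileobj=f)  # Decompress data with gzip

    # With chunksize set, read_csv returns an iterator of DataFrames
    chunks = pd.read_csv(gz,
                         usecols=usecols,
                         chunksize=chunksize)

    # Each chunk holds the next block of rows, so stack them vertically
    df = pd.concat(list(chunks), axis=0)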
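If you only need to stream rows and don't want pandas in the loop, the same "open, gunzip, read in chunks" pattern can be sketched with the standard library. This is only an illustration, written for Python 3 (unlike the Python 2.7 code in this thread): `iter_csv_chunks` is a hypothetical helper name, and an in-memory buffer stands in for the file object that `s3f.open(filepath, 'rb')` would return.

```python
import csv
import gzip
import io
from itertools import islice


def iter_csv_chunks(fileobj, chunksize, usecols=None, encoding="utf-8"):
    """Yield lists of row dicts from a gzip-compressed CSV file object.

    `fileobj` can be any binary file-like object, for example the handle
    returned by s3fs when a key is opened in 'rb' mode.
    """
    with gzip.GzipFile(fileobj=fileobj) as gz:
        text = io.TextIOWrapper(gz, encoding=encoding, newline="")
        reader = csv.DictReader(text)
        while True:
            # Pull at most `chunksize` rows from the reader
            chunk = list(islice(reader, chunksize))
            if not chunk:
                break
            if usecols is not None:
                # Keep only the requested columns
                chunk = [{col: row[col] for col in usecols} for row in chunk]
            yield chunk


# Stand-in data: a small gzipped CSV held in memory instead of on S3
raw = b"Col1,Col2,Col3\n1,a,x\n2,b,y\n3,c,z\n"
buf = io.BytesIO(gzip.compress(raw))

chunks = list(iter_csv_chunks(buf, chunksize=2, usecols=["Col1", "Col2"]))
# Chunks are consecutive blocks of rows, so they are joined end to end
rows = [row for chunk in chunks for row in chunk]
```

Note that the chunks are appended one after another (row-wise), which is the same reason the pandas version should concatenate along the row axis.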

boto is one thing I love when it comes to handling data on S3 with Python.

Install boto using pip install boto.

import boto
from boto.s3.key import Key

keyId ="your_aws_key_id"
sKeyId="your_aws_secret_key_id"
srcFileName="abc.txt" # filename on S3
destFileName="s3_abc.txt" # output file name
bucketName="mybucket001" # S3 bucket name 

conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)

#Get the Key object of the given key, in the bucket
k = Key(bucket,srcFileName)

#Get the contents of the key into a file 
k.get_contents_to_filename(destFileName)

I experienced this issue with a few AWS regions. I created a bucket in "us-east-1" and the following code worked fine:

import boto
from boto.s3.key import Key
import StringIO
import pandas as pd
keyId ="xxxxxxxxxxxxxxxxxx"
sKeyId="yyyyyyyyyyyyyyyyyy"
srcFileName="zzzzz.csv"
bucketName="elasticbeanstalk-us-east-1-aaaaaaaaaaaa"

conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)
k = Key(bucket,srcFileName)
content = k.get_contents_as_string()
reader = pd.read_csv(StringIO.StringIO(content))

Try creating a new bucket in us-east-1 and see if it works.
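If moving the bucket is not an option, another thing to try is pointing the client at the region the bucket already lives in. As far as I know, a 400 Bad Request is often a region or signature mismatch, since some regions only accept Signature Version 4. Below is a rough sketch using boto3 rather than boto, written for Python 3. `normalize_location` and `client_for_bucket` are hypothetical helper names, and I have not verified this against the asker's bucket.

```python
def normalize_location(location_constraint):
    """Turn the LocationConstraint from get_bucket_location into a region name.

    S3 reports None for buckets in us-east-1 and may report the legacy
    value 'EU' for older buckets in eu-west-1.
    """
    if location_constraint is None:
        return "us-east-1"
    if location_constraint == "EU":
        return "eu-west-1"
    return location_constraint


def client_for_bucket(bucket_name, **credentials):
    """Return a boto3 S3 client bound to the region that hosts the bucket.

    `credentials` are passed through to boto3.client, for example
    aws_access_key_id=... and aws_secret_access_key=...
    """
    # Imported lazily so normalize_location can be used without boto3
    import boto3
    from botocore.client import Config

    # Ask S3 where the bucket lives, using a client in us-east-1
    probe = boto3.client("s3", region_name="us-east-1", **credentials)
    constraint = probe.get_bucket_location(Bucket=bucket_name)["LocationConstraint"]

    # Build a second client for that region, signing with Signature Version 4
    return boto3.client(
        "s3",
        region_name=normalize_location(constraint),
        config=Config(signature_version="s3v4"),
        **credentials
    )
```

Usage would be something like `s3 = client_for_bucket('s3-bucket-data', aws_access_key_id=..., aws_secret_access_key=...)` followed by `s3.get_object(...)`.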

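Try the following:

import io

import boto3
import pandas as pd
from botocore.client import Config

# Replace the 'XXXX' placeholders with the bucket's region and the
# signature version to use (for example 's3v4' for Signature Version 4).
session = boto3.session.Session(region_name='XXXX')
s3client = session.client('s3', config=Config(signature_version='XXXX'))
response = s3client.get_object(Bucket='myBucket', Key='myKey')

# The body is a byte stream; wrap it so pandas can read it like a file
dataset = pd.read_csv(io.BytesIO(response['Body'].read()), encoding='utf8')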
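To check the parsing half of this without touching AWS, you can pass the client in as an argument and swap in a stub. A minimal sketch for Python 3, assuming nothing beyond the `get_object` call used above: `read_csv_rows` and `FakeS3Client` are hypothetical names, and the standard `csv` module is used here so the example has no dependencies (`pd.read_csv(io.BytesIO(data))` could be used on the same bytes instead).

```python
import csv
import io


def read_csv_rows(s3client, bucket, key, encoding="utf-8"):
    """Fetch an object via get_object and parse it into a list of row dicts."""
    response = s3client.get_object(Bucket=bucket, Key=key)
    data = response["Body"].read()  # raw bytes of the CSV file
    text = io.StringIO(data.decode(encoding), newline="")
    return list(csv.DictReader(text))


class FakeS3Client(object):
    """Stand-in for a boto3 S3 client, for offline testing only."""

    def __init__(self, objects):
        # Maps (bucket, key) pairs to the bytes stored under them
        self._objects = objects

    def get_object(self, Bucket, Key):
        # Mimic the part of the boto3 response that read_csv_rows uses
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


fake = FakeS3Client({("myBucket", "myKey"): b"id,name\n1,alice\n2,bob\n"})
rows = read_csv_rows(fake, "myBucket", "myKey")
```

With a real client, the call is the same: `read_csv_rows(s3client, 'myBucket', 'myKey')`.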
