
How to access an S3 file from Spark Python on EC2

I have an S3 file which I am trying to access from Python code. I submit the code on an EC2 instance via spark-submit. After starting the master and the slave, I do the submission with the following command:

 ./spark-submit --py-files /home/usr/spark-1.5.0/sbin/test_1.py

I get the following error: urllib2.HTTPError: HTTP Error 403: Forbidden

In test_1.py, I access the S3 file as follows:

import pandas as pd
import numpy as np
import boto

from boto.s3.connection import S3Connection

AWS_KEY = 'XXXXXXDDDDDD'
AWS_SECRET = 'pweqory83743rywiuedq'
aws_connection = S3Connection(AWS_KEY, AWS_SECRET)
bucket = aws_connection.get_bucket('BKT')
for file_key in bucket.list():
    print file_key.name
df = pd.read_csv('https://BKT.s3.amazonaws.com/test_1.csv')

The above code works well on my local machine. However, it does not work on the EC2 instance.

Please let me know if anyone has a solution.

You cannot access the file through that link because S3 objects are private by default. You can change the permissions on the object, or you can try this:

import pandas as pd
import StringIO
from boto.s3.connection import S3Connection

AWS_KEY = 'XXXXXXDDDDDD'
AWS_SECRET = 'pweqory83743rywiuedq'
aws_connection = S3Connection(AWS_KEY, AWS_SECRET)
bucket = aws_connection.get_bucket('BKT')

fileName = "test_1.csv"

# Option 1: save the file locally, then read it.
# Write in binary mode, since the S3 object is delivered as raw bytes.
with open(fileName, 'wb') as writer:
    bucket.get_key(fileName).get_file(writer)

with open(fileName, 'r') as reader:
    df = pd.read_csv(reader)

# Option 2: read the contents into memory without saving the file locally.
content = bucket.get_key(fileName).get_contents_as_string()
df = pd.read_csv(StringIO.StringIO(content))
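The second approach works because pd.read_csv accepts any file-like object, not just a path or URL. A minimal self-contained illustration of the same idea (the sample data is invented here; on Python 3 the StringIO class has moved into the standard io module):

```python
import io

import pandas as pd

# Stand-in for bucket.get_key(fileName).get_contents_as_string():
# a few CSV lines already held in memory as text.
content = "id,value\n1,10\n2,20\n3,30\n"

# Wrap the in-memory text in a file-like object and parse it directly,
# with no temporary file on disk.
df = pd.read_csv(io.StringIO(content))
print(df.shape)  # (3, 2)
```

If get_contents_as_string returns bytes rather than text (as it does on Python 3), wrap it in io.BytesIO instead.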
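A third option that keeps the plain-HTTPS read in the original question working without making the object public is a presigned URL. boto can produce one with Key.generate_url; the sketch below shows the underlying mechanism (AWS Signature Version 2 query-string authentication) using only the Python 3 standard library. The bucket, key, and credentials are placeholders, and this is an illustration of the mechanism rather than production code — current AWS deployments favor Signature Version 4:

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def presign_s3_url(bucket, key, access_key, secret_key, expires_in=300):
    """Build a time-limited S3 GET URL using AWS Signature Version 2."""
    expires = int(time.time()) + expires_in
    # StringToSign for query-string auth: verb, empty Content-MD5 and
    # Content-Type lines, the expiry timestamp, and the canonicalized
    # resource (/bucket/key).
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, key)
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    # The signature is the Base64-encoded HMAC-SHA1, URL-encoded so it
    # can ride in a query parameter.
    signature = quote(base64.b64encode(digest).decode("ascii"), safe="")
    return ("https://%s.s3.amazonaws.com/%s"
            "?AWSAccessKeyId=%s&Expires=%d&Signature=%s"
            % (bucket, key, access_key, expires, signature))

url = presign_s3_url("BKT", "test_1.csv", "XXXXXXDDDDDD", "pweqory83743rywiuedq")
```

pd.read_csv(url) then succeeds as long as the signing credentials are allowed to read the object and the URL has not expired.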
