
How to access an S3 file from Spark Python on EC2

I have an S3 file which I am trying to access from Python code. I submit the code on an EC2 instance via spark-submit. After starting the master and the slave, I do the submission with the following command:

 ./spark-submit --py-files /home/usr/spark-1.5.0/sbin/test_1.py

I get the following error: urllib2.HTTPError: HTTP Error 403: Forbidden

In test_1.py, I access the S3 file as follows:

import pandas as pd
import numpy as np
import boto

from boto.s3.connection import S3Connection

AWS_KEY = 'XXXXXXDDDDDD'
AWS_SECRET = 'pweqory83743rywiuedq'
aws_connection = S3Connection(AWS_KEY, AWS_SECRET)
bucket = aws_connection.get_bucket('BKT')
for file_key in bucket.list():
    print file_key.name
df = pd.read_csv('https://BKT.s3.amazonaws.com/test_1.csv')

The above code works well on my local machine. However, it does not work on the EC2 instance.

Please let me know if anyone has a solution.

You cannot access the file through that link because S3 objects are private by default. You can change the permissions on the object, or you can try this:

import pandas as pd
import StringIO
from boto.s3.connection import S3Connection

AWS_KEY = 'XXXXXXDDDDDD'
AWS_SECRET = 'pweqory83743rywiuedq'
aws_connection = S3Connection(AWS_KEY, AWS_SECRET)
bucket = aws_connection.get_bucket('BKT')

fileName = "test_1.csv"

# Option 1: save the file locally, then read it.
# Write in binary mode, since the S3 object is delivered as raw bytes.
with open(fileName, 'wb') as writer:
    bucket.get_key(fileName).get_file(writer)

with open(fileName, 'r') as reader:
    df = pd.read_csv(reader)

# Option 2: read the contents into memory without saving the file locally.
content = bucket.get_key(fileName).get_contents_as_string()
df = pd.read_csv(StringIO.StringIO(content))
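The second approach works because pd.read_csv accepts any file-like object, not just a path or URL. A minimal self-contained illustration of the same idea (the sample data is invented here; on Python 3 the StringIO class has moved into the standard io module):

```python
import io

import pandas as pd

# Stand-in for bucket.get_key(fileName).get_contents_as_string():
# a few CSV lines already held in memory as text.
content = "id,value\n1,10\n2,20\n3,30\n"

# Wrap the in-memory text in a file-like object and parse it directly,
# with no temporary file on disk.
df = pd.read_csv(io.StringIO(content))
print(df.shape)  # (3, 2)
```

If get_contents_as_string returns bytes rather than text (as it does on Python 3), wrap it in io.BytesIO instead.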
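A third option that keeps the plain-HTTPS read in the original question working without making the object public is a presigned URL. boto can produce one with Key.generate_url; the sketch below shows the underlying mechanism (AWS Signature Version 2 query-string authentication) using only the Python 3 standard library. The bucket, key, and credentials are placeholders, and this is an illustration of the mechanism rather than production code — current AWS deployments favor Signature Version 4:

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def presign_s3_url(bucket, key, access_key, secret_key, expires_in=300):
    """Build a time-limited S3 GET URL using AWS Signature Version 2."""
    expires = int(time.time()) + expires_in
    # StringToSign for query-string auth: verb, empty Content-MD5 and
    # Content-Type lines, the expiry timestamp, and the canonicalized
    # resource (/bucket/key).
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, key)
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    # The signature is the Base64-encoded HMAC-SHA1, URL-encoded so it
    # can ride in a query parameter.
    signature = quote(base64.b64encode(digest).decode("ascii"), safe="")
    return ("https://%s.s3.amazonaws.com/%s"
            "?AWSAccessKeyId=%s&Expires=%d&Signature=%s"
            % (bucket, key, access_key, expires, signature))

url = presign_s3_url("BKT", "test_1.csv", "XXXXXXDDDDDD", "pweqory83743rywiuedq")
```

pd.read_csv(url) then succeeds as long as the signing credentials are allowed to read the object and the URL has not expired.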
