I am currently trying to connect my Flask application to Amazon EMR using PySpark. I am following the AWS example ( https://docs.aws.amazon.com/code-samples/latest/catalog/code-catalog-python-example_code-emr.html ) for the PySpark job. I use the following code to write the output:
df.write.mode('overwrite').csv('s3://my-bucket/output')
The output files from Amazon EMR are stored in Amazon S3 with auto-generated names.
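A typical listing looks something like this (illustrative only; <uuid> stands for a random identifier that changes on every run):

output/_SUCCESS
output/part-00000-<uuid>-c000.csv
output/part-00001-<uuid>-c000.csv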
I would like to read these CSV files into my Flask application. How am I supposed to read them when the filenames are different on every run? Is there a smarter way to do it?
I assume that you are trying to read them into one dataframe. Since, as per your comment, the 'part' prefix is common to all the output files, you can filter on it:
import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')

# List every object under the output/ folder whose key starts with "part"
prefix_objs = bucket.objects.filter(Prefix="output/part")

prefix_df = []
for obj in prefix_objs:
    try:
        body = obj.get()['Body'].read()
        temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
        prefix_df.append(temp)
    except pd.errors.EmptyDataError:
        # Skip zero-byte part files that Spark sometimes writes
        continue
This reads every file in your bucket under the output folder whose name starts with 'part' and appends each one to a list.
Afterwards, you can concatenate them into a single dataframe:

pd.concat(prefix_df)
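For completeness, here is a minimal sketch of how this could be wired into a Flask route. The bucket name, prefix, route path, and the load_output_frames helper are assumptions for illustration, not part of the original setup:

import io

import boto3
import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)

def load_output_frames(bucket_name='my-bucket', prefix='output/part'):
    # Read every Spark part file under the prefix into one dataframe.
    # Bucket name and prefix are placeholders; adjust to your setup.
    bucket = boto3.resource('s3').Bucket(bucket_name)
    frames = []
    for obj in bucket.objects.filter(Prefix=prefix):
        body = obj.get()['Body'].read()
        frames.append(pd.read_csv(io.BytesIO(body), header=None))
    return pd.concat(frames, ignore_index=True)

@app.route('/results')
def results():
    df = load_output_frames()
    # Return the rows as JSON; orient='records' gives a list of row dicts
    return jsonify(df.to_dict(orient='records'))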