How to retrieve the output file from Amazon S3 generated from EMR PySpark back into Flask
I am currently trying to connect my Flask application to Amazon EMR using PySpark. I am using the PySpark example from AWS ( https://docs.aws.amazon.com/code-samples/latest/catalog/code-catalog-python-example_code-emr.html ). I use the following code to write the output files:
df.write.mode('overwrite').csv('s3://my-bucket/output')
The output files from Amazon EMR are stored inside Amazon S3 with the following names:
I would like to read the CSV files into my Flask application. How am I supposed to read these files since the filenames are different every time? Is there any smarter way to do it?
I assume that you are trying to read them into one dataframe (also, since the 'part' prefix will be common as per your comment):
import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')

# List every object under the output/ folder whose key starts with "part"
prefix_objs = bucket.objects.filter(Prefix="output/part")

prefix_df = []
for obj in prefix_objs:
    body = obj.get()['Body'].read()
    try:
        temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
    except pd.errors.EmptyDataError:
        # Spark can emit zero-byte part files; skip them
        continue
    prefix_df.append(temp)
This will read every file in your bucket whose key starts with 'part' in the output folder and append it to the list. Afterwards, you can concatenate them with:
pd.concat(prefix_df)
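For completeness, here is a minimal sketch of the read-and-concat step using in-memory byte strings as stand-ins for the S3 object bodies (the sample rows are hypothetical, and `ignore_index=True` is added so the combined frame gets a clean 0..n-1 index instead of repeating each part file's index):

```python
import io

import pandas as pd

# Stand-ins for the bodies of two S3 "part" objects (hypothetical data)
part_bodies = [b"1,alice\n2,bob\n", b"3,carol\n"]

frames = [
    pd.read_csv(io.BytesIO(body), header=None, encoding="utf8", sep=",")
    for body in part_bodies
]

# ignore_index=True re-numbers rows across all part files
df = pd.concat(frames, ignore_index=True)
print(len(df))  # 3 rows across both part files
```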