Slow reading from AWS S3 bucket
I'm trying to read a file with pandas from an S3 bucket without downloading the file to disk. I've tried boto3 for that:
import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket_name', Key="key")
read_file = io.BytesIO(obj['Body'].read())
df = pd.read_csv(read_file)
And I've also tried s3fs:
import s3fs
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)
with fs.open('bucket_name/path/to/file.csv', 'rb') as f:
    df = pd.read_csv(f)
The issue is that reading the file takes too long: about 3 minutes for a 38 MB file. Is it supposed to be like that? If it is, is there any faster way to do the same thing? If it's not, any suggestions on what might cause the issue?

Thanks!
Based on this answer to a similar issue, you might want to consider which region the bucket you're reading from is in, compared to where you're reading it from. It might be a simple change (assuming you have control over the bucket's location) that could improve performance drastically.
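A quick way to check is to ask S3 where the bucket lives and compare that with your client's region. A minimal sketch using boto3's `get_bucket_location` (a real boto3 call; the bucket name is a placeholder and the helper names are my own):

```python
def normalise_location(constraint):
    # get_bucket_location reports LocationConstraint=None
    # for buckets in us-east-1, so normalise that case
    return constraint or 'us-east-1'

def bucket_region(bucket_name):
    """Return the region the given bucket lives in."""
    import boto3  # imported here so normalise_location stays dependency-free
    s3 = boto3.client('s3')
    resp = s3.get_bucket_location(Bucket=bucket_name)
    return normalise_location(resp['LocationConstraint'])

# Example (requires AWS credentials and a real bucket name):
# print(bucket_region('bucket_name'))
```

If `bucket_region(...)` differs from the region your client is configured for (`boto3.client('s3').meta.region_name`), cross-region transfer overhead is a likely culprit.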