
Is it normal for AWS EMR to send a lot of LIST and HEAD requests for S3 model files?

Asked 2021-05-11 02:25:06. Tags: list, amazon-s3, python-requests, amazon-emr, head

I am using the AWS EMR Cluster service. The cluster runs machine learning tasks such as spark-build that reference model files stored in an S3 bucket. I see a lot of HEAD and LIST requests going to S3, and I am wondering whether it is normal for AWS EMR to send so many LIST and HEAD requests for the S3 model files. Symptom: AWS EMR sends about 2.7 million HEAD and LIST requests per day to S3.

1 Answer

A lot of LIST/HEAD requests get sent.

This is related to how directories are emulated in the Hadoop/Spark/Hive S3 clients: every time a process checks whether there is a directory at a path, it issues a LIST request, possibly preceded by a HEAD request (to see if it's a file).

Then there's the listing of the contents, which means more LIST requests, and finally reading the files. There'll be one HEAD request on every open() call to verify that the file exists and to determine how long it is.
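To make the request pattern above concrete, here is a minimal sketch that counts requests against a toy in-memory store. It is not the S3A or EMRFS implementation: FakeObjectStore, is_directory and open_for_read are hypothetical names invented for illustration, and the exact sequence of probes in a real client may differ.

```python
from collections import Counter


class FakeObjectStore:
    """Toy in-memory stand-in for S3 that counts requests by type."""

    def __init__(self, keys):
        self.objects = dict(keys)  # key -> size in bytes
        self.requests = Counter()

    def head(self, key):
        self.requests["HEAD"] += 1
        return self.objects.get(key)  # None models a 404

    def list(self, prefix):
        self.requests["LIST"] += 1
        return sorted(k for k in self.objects if k.startswith(prefix))


def is_directory(store, path):
    """Probe a path: HEAD to rule out a file, then LIST under 'path/'."""
    if store.head(path) is not None:
        return False  # an object exists at this exact key, so it is a file
    return len(store.list(path.rstrip("/") + "/")) > 0


def open_for_read(store, key):
    """Model open(): one HEAD to check existence and learn the length."""
    size = store.head(key)
    if size is None:
        raise FileNotFoundError(key)
    return size


store = FakeObjectStore({"models/a.bin": 10, "models/b.bin": 20})
assert is_directory(store, "models")   # 1 HEAD + 1 LIST
for key in store.list("models/"):      # 1 LIST for the contents
    open_for_read(store, key)          # 1 HEAD per file
print(dict(store.requests))            # {'HEAD': 3, 'LIST': 2}
```

Even in this toy model, touching two files costs five metadata requests before a single byte is read, which is how the totals climb when many tasks repeat the same probes.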

Files are read with GET requests. Every time there's a seek() or buffered read on the input stream and the data isn't already in a buffer, the client has to do one of the following:

read to the end of the current ranged GET (assuming it's a ranged GET), discarding the data, then issue a new ranged GET
abort() the HTTPS connection and negotiate a new one. Slow.
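The two options above amount to a cost trade-off, which can be sketched as a small decision function. The answer does not say how a client actually chooses, so the rule below is an assumption: plan_reposition and drain_threshold are hypothetical, and the 64 KiB default is only an illustrative value.

```python
def plan_reposition(pos, range_end, drain_threshold=64 * 1024):
    """Choose how to abandon the current ranged GET before seeking elsewhere.

    pos is the current offset in the stream and range_end is the exclusive
    end of the current ranged GET. drain_threshold is a made-up tuning knob,
    not a setting taken from any real client.
    """
    remaining = range_end - pos
    if remaining <= drain_threshold:
        return "drain"  # read and discard the rest, then issue a new ranged GET
    return "abort"      # drop the HTTPS connection and negotiate a new one


print(plan_reposition(0, 1000))
print(plan_reposition(0, 10 * 1024 * 1024))
```

With only a few bytes left, draining is the cheaper way out; with megabytes left, paying for a new connection is likely to cost less than downloading data that will be thrown away.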

Overall, then, that's a lot of IO, especially if the application is inefficient: not caching the output of directory listings or whether files exist, doing needless checks before operations ( if fs.exists(path) fs.delete(path, false) ), and the like. If this is your code, try not to do that.
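The cost of the check-before-delete pattern can be illustrated with a toy call counter. This is a sketch under stated assumptions: CountingFS is a hypothetical stand-in, not the Hadoop FileSystem class, and it counts calls rather than real S3 requests (a single exists() on a real client may itself issue more than one request). It also assumes delete() reports a missing path through its return value instead of failing.

```python
from collections import Counter


class CountingFS:
    """Toy filesystem facade; every call that would hit S3 is counted."""

    def __init__(self, paths):
        self.paths = set(paths)
        self.calls = Counter()

    def exists(self, path):
        self.calls["exists"] += 1
        return path in self.paths

    def delete(self, path):
        """Return True if something was removed, False if nothing was there."""
        self.calls["delete"] += 1
        if path in self.paths:
            self.paths.remove(path)
            return True
        return False


def checked_delete(fs, path):
    # Pattern from the answer: probe first, then delete.
    if fs.exists(path):
        fs.delete(path)


def direct_delete(fs, path):
    # Let delete() report a missing path through its return value.
    fs.delete(path)


paths = ["out/part-0", "out/part-1", "out/part-2"]

a = CountingFS(paths)
for p in paths:
    checked_delete(a, p)

b = CountingFS(paths)
for p in paths:
    direct_delete(b, p)

print(sum(a.calls.values()), sum(b.calls.values()))  # 6 3
```

Both variants leave the same end state, but the checked version makes twice as many calls in this model.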

(Disclaimer: this is all guesswork based on experience tuning the open source Hive/Spark apps to work through the S3A connector. I'm assuming the same applies to EMR.)