Does the SageMaker Python SDK (training jobs) inherit all permissions from the edge node?

I'm working on a corporate network, training a machine learning model. MLflow tracking works fine from a SageMaker notebook instance, but when I launch a hyperparameter tuning job from that same notebook instance, MLflow tracking fails:

    AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "raise NewConnectionError(
    urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7eff60d845b0>: Failed to establish a new connection: [Errno -2] Name or service not known

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/requests/adapters.py", line 440, in send
        resp = conn.urlopen(
      File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 813, in urlopen
        return self.urlopen(
      [Previous line repeated 2 more times]
      File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 785, in urlopen
        retries = retries.increment(
      File "/opt/conda/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='mlflow.dev.corp.net', port=80): Max retr

The MLflow tracking URI is restricted to corporate access. But I don't see why it blocks the sub-instances launched by the SageMaker SDK, since the training jobs inherit their IAM role ARN from the SageMaker notebook instance. Is there a solution?

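This error isn't related to IAM. An IAM role controls which AWS API calls are allowed, not which hosts a machine can reach over the network. The machine this code runs on (here the training instance, not the notebook) has no network access to mlflow.dev.corp.net: the hostname does not even resolve ("Name or service not known"), and that is what breaks execution.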
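To confirm that the failure is a name-resolution problem rather than a permissions problem, a small check can be run at the top of the training script. This is only a sketch using the Python standard library; the function name `tracking_host_resolves` is made up for illustration and is not part of MLflow or the SageMaker SDK.

```python
import socket
from urllib.parse import urlparse


def tracking_host_resolves(tracking_uri: str) -> bool:
    """Return True if the host in tracking_uri resolves from this machine.

    Hypothetical helper: it only tests name resolution, not whether the
    MLflow server actually accepts connections.
    """
    parsed = urlparse(tracking_uri)
    host = parsed.hostname
    if host is None:
        # No host could be parsed from the URI at all.
        return False
    # Fall back to the scheme's default port when none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        # Same failure class as "[Errno -2] Name or service not known".
        return False
    return True
```

Calling `tracking_host_resolves("http://mlflow.dev.corp.net")` inside the training job and finding it returns False while it returns True on the notebook would point at networking. If the tracking server is only reachable from inside a VPC, one likely fix is to pass the `subnets` and `security_group_ids` arguments to the SageMaker estimator so the training instances run in that VPC; which subnets and security groups to use depends on the corporate network setup and is not stated in the question.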


 