Connect AWS SageMaker to AWS Glue Data Catalog - Glue DevEndpoint
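I want to connect an AWS SageMaker notebook to the AWS Glue Data Catalog.
I noticed that I can either launch a SageMaker notebook from a Glue DevEndpoint or create one in SageMaker.
At the moment, I am using a SageMaker lifecycle configuration to import notebooks from an S3 bucket into SageMaker: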
#!/bin/bash -xe
set -e
# Install sparkmagic into the python3 conda environment as ec2-user
sudo -u ec2-user -i <<'EOF'
source activate python3
pip install sparkmagic
source deactivate
EOF
CP_SAMPLES=true
s3region=s3.amazonaws.com
# ${Bucket} and the ${!...} escapes below look like CloudFormation Fn::Sub syntax;
# in a plain bash script, use "${notebooks[@]}" and "${notebook}" instead.
SRC_NOTEBOOK_DIR=${Bucket}/sagemaker-notebooks
Sagedir=/home/ec2-user/SageMaker
industry=industry
declare -a notebooks=("NB1.ipynb" "NB2.ipynb" "NB3.ipynb")
# Copy each notebook from S3 into the SageMaker directory
download_files(){
  for notebook in ${!notebooks[@]}; do
    aws s3 cp s3://$SRC_NOTEBOOK_DIR/${!notebook} $Sagedir/$industry
  done
}
if [ $CP_SAMPLES = true ]; then
  sudo -u ec2-user mkdir -p $Sagedir/$industry
  mkdir -p $Sagedir/$industry
  download_files
  chmod -R 755 $Sagedir/$industry
  chown -R ec2-user:ec2-user $Sagedir/$industry/.
fi
I am trying to access the data from the notebook with the following script:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
glueContext = GlueContext(SparkContext.getOrCreate())
persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table_name")
But the awsglue module does not seem to be available, and I get the following error:
ModuleNotFoundError: No module named 'awsglue'
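I created a Glue version 1.0 DevEndpoint with the GLUE_PYTHON_VERSION: 3 parameter. The role used by the DevEndpoint has the AWSGlueServiceRole managed policy attached and a trust relationship that lets the Glue service assume it.
When I look at the related SageMaker notebooks I do not see anything, and I cannot find out how to connect an existing notebook to the Glue DevEndpoint.
Is there a way to connect an existing SageMaker notebook to an existing Glue DevEndpoint?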
When you create a SageMaker notebook instance from AWS Glue, the process attaches a lifecycle configuration that performs the steps needed to use the Glue development endpoint:
#!/bin/bash
set -ex
[ -e /home/ec2-user/glue_ready ] && exit 0
mkdir -p /home/ec2-user/glue
cd /home/ec2-user/glue
# Write dev endpoint in a file which will be used by daemon scripts
glue_endpoint_file="/home/ec2-user/glue/glue_endpoint.txt"
if [ -f $glue_endpoint_file ] ; then
rm $glue_endpoint_file
fi
echo "https://glue.eu-west-1.amazonaws.com" >> $glue_endpoint_file
ASSETS=s3://aws-glue-jes-prod-eu-west-1-assets/sagemaker/assets/
aws s3 cp ${ASSETS} . --recursive
bash "/home/ec2-user/glue/Miniconda2-4.5.12-Linux-x86_64.sh" -b -u -p "/home/ec2-user/glue/miniconda"
source "/home/ec2-user/glue/miniconda/bin/activate"
tar -xf autossh-1.4e.tgz
cd autossh-1.4e
./configure
make
sudo make install
sudo cp /home/ec2-user/glue/autossh.conf /etc/init/
mkdir -p /home/ec2-user/.sparkmagic
cp /home/ec2-user/glue/config.json /home/ec2-user/.sparkmagic/config.json
mkdir -p /home/ec2-user/SageMaker/Glue\ Examples
mv /home/ec2-user/glue/notebook-samples/* /home/ec2-user/SageMaker/Glue\ Examples/
# ensure SageMaker notebook has permission for the dev endpoint
aws glue get-dev-endpoint --endpoint-name test --endpoint https://glue.eu-west-1.amazonaws.com
# Run daemons as cron jobs and use flock make sure that daemons are started only iff stopped
(crontab -l; echo "* * * * * /usr/bin/flock -n /tmp/lifecycle-config-v2-dev-endpoint-daemon.lock /usr/bin/sudo /bin/sh /home/ec2-user/glue/lifecycle-config-v2-dev-endpoint-daemon.sh 2>&1 | tee -a /var/log/sagemaker-lifecycle-config-v2-dev-endpoint-daemon.log") | crontab -
(crontab -l; echo "* * * * * /usr/bin/flock -n /tmp/lifecycle-config-reconnect-dev-endpoint-daemon.lock /usr/bin/sudo /bin/sh /home/ec2-user/glue/lifecycle-config-reconnect-dev-endpoint-daemon.sh 2>&1 | tee -a /var/log/sagemaker-lifecycle-config-reconnect-dev-endpoint-daemon.log") | crontab -
CONNECTION_CHECKER_FILE=/home/ec2-user/glue/dev_endpoint_connection_checker.py
if [ -f "$CONNECTION_CHECKER_FILE" ]; then
# wait for async dev endpoint connection to come up
echo "Checking DevEndpoint connection."
python3 $CONNECTION_CHECKER_FILE
fi
source "/home/ec2-user/glue/miniconda/bin/deactivate"
rm -rf "/home/ec2-user/glue/Miniconda2-4.5.12-Linux-x86_64.sh"
sudo touch /home/ec2-user/glue_ready
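Once the script above has finished, one quick check is to read the sparkmagic config file it copied into place. The following is only a minimal sketch: it assumes the file follows sparkmagic's usual layout, with a `kernel_python_credentials` section holding a `url`. The actual contents of the file shipped in the Glue assets are not shown in this post, and the helper name `livy_url` is hypothetical.

```python
import json
from pathlib import Path

# Location the lifecycle script copies the sparkmagic config to.
SPARKMAGIC_CONFIG = Path("/home/ec2-user/.sparkmagic/config.json")


def livy_url(config_path=SPARKMAGIC_CONFIG):
    """Return the Livy URL configured for the PySpark kernel, or None.

    Assumes the usual sparkmagic layout (an assumption, not confirmed
    by the post): {"kernel_python_credentials": {"url": "..."}}
    """
    path = Path(config_path)
    if not path.is_file():
        return None  # config was never copied, so the setup did not run
    with path.open() as fh:
        config = json.load(fh)
    return config.get("kernel_python_credentials", {}).get("url")


if __name__ == "__main__":
    print("Livy URL for the PySpark kernel:", livy_url())
```

If this returns `None` on the notebook instance, the lifecycle configuration probably did not run, or the config file has a different layout than assumed here.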
I suggest you include your own code as part of the lifecycle configuration above (or vice versa).
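One way to combine the two is to concatenate both scripts into a single script and register it as a new lifecycle configuration. This is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names and the choice of `OnStart` over `OnCreate` are assumptions, not something the generated configuration is confirmed to use. The SageMaker API expects the script content base64-encoded, with a maximum length of 16384 characters.

```python
import base64

MAX_ENCODED_LENGTH = 16384  # documented limit for lifecycle script content


def merge_scripts(custom_script, glue_script):
    """Concatenate two shell scripts into one.

    The custom part goes first because the Glue script exits early
    (`exit 0`) when /home/ec2-user/glue_ready already exists, which
    would skip anything appended after it.
    """
    return custom_script.rstrip("\n") + "\n\n" + glue_script.lstrip("\n")


def encode_script(script):
    """Base64-encode a script the way the SageMaker API expects."""
    encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")
    if len(encoded) > MAX_ENCODED_LENGTH:
        raise ValueError(
            "encoded script exceeds %d characters" % MAX_ENCODED_LENGTH
        )
    return encoded


def register_config(name, script, region_name=None):
    """Create a lifecycle configuration that runs `script` on start.

    Using OnStart here is an assumption; switch to OnCreate if the
    script should run only when the instance is first created.
    """
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("sagemaker", region_name=region_name)
    return client.create_notebook_instance_lifecycle_config(
        NotebookInstanceLifecycleConfigName=name,
        OnStart=[{"Content": encode_script(script)}],
    )
```

The new configuration can then be attached to the existing notebook instance, typically while it is stopped, by passing `LifecycleConfigName` to the SageMaker client's `update_notebook_instance` call.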