Connect AWS SageMaker to AWS Glue Data Catalog - Glue DevEndpoint
I want to connect an AWS SageMaker notebook to the AWS Glue Data Catalog.
I noticed that I can launch a SageMaker notebook from the Glue DevEndpoint or create one in SageMaker.
At the moment I am using a SageMaker lifecycle configuration to import the notebooks from an S3 bucket into SageMaker:
#!/bin/bash -xe
set -e

# Install sparkmagic into the python3 conda environment as ec2-user
sudo -u ec2-user -i <<'EOF'
source activate python3
pip install sparkmagic
source deactivate
EOF

CP_SAMPLES=true
s3region=s3.amazonaws.com
# ${Bucket} is not set in this script; it is presumably substituted before the script runs
SRC_NOTEBOOK_DIR=${Bucket}/sagemaker-notebooks
Sagedir=/home/ec2-user/SageMaker
industry=industry
declare -a notebooks=("NB1.ipynb" "NB2.ipynb" "NB3.ipynb")

# Copy each notebook from S3 into the target directory
download_files(){
    # "${notebooks[@]}" iterates over the file names; ${!notebooks[@]} would yield the indices
    for notebook in "${notebooks[@]}"; do
        aws s3 cp "s3://$SRC_NOTEBOOK_DIR/$notebook" "$Sagedir/$industry/"
    done
}

if [ "$CP_SAMPLES" = true ]; then
    sudo -u ec2-user mkdir -p "$Sagedir/$industry"
    download_files
    chmod -R 755 "$Sagedir/$industry"
    chown -R ec2-user:ec2-user "$Sagedir/$industry/."
fi
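The copy step is meant to pair each notebook name with an S3 source and a local destination. A minimal Python sketch of that mapping can help check the paths before they go into the lifecycle script. The bucket name `my-bucket` and the helper `build_copy_plan` are hypothetical, introduced only for illustration; the sketch prints the `aws s3 cp` commands rather than running them.

```python
from pathlib import PurePosixPath


def build_copy_plan(bucket_prefix, notebooks, dest_dir):
    """Return one (s3_uri, local_path) pair per notebook name."""
    prefix = bucket_prefix.strip("/")
    return [
        (f"s3://{prefix}/{name}", str(PurePosixPath(dest_dir) / name))
        for name in notebooks
    ]


plan = build_copy_plan(
    "my-bucket/sagemaker-notebooks",  # hypothetical bucket, stands in for ${Bucket}
    ["NB1.ipynb", "NB2.ipynb", "NB3.ipynb"],
    "/home/ec2-user/SageMaker/industry",
)

# Print the commands for inspection; nothing is executed here.
for src, dst in plan:
    print("aws s3 cp", src, dst)
```

Each printed line should contain a full notebook file name in both the source and the destination, separated by a space.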
I am trying to access the data from the notebook using the following script:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
glueContext = GlueContext(SparkContext.getOrCreate())
persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table_name")
but it seems the awsglue module is not available, and I get the following error:
ModuleNotFoundError: No module named 'awsglue'
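This error means the Python interpreter behind the current kernel cannot find `awsglue`. As far as I know, that package is part of the Glue runtime rather than the stock SageMaker conda kernels, so the code has to run on the DevEndpoint (for example through a sparkmagic kernel) instead of locally. A quick diagnostic sketch, assuming nothing beyond the standard library, shows which interpreter a cell is using and which of the relevant modules it can see:

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the current interpreter can locate the top-level module `name`."""
    return importlib.util.find_spec(name) is not None


# Show which Python the kernel runs and what it can import.
print("interpreter:", sys.executable)
for name in ("awsglue", "pyspark", "sparkmagic"):
    print(name, "available:", module_available(name))
```

If `awsglue` is reported as unavailable, the cell is running in a local kernel and not on the DevEndpoint.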
I created a Glue version 1.0 DevEndpoint with the argument GLUE_PYTHON_VERSION: 3. The role used by the DevEndpoint has the AWSGlueServiceRole managed policy attached and a trust relationship that allows the Glue service to assume it.
When I look at the DevEndpoint's related SageMaker notebooks I see none, and I can't find how to connect an existing notebook to a Glue DevEndpoint.
Is there a way to connect the existing SageMaker notebook to an existing Glue DevEndpoint?
When you create a SageMaker notebook instance from AWS Glue, the process attaches a lifecycle configuration that performs the setup needed to work with Glue development endpoints:
#!/bin/bash
set -ex
[ -e /home/ec2-user/glue_ready ] && exit 0
mkdir -p /home/ec2-user/glue
cd /home/ec2-user/glue
# Write dev endpoint in a file which will be used by daemon scripts
glue_endpoint_file="/home/ec2-user/glue/glue_endpoint.txt"
if [ -f $glue_endpoint_file ] ; then
rm $glue_endpoint_file
fi
echo "https://glue.eu-west-1.amazonaws.com" >> $glue_endpoint_file
ASSETS=s3://aws-glue-jes-prod-eu-west-1-assets/sagemaker/assets/
aws s3 cp ${ASSETS} . --recursive
bash "/home/ec2-user/glue/Miniconda2-4.5.12-Linux-x86_64.sh" -b -u -p "/home/ec2-user/glue/miniconda"
source "/home/ec2-user/glue/miniconda/bin/activate"
tar -xf autossh-1.4e.tgz
cd autossh-1.4e
./configure
make
sudo make install
sudo cp /home/ec2-user/glue/autossh.conf /etc/init/
mkdir -p /home/ec2-user/.sparkmagic
cp /home/ec2-user/glue/config.json /home/ec2-user/.sparkmagic/config.json
mkdir -p /home/ec2-user/SageMaker/Glue\ Examples
mv /home/ec2-user/glue/notebook-samples/* /home/ec2-user/SageMaker/Glue\ Examples/
# ensure SageMaker notebook has permission for the dev endpoint
aws glue get-dev-endpoint --endpoint-name test --endpoint https://glue.eu-west-1.amazonaws.com
# Run daemons as cron jobs and use flock make sure that daemons are started only iff stopped
(crontab -l; echo "* * * * * /usr/bin/flock -n /tmp/lifecycle-config-v2-dev-endpoint-daemon.lock /usr/bin/sudo /bin/sh /home/ec2-user/glue/lifecycle-config-v2-dev-endpoint-daemon.sh 2>&1 | tee -a /var/log/sagemaker-lifecycle-config-v2-dev-endpoint-daemon.log") | crontab -
(crontab -l; echo "* * * * * /usr/bin/flock -n /tmp/lifecycle-config-reconnect-dev-endpoint-daemon.lock /usr/bin/sudo /bin/sh /home/ec2-user/glue/lifecycle-config-reconnect-dev-endpoint-daemon.sh 2>&1 | tee -a /var/log/sagemaker-lifecycle-config-reconnect-dev-endpoint-daemon.log") | crontab -
CONNECTION_CHECKER_FILE=/home/ec2-user/glue/dev_endpoint_connection_checker.py
if [ -f "$CONNECTION_CHECKER_FILE" ]; then
# wait for async dev endpoint connection to come up
echo "Checking DevEndpoint connection."
python3 $CONNECTION_CHECKER_FILE
fi
source "/home/ec2-user/glue/miniconda/bin/deactivate"
rm -rf "/home/ec2-user/glue/Miniconda2-4.5.12-Linux-x86_64.sh"
sudo touch /home/ec2-user/glue_ready
I recommend including your code in that lifecycle configuration (or vice versa).
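One way to combine the two scripts is sketched below, under stated assumptions. The helper names `merge_lifecycle_scripts` and `encode_on_start` are hypothetical. The sketch puts the custom steps first, because the Glue script exits early when /home/ec2-user/glue_ready exists and anything after it would then be skipped. The 16384-character limit on the base64-encoded content is my understanding of the SageMaker API constraint and should be verified. The boto3 call is left commented out so the sketch runs without AWS credentials.

```python
import base64

MAX_ENCODED_LEN = 16384  # assumed limit on the base64-encoded script content


def merge_lifecycle_scripts(first, second):
    """Concatenate two shell scripts, dropping the shebang line of the second."""
    second_lines = second.strip().splitlines()
    if second_lines and second_lines[0].startswith("#!"):
        second_lines = second_lines[1:]
    return first.rstrip() + "\n\n# --- second script ---\n" + "\n".join(second_lines) + "\n"


def encode_on_start(script):
    """Build the OnStart list in the shape boto3 expects: [{'Content': <base64>}]."""
    encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")
    if len(encoded) > MAX_ENCODED_LEN:
        raise ValueError(f"encoded script is {len(encoded)} characters, over the assumed limit")
    return [{"Content": encoded}]


# Short placeholders standing in for the two real scripts.
custom_script = "#!/bin/bash -xe\necho 'copy notebooks from S3'\n"
glue_script = "#!/bin/bash\nset -ex\necho 'set up Glue DevEndpoint connection'\n"

merged = merge_lifecycle_scripts(custom_script, glue_script)
on_start = encode_on_start(merged)
print(merged)

# To apply it (requires credentials; the configuration name is hypothetical):
# import boto3
# boto3.client("sagemaker").update_notebook_instance_lifecycle_config(
#     NotebookInstanceLifecycleConfigName="my-glue-config",
#     OnStart=on_start,
# )
```

Keeping the merge as plain string handling makes it easy to inspect the final script before uploading it.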