
Connect AWS SageMaker to the AWS Glue Data Catalog - Glue DevEndpoint

I want to connect an AWS SageMaker notebook to the AWS Glue Data Catalog.

I noticed that I can launch a SageMaker notebook either from the Glue DevEndpoint or by creating it directly in SageMaker.

At the moment I am using a SageMaker lifecycle configuration to copy the notebooks from an S3 bucket to the SageMaker instance:

#!/bin/bash -xe
set -e
sudo -u ec2-user -i <<'EOF'
source activate python3
pip install sparkmagic
source deactivate
EOF
CP_SAMPLES=true
s3region=s3.amazonaws.com
SRC_NOTEBOOK_DIR=${Bucket}/sagemaker-notebooks
Sagedir=/home/ec2-user/SageMaker
industry=industry
declare -a notebooks=("NB1.ipynb" "NB2.ipynb" "NB3.ipynb")
download_files(){
   # Iterate over the notebook file names (not the array indices)
   for notebook in "${notebooks[@]}"; do
      aws s3 cp "s3://$SRC_NOTEBOOK_DIR/$notebook" "$Sagedir/$industry/"
   done
}
if [ $CP_SAMPLES = true ]; then
   sudo -u ec2-user mkdir -p $Sagedir/$industry
   mkdir -p $Sagedir/$industry
   download_files
   chmod -R 755 $Sagedir/$industry
   chown -R ec2-user:ec2-user $Sagedir/$industry/.
fi

I am trying to access the data from the notebook using the following script:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
 
glueContext = GlueContext(SparkContext.getOrCreate())
 
persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table_name")

but the notebook environment does not seem to have the awsglue module, and I get the following error:

ModuleNotFoundError: No module named 'awsglue'

I created a Glue version 1.0 DevEndpoint with the argument GLUE_PYTHON_VERSION: 3. The role used by the DevEndpoint has the AWSGlueServiceRole managed policy attached and a trust relationship that allows the Glue service to assume it.

When I look at the SageMaker notebooks related to the DevEndpoint, I see none, and I can't find how to connect an existing notebook to a Glue DevEndpoint.

Is there a way to connect the existing Sagemaker notebook to an existing Glue DevEndpoint?

When you create a SageMaker Notebook instance from AWS Glue, the process attaches a lifecycle configuration that performs the setup needed to work with Glue development endpoints. The generated script looks like this:

#!/bin/bash
set -ex
[ -e /home/ec2-user/glue_ready ] && exit 0

mkdir -p /home/ec2-user/glue
cd /home/ec2-user/glue

# Write dev endpoint in a file which will be used by daemon scripts
glue_endpoint_file="/home/ec2-user/glue/glue_endpoint.txt"

if [ -f $glue_endpoint_file ] ; then
    rm $glue_endpoint_file
fi
echo "https://glue.eu-west-1.amazonaws.com" >> $glue_endpoint_file

ASSETS=s3://aws-glue-jes-prod-eu-west-1-assets/sagemaker/assets/

aws s3 cp ${ASSETS} . --recursive

bash "/home/ec2-user/glue/Miniconda2-4.5.12-Linux-x86_64.sh" -b -u -p "/home/ec2-user/glue/miniconda"

source "/home/ec2-user/glue/miniconda/bin/activate"

tar -xf autossh-1.4e.tgz
cd autossh-1.4e
./configure
make
sudo make install
sudo cp /home/ec2-user/glue/autossh.conf /etc/init/

mkdir -p /home/ec2-user/.sparkmagic
cp /home/ec2-user/glue/config.json /home/ec2-user/.sparkmagic/config.json

mkdir -p /home/ec2-user/SageMaker/Glue\ Examples
mv /home/ec2-user/glue/notebook-samples/* /home/ec2-user/SageMaker/Glue\ Examples/

# ensure SageMaker notebook has permission for the dev endpoint
aws glue get-dev-endpoint --endpoint-name test --endpoint https://glue.eu-west-1.amazonaws.com

# Run daemons as cron jobs and use flock to make sure the daemons are started only if they are not already running
(crontab -l; echo "* * * * * /usr/bin/flock -n /tmp/lifecycle-config-v2-dev-endpoint-daemon.lock /usr/bin/sudo /bin/sh /home/ec2-user/glue/lifecycle-config-v2-dev-endpoint-daemon.sh 2>&1 | tee -a /var/log/sagemaker-lifecycle-config-v2-dev-endpoint-daemon.log") | crontab -

(crontab -l; echo "* * * * * /usr/bin/flock -n /tmp/lifecycle-config-reconnect-dev-endpoint-daemon.lock /usr/bin/sudo /bin/sh /home/ec2-user/glue/lifecycle-config-reconnect-dev-endpoint-daemon.sh 2>&1 | tee -a /var/log/sagemaker-lifecycle-config-reconnect-dev-endpoint-daemon.log") | crontab -

CONNECTION_CHECKER_FILE=/home/ec2-user/glue/dev_endpoint_connection_checker.py
if [ -f "$CONNECTION_CHECKER_FILE" ]; then
    # wait for async dev endpoint connection to come up
    echo "Checking DevEndpoint connection."
    python3 $CONNECTION_CHECKER_FILE
fi

source "/home/ec2-user/glue/miniconda/bin/deactivate"

rm -rf "/home/ec2-user/glue/Miniconda2-4.5.12-Linux-x86_64.sh"

sudo touch /home/ec2-user/glue_ready

I recommend you include your code as part of that lifecycle configuration (or vice versa).
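
As a rough illustration of that suggestion, here is a minimal Python sketch that appends your own steps to the Glue-generated script and registers the result as a new lifecycle configuration through boto3. The helper names (merge_lifecycle_scripts, encode_for_sagemaker, register_config) are hypothetical, and registering the script under OnStart rather than OnCreate is an assumption; check which hook the Glue-created configuration actually uses before copying this.

```python
import base64


def merge_lifecycle_scripts(glue_script: str, custom_script: str) -> str:
    """Append custom steps to the Glue-generated script.

    The shebang of the custom script is dropped so that the merged
    script has a single interpreter line at the top.
    """
    custom_lines = custom_script.strip().splitlines()
    if custom_lines and custom_lines[0].startswith("#!"):
        custom_lines = custom_lines[1:]
    return (
        glue_script.rstrip()
        + "\n\n# --- custom steps ---\n"
        + "\n".join(custom_lines)
        + "\n"
    )


def encode_for_sagemaker(script: str) -> str:
    """The SageMaker API expects the script as a base64-encoded string."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


def register_config(name: str, script: str, client=None):
    """Create a lifecycle configuration containing the merged script.

    Assumption: the script is registered as an OnStart hook.
    """
    if client is None:
        import boto3  # imported lazily so the helpers work without AWS access

        client = boto3.client("sagemaker")
    return client.create_notebook_instance_lifecycle_config(
        NotebookInstanceLifecycleConfigName=name,
        OnStart=[{"Content": encode_for_sagemaker(script)}],
    )
```

Note that the Glue script exits early when /home/ec2-user/glue_ready already exists, so anything appended after it only runs when that guard does not trigger. The new configuration can then be attached to the existing notebook instance from the console, or with update_notebook_instance while the instance is stopped.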
