I have written a Python package that trains a neural network. I then package it up using the command below:
python3 setup.py sdist --formats=gztar
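For reference, the sdist contents are driven by `setup.py`. A minimal sketch of what such a file might look like (the package name, version, and layout here are placeholders, not taken from the actual project):

```python
# setup.py — minimal sketch; name and version are placeholders.
from setuptools import find_packages, setup

setup(
    name="yaw-correction-trainer",  # placeholder package name
    version="0.1.0",
    packages=find_packages(),       # picks up packages such as projects/yaw_correction/
    install_requires=[],            # training dependencies can be pinned here
)
```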
When I run this job through the GCP console and manually click through all the options, I get logs from my program as expected (see example below).
However, when I run the exact same job programmatically, no logs appear; only the final error (if one occurs).
In both cases the program is running; I just can't see any of its output. What could the reason for this be? For reference, I have also included the code I use to programmatically start the training process:
import datetime
import shutil
import subprocess
from pathlib import Path

from google.cloud import aiplatform
ENTRY_POINT = "projects.yaw_correction.yaw_correction"
TIMESTAMP = datetime.datetime.strftime(datetime.datetime.now(),"%y%m%d_%H%M%S")
PROJECT = "yaw_correction"
GCP_PROJECT = "our_gcp_project_name"
LOCATION = "europe-west1"
BUCKET_NAME = "our_bucket_name"
DISPLAY_NAME = "Training_Job_" + TIMESTAMP
CONTAINER_URI = "europe-docker.pkg.dev/vertex-ai/training/pytorch-xla.1-9:latest"
MODEL_NAME = "Model_" + TIMESTAMP
ARGS = ["/gcs/fotokite-training-data/yaw_correction/", "--cloud", "--gpu"]
TENSORBOARD = "projects/our_gcp_project_name/locations/europe-west4/tensorboards/yaw_correction"
MACHINE_TYPE = "n1-standard-4"
REPLICA_COUNT = 1
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0
SYNC = False
# Delete existing source distributions
def deleteDist():
    dirpath = Path('dist')
    if dirpath.exists() and dirpath.is_dir():
        shutil.rmtree(dirpath)
# Copy distribution to the cloud bucket storage
deleteDist()
subprocess.run("python3 setup.py sdist --formats=gztar", shell=True)
filename = [x for x in Path('dist').glob('*')]
if len(filename) != 1:
    raise Exception(f"Expected exactly one distribution in dist/, found {len(filename)}")
print(str(filename[0]))
PACKAGE_URI = f"gs://{BUCKET_NAME}/distributions/"
subprocess.run(f"gsutil cp {str(filename[0])} {PACKAGE_URI}", shell=True)
PACKAGE_URI += str(filename[0].name)
deleteDist()
# Initialise the compute instance
aiplatform.init(project=GCP_PROJECT, location=LOCATION, staging_bucket=BUCKET_NAME)
# Schedule the job
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    #script_path="trainer/test.py",
    python_package_gcs_uri=PACKAGE_URI,
    python_module_name=ENTRY_POINT,
    #requirements=['tensorflow_datasets~=4.2.0', 'SQLAlchemy~=1.4.26', 'google-cloud-secret-manager~=2.7.2', 'cloud-sql-python-connector==0.4.2', 'PyMySQL==1.0.2'],
    container_uri=CONTAINER_URI,
)
model = job.run(
    dataset=None,
    #base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/Train_{TIMESTAMP}",
    base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/",
    service_account="vertex-ai-fotokite-service-acc@fotokite-cv-gcp-exploration.iam.gserviceaccount.com",
    environment_variables=None,
    args=ARGS,
    replica_count=REPLICA_COUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    #tensorboard=TENSORBOARD,
    sync=SYNC,
)
print(model)
print("JOB SUBMITTED")
This kind of error ("The replica workerpool0-0 exited with a non-zero status of 1") is usually caused by something going wrong while packaging the Python files, or by a bug in the code itself.
For comparison, a packaged trainer typically contains files like these:
setup.py
demo.egg-info/PKG-INFO
demo.egg-info/SOURCES.txt
demo.egg-info/dependency_links.txt
demo.egg-info/requires.txt
demo.egg-info/top_level.txt
trainer/__init__.py
trainer/metadata.py
trainer/model.py
trainer/task.py
trainer/utils.py
See Google Cloud's official troubleshooting guide for this type of error and for how to get more information about it.
You can also see the official documentation about packaging a training application.
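To verify what actually ended up inside the sdist before uploading it, you can list the archive's members. A minimal, self-contained sketch (it builds a dummy archive purely to demonstrate the helper; in practice you would point it at dist/<package>.tar.gz):

```python
import tarfile
import tempfile
from pathlib import Path

def list_sdist_members(sdist_path):
    """Return the regular files inside a .tar.gz source distribution."""
    with tarfile.open(sdist_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())

# Build a tiny stand-in archive so the helper can be demonstrated offline.
tmp = Path(tempfile.mkdtemp())
pkg_root = tmp / "trainer-0.1"
(pkg_root / "trainer").mkdir(parents=True)
(pkg_root / "setup.py").write_text("# placeholder\n")
(pkg_root / "trainer" / "task.py").write_text("# placeholder\n")
sdist = tmp / "trainer-0.1.tar.gz"
with tarfile.open(sdist, "w:gz") as tar:
    tar.add(pkg_root, arcname="trainer-0.1")

members = list_sdist_members(sdist)
print(members)  # → ['trainer-0.1/setup.py', 'trainer-0.1/trainer/task.py']
```

If the module named as the entry point (here `projects.yaw_correction.yaw_correction`) is missing from the listing, the job will exit with a non-zero status before producing any program output.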