
Google Cloud Platform Vertex AI logs not showing in custom job

I have written a Python package that trains a neural network. I then package it up using the command below.

python3 setup.py sdist --formats=gztar
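
For reference, the setup.py behind this command looks roughly like the following (a minimal sketch; the package name, version, and layout here are placeholders, not my actual values):

from setuptools import setup, find_packages

# Minimal setup.py sketch that the sdist command above would package.
setup(
    name="yaw_correction",  # placeholder name
    version="0.1.0",        # placeholder version
    packages=find_packages(),
)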

When I run this job through the GCP console and manually click through all the options, I get logs from my program as expected (see the example below).

Example successful logs: [screenshot of the job's log output]

However, when I run the exact same job programmatically, no logs appear; only the final error shows up (if one occurs):

Example logs missing: [screenshot of the empty log output]

In both cases the program is running; I just can't see any of its output. What could the reason for this be? For reference, I have also included the code I use to programmatically start the training process:

import datetime
import shutil
import subprocess
from pathlib import Path

from google.cloud import aiplatform

ENTRY_POINT = "projects.yaw_correction.yaw_correction"
TIMESTAMP = datetime.datetime.now().strftime("%y%m%d_%H%M%S")
PROJECT = "yaw_correction"
GCP_PROJECT = "our_gcp_project_name"
LOCATION = "europe-west1"
BUCKET_NAME = "our_bucket_name"
DISPLAY_NAME = "Training_Job_" + TIMESTAMP
CONTAINER_URI = "europe-docker.pkg.dev/vertex-ai/training/pytorch-xla.1-9:latest"
MODEL_NAME = "Model_" + TIMESTAMP
ARGS = ["/gcs/fotokite-training-data/yaw_correction/", "--cloud", "--gpu"]
TENSORBOARD = "projects/our_gcp_project_name/locations/europe-west4/tensorboards/yaw_correction"

MACHINE_TYPE = "n1-standard-4"
REPLICA_COUNT = 1
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0
SYNC = False

# Delete any existing source distributions
def deleteDist():
    dirpath = Path('dist')
    if dirpath.exists() and dirpath.is_dir():
        shutil.rmtree(dirpath)

# Copy distribution to the cloud bucket storage
deleteDist()
subprocess.run("python3 setup.py sdist --formats=gztar", shell=True)
filenames = list(Path('dist').glob('*'))
if len(filenames) != 1:
    raise Exception(f"Expected exactly one distribution in dist/, found {len(filenames)}")
print(str(filenames[0]))
PACKAGE_URI = f"gs://{BUCKET_NAME}/distributions/"
subprocess.run(f"gsutil cp {str(filenames[0])} {PACKAGE_URI}", shell=True)
PACKAGE_URI += str(filenames[0].name)
deleteDist()

# Initialise the compute instance
aiplatform.init(project=GCP_PROJECT, location=LOCATION, staging_bucket=BUCKET_NAME)

# Schedule the job
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    #script_path="trainer/test.py",
    python_package_gcs_uri=PACKAGE_URI,
    python_module_name=ENTRY_POINT,
    #requirements=['tensorflow_datasets~=4.2.0', 'SQLAlchemy~=1.4.26', 'google-cloud-secret-manager~=2.7.2', 'cloud-sql-python-connector==0.4.2', 'PyMySQL==1.0.2'],
    container_uri=CONTAINER_URI,
)

model = job.run(
    dataset=None,
    #base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/Train_{TIMESTAMP}",
    base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/",
    service_account="vertex-ai-fotokite-service-acc@fotokite-cv-gcp-exploration.iam.gserviceaccount.com",
    environment_variables=None,
    args=ARGS,
    replica_count=REPLICA_COUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    #tensorboard=TENSORBOARD,
    sync=SYNC
)
print(model)
print("JOB SUBMITTED")

This kind of error, "The replica workerpool0-0 exited with a non-zero status of 1", is usually caused by a problem either in how the Python package was built or in the code itself.

You could check the following:

  • You could check that all the files (training files and dependencies) are actually in the package, as in this example (see the sketch after this list):

 setup.py
 demo/PKG-INFO
 demo/SOURCES.txt
 demo/dependency_links.txt
 demo/requires.txt
 demo/top_level.txt
 trainer/__init__.py
 trainer/metadata.py
 trainer/model.py
 trainer/task.py
 trainer/utils.py
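
A quick way to verify this is to list the contents of the generated sdist before uploading it. A minimal sketch in Python (it assumes the archive is the single *.tar.gz that setup.py wrote under dist/):

import tarfile
from pathlib import Path

# List every file inside the sdist to confirm that the trainer modules
# and dependency metadata were actually packaged.
archive = next(Path("dist").glob("*.tar.gz"))  # assumes one gztar sdist exists
with tarfile.open(archive, "r:gz") as tar:
    for name in tar.getnames():
        print(name)

If a module or data file is missing from this listing, the job will fail at import time on Vertex AI even though the same code runs locally.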
