I have written a Python package that trains a neural network. I then package it up using the command below:
python3 setup.py sdist --formats=gztar
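For reference, the sdist contents are driven by `setup.py`. A minimal sketch of what such a file might look like (the package name, version, and layout here are placeholders, not taken from the actual project):

```python
# setup.py — minimal sketch; name and version are placeholders.
from setuptools import find_packages, setup

setup(
    name="yaw-correction-trainer",  # placeholder package name
    version="0.1.0",
    packages=find_packages(),       # picks up packages such as projects/yaw_correction/
    install_requires=[],            # training dependencies can be pinned here
)
```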
When I run this job through the GCP console and manually click through all the options, I get logs from my program as expected (see example below).
However, when I run the exact same job programmatically, no logs appear; only the final error (if one occurs).
In both cases the program is running; I just can't see any of its output. What could the reason for this be? For reference, I have also included the code I use to programmatically start the training process:
import datetime
import shutil
import subprocess
from pathlib import Path

from google.cloud import aiplatform
ENTRY_POINT = "projects.yaw_correction.yaw_correction"
TIMESTAMP = datetime.datetime.strftime(datetime.datetime.now(),"%y%m%d_%H%M%S")
PROJECT = "yaw_correction"
GCP_PROJECT = "our_gcp_project_name"
LOCATION = "europe-west1"
BUCKET_NAME = "our_bucket_name"
DISPLAY_NAME = "Training_Job_" + TIMESTAMP
CONTAINER_URI = "europe-docker.pkg.dev/vertex-ai/training/pytorch-xla.1-9:latest"
MODEL_NAME = "Model_" + TIMESTAMP
ARGS = ["/gcs/fotokite-training-data/yaw_correction/", "--cloud", "--gpu"]
TENSORBOARD = "projects/our_gcp_project_name/locations/europe-west4/tensorboards/yaw_correction"
MACHINE_TYPE = "n1-standard-4"
REPLICA_COUNT = 1
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0
SYNC = False
# Delete existing source distributions
def deleteDist():
    dirpath = Path('dist')
    if dirpath.exists() and dirpath.is_dir():
        shutil.rmtree(dirpath)
# Copy distribution to the cloud bucket storage
deleteDist()
subprocess.run("python3 setup.py sdist --formats=gztar", shell=True)
filename = [x for x in Path('dist').glob('*')]
if len(filename) != 1:
    raise Exception(f"Expected exactly one distribution in dist/, found {len(filename)}")
print(str(filename[0]))
PACKAGE_URI = f"gs://{BUCKET_NAME}/distributions/"
subprocess.run(f"gsutil cp {str(filename[0])} {PACKAGE_URI}", shell=True)
PACKAGE_URI += str(filename[0].name)
deleteDist()
# Initialise the compute instance
aiplatform.init(project=GCP_PROJECT, location=LOCATION, staging_bucket=BUCKET_NAME)
# Schedule the job
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    #script_path="trainer/test.py",
    python_package_gcs_uri=PACKAGE_URI,
    python_module_name=ENTRY_POINT,
    #requirements=['tensorflow_datasets~=4.2.0', 'SQLAlchemy~=1.4.26', 'google-cloud-secret-manager~=2.7.2', 'cloud-sql-python-connector==0.4.2', 'PyMySQL==1.0.2'],
    container_uri=CONTAINER_URI,
)
model = job.run(
    dataset=None,
    #base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/Train_{TIMESTAMP}",
    base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/",
    service_account="vertex-ai-fotokite-service-acc@fotokite-cv-gcp-exploration.iam.gserviceaccount.com",
    environment_variables=None,
    args=ARGS,
    replica_count=REPLICA_COUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    #tensorboard=TENSORBOARD,
    sync=SYNC,
)
print(model)
print("JOB SUBMITTED")
This kind of error ("The replica workerpool0-0 exited with a non-zero status of 1") is usually caused by something going wrong while packaging the Python files, or by a bug in the code itself.
For comparison, a packaged trainer typically contains files like these:
setup.py
demo.egg-info/PKG-INFO
demo.egg-info/SOURCES.txt
demo.egg-info/dependency_links.txt
demo.egg-info/requires.txt
demo.egg-info/top_level.txt
trainer/__init__.py
trainer/metadata.py
trainer/model.py
trainer/task.py
trainer/utils.py
See Google Cloud's official troubleshooting guide for this type of error and for how to get more information about it.
You can also see the official documentation about packaging a training application.
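To verify what actually ended up inside the sdist before uploading it, you can list the archive's members. A minimal, self-contained sketch (it builds a dummy archive purely to demonstrate the helper; in practice you would point it at dist/<package>.tar.gz):

```python
import tarfile
import tempfile
from pathlib import Path

def list_sdist_members(sdist_path):
    """Return the regular files inside a .tar.gz source distribution."""
    with tarfile.open(sdist_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())

# Build a tiny stand-in archive so the helper can be demonstrated offline.
tmp = Path(tempfile.mkdtemp())
pkg_root = tmp / "trainer-0.1"
(pkg_root / "trainer").mkdir(parents=True)
(pkg_root / "setup.py").write_text("# placeholder\n")
(pkg_root / "trainer" / "task.py").write_text("# placeholder\n")
sdist = tmp / "trainer-0.1.tar.gz"
with tarfile.open(sdist, "w:gz") as tar:
    tar.add(pkg_root, arcname="trainer-0.1")

members = list_sdist_members(sdist)
print(members)  # → ['trainer-0.1/setup.py', 'trainer-0.1/trainer/task.py']
```

If the module named as the entry point (here `projects.yaw_correction.yaw_correction`) is missing from the listing, the job will exit with a non-zero status before producing any program output.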