How to schedule a GCP VM instance to run a custom Python command when the instance starts?

I have a web scraper that scrapes data from an e-commerce site, and right now my data gets stored in BigQuery tables from pandas dataframes. But I am doing all of this manually: starting the VM instance from the GCP console, connecting to it from my local machine over SSH, opening a terminal in the project folder, and running

$ python main.py

to start the scraping. After the process is completed, I turn off the VM instance manually again. What I want is to automate this task: start the VM instance automatically on the first day of every month, scrape the e-commerce site data, and, when the program has finished, turn off the VM instance automatically.

My program takes almost 40 hours to finish getting all the data from the e-commerce site. I looked into Cloud Functions, but the maximum time limit I have seen there is 540 seconds. As my program takes far longer than that to run, I am not sure whether Cloud Functions will work for my case.

Is there any solution to automate these processes? I am very new to GCP, so I am sorry if this is a trivial question.

Cloud Functions is not suitable for long-running tasks, so I think setting up the automated task on GCE is the right decision.

You can shut down the instance from within the instance itself using the Compute Engine API. For example, you can use the gcloud CLI tool with a command like gcloud compute instances stop $instance [1].

Note
Don't forget to set up a service account with the right permissions and attach it to your VM, so that the VM can stop itself through the Compute Engine API [2].
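The $instance value has to come from somewhere. One option is to ask the Compute Engine metadata server from inside the VM. The sketch below is only an illustration: the helper names (read_metadata, parse_zone, current_instance) are made up for this example, and it assumes the code runs on the VM itself, where metadata.google.internal is reachable.

```python
import urllib.request
from typing import Tuple

# Metadata server endpoint, reachable only from inside a Compute Engine VM.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/"


def parse_zone(zone_path: str) -> str:
    """Turn 'projects/123456/zones/us-central1-a' into 'us-central1-a'."""
    return zone_path.rsplit("/", 1)[-1]


def read_metadata(key: str, timeout: float = 5.0) -> str:
    """Read one instance metadata value (hypothetical helper)."""
    request = urllib.request.Request(
        METADATA_URL + key,
        headers={"Metadata-Flavor": "Google"},  # header required by the server
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def current_instance() -> Tuple[str, str]:
    """Return (instance name, zone) of the VM this code is running on."""
    return read_metadata("name"), parse_zone(read_metadata("zone"))
```

The zone is returned as a full resource path, which is why parse_zone keeps only the last segment before the value is passed to gcloud or the API.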

You can also use a startup script [3], a GCE feature that lets you run a command after the VM has started.

So you can create a startup script like the one below, and it will handle your automation.

  • STEP 1. Run python main.py
  • STEP 2. After STEP 1 has finished, run gcloud compute instances stop $instance .
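The two steps above could be sketched as a small Python wrapper that the startup script calls. This is a minimal sketch under assumptions, not a definitive implementation: the function names, the "scraper-vm" instance and the "us-central1-a" zone are hypothetical, gcloud is assumed to be installed on the VM, and the script path must point at wherever main.py actually lives (a startup script does not start in your project folder).

```python
import subprocess
import sys


def build_stop_command(instance, zone):
    """STEP 2 as an argument list for subprocess."""
    return ["gcloud", "compute", "instances", "stop", instance,
            "--zone", zone, "--quiet"]


def run_then_stop(instance, zone, script="main.py", runner=subprocess.run):
    """Run the scraper, then stop the VM even if the scraper crashed.

    `runner` is injectable so the flow can be tried without touching gcloud.
    """
    try:
        # STEP 1: run the scraper with the same interpreter as this wrapper.
        result = runner([sys.executable, script])
        return result.returncode
    finally:
        # STEP 2: always request the stop, so a failed run does not leave
        # the VM running (and billing) until someone notices.
        runner(build_stop_command(instance, zone))


# Example call from the startup script (hypothetical values):
# run_then_stop("scraper-vm", "us-central1-a", script="/opt/scraper/main.py")
```

Stopping in a finally block is a design choice: if you would rather keep the VM up after a failure to inspect it, move the stop call after a check of the return code instead.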

References

[1] gcloud CLI reference
https://cloud.google.com/sdk/gcloud/reference/compute/instances/stop

[2] ServiceAccount with Instance
https://cloud.google.com/compute/docs/access/service-accounts#associating_a_service_account_to_an_instance

[3] Startup Script
https://cloud.google.com/compute/docs/instances/startup-scripts

For example, you could use the following architecture:

  1. Create a VM in Compute Engine, install Python, and put your Python script on it. Edit the VM and add a startup script that launches the Python script, so that the script runs every time the VM starts.
  2. Create a PubSub topic.
  3. At the end of your Python code, add a part that sends a message to the PubSub topic.
  4. Create a Cloud Function that starts the Compute Engine VM. This Cloud Function should be triggered by HTTP.
  5. Create a Cloud Function that stops the Compute Engine VM. This Cloud Function should be triggered by the PubSub topic you defined.
  6. Create a Cloud Scheduler job that triggers the starting Cloud Function (point 4) once per month, or on whatever schedule you need.

So it will work like this: at the beginning of the month, Cloud Scheduler triggers the Cloud Function that starts the VM. The VM starts and automatically launches the startup script, which runs your main.py . When the script has finished, a message is sent to the PubSub topic. The PubSub topic triggers the second Cloud Function, which stops the VM.
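The Python pieces of this flow might look roughly like the sketch below. Treat it as an illustration under assumptions: the project, zone, instance and topic names are hypothetical, the google-cloud-pubsub and google-cloud-compute client libraries are assumed to be installed where each function runs, and the (event, context) signature assumes a background-style Pub/Sub-triggered function. For point 6, a cron expression such as 0 0 1 * * would fire at midnight on the first day of each month.

```python
import base64
import json

# Hypothetical values; replace them with your own project, zone, VM and topic.
PROJECT_ID = "my-project"
ZONE = "us-central1-a"
INSTANCE = "scraper-vm"
TOPIC_ID = "scrape-finished"


def encode_message(instance, zone):
    """Build the Pub/Sub payload that main.py sends when it is done."""
    return json.dumps({"instance": instance, "zone": zone}).encode("utf-8")


def decode_event(event):
    """Decode the base64 'data' field of a Pub/Sub-triggered event."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def notify_finished():
    """Point 3: call this at the very end of main.py."""
    from google.cloud import pubsub_v1  # needs google-cloud-pubsub

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
    # result() blocks until the message has actually been published.
    publisher.publish(topic_path, encode_message(INSTANCE, ZONE)).result()


def start_vm(request):
    """Point 4: HTTP-triggered function, called by Cloud Scheduler."""
    from google.cloud import compute_v1  # needs google-cloud-compute

    compute_v1.InstancesClient().start(
        project=PROJECT_ID, zone=ZONE, instance=INSTANCE
    )
    return "start requested"


def stop_vm(event, context):
    """Point 5: Pub/Sub-triggered function that stops the VM."""
    from google.cloud import compute_v1  # needs google-cloud-compute

    target = decode_event(event)
    compute_v1.InstancesClient().stop(
        project=PROJECT_ID, zone=target["zone"], instance=target["instance"]
    )
```

Putting the instance name and zone into the message keeps the stop function generic, so the same function could serve more than one scraper VM. The service accounts used by both functions still need permission to start and stop instances.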

The same cycle then repeats every following month.
