
Web Scraping with Google Compute Engine / App Engine

I've written a Python script that uses Selenium to scrape information from a website and stores it in a CSV file. It works well on my local machine when I execute it manually, but I now want to run the script automatically once per hour for several weeks and save the data in a database. The script takes about 5-10 minutes to run.

I've just started with Google Cloud, and it looks like there are several ways of implementing this with either Compute Engine or App Engine. So far, I get stuck at a certain point with each of the three approaches I've found (e.g. getting the scheduled task to call a URL on my backend instance, and getting that instance to kick off the script). I've tried to:

  • Execute the script on Compute Engine and store the results in Datastore or Cloud SQL (a storage sketch follows this list). It's unclear to me whether crontab can easily be set up there.
  • Use Task Queues and Scheduled Tasks on App Engine.
  • Use a backend instance and Scheduled Tasks on App Engine.
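For the storage half of the first option, here is a minimal sketch of writing one scraped row into a Cloud SQL (MySQL) database. It assumes the pymysql driver, placeholder connection details, and a hypothetical results table with scraped_at and value columns:

    import pymysql

    # Hypothetical connection details -- use your Cloud SQL instance's
    # address, credentials, and database name.
    conn = pymysql.connect(host="203.0.113.5", user="scraper",
                           password="secret", database="scrapes")
    try:
        with conn.cursor() as cur:
            # Assumed schema: results(scraped_at DATETIME, value TEXT)
            cur.execute(
                "INSERT INTO results (scraped_at, value) VALUES (NOW(), %s)",
                ("some scraped value",),
            )
        conn.commit()  # persist the insert
    finally:
        conn.close()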

I'd be curious to hear what others would recommend as the easiest and most appropriate approach, given that this is truly a backend script that needs no user-facing front end.

App Engine is feasible, but only if you limit your use of Selenium to a webdriver.Remote connection out to a hosted browser service such as http://crossbrowsertesting.com/ -- feasible, but messy.
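A minimal sketch of that approach, assuming the Selenium 3-style webdriver.Remote API (newer Selenium versions take an options object instead of desired_capabilities) and a placeholder hub URL -- substitute whatever endpoint and capabilities your remote provider documents:

    from selenium import webdriver

    # Hypothetical remote Selenium endpoint -- replace with your provider's hub URL.
    REMOTE_HUB = "http://hub.example-browser-service.com/wd/hub"

    # webdriver.Remote drives a browser running on the remote service, so no
    # browser binary needs to exist on the App Engine side.
    driver = webdriver.Remote(
        command_executor=REMOTE_HUB,
        desired_capabilities={"browserName": "chrome"},
    )
    try:
        driver.get("https://example.com")
        print(driver.title)  # confirm the remote session is alive
    finally:
        driver.quit()  # always release the remote browser session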

I'd use Compute Engine -- cron is trivial to use on any Linux image; see e.g. http://www.thegeekstuff.com/2009/06/15-practical-crontab-examples/ for practical examples.
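For instance, a crontab entry (added with crontab -e) that runs the scraper at the top of every hour could look like this; the interpreter and script paths are hypothetical placeholders:

    # Run the scraper hourly; append stdout/stderr to a log file for debugging.
    # Adjust the paths to wherever your Python and script actually live.
    0 * * * * /usr/bin/python /home/me/scraper.py >> /home/me/scraper.log 2>&1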
