Web Scraping with Google Compute Engine / App Engine
I've written a Python script that uses Selenium to scrape information from a website and stores it in a CSV file. It works well on my local machine when I execute it manually, but I now want to run the script automatically once per hour for several weeks and save the data in a database. The script takes about 5-10 minutes to run.
I've just started off with Google Cloud, and it looks like there are several ways of implementing this with either Compute Engine or App Engine. So far I get stuck at a certain point with all three approaches I've found (e.g. getting the scheduled task to call a URL of my backend instance and getting that instance to kick off the script). I've tried to:
I'd be curious to hear from others what they would recommend as the easiest and most appropriate way, given that this is truly a backend script that does not need a user front end.
App Engine is feasible, but only if you limit your use of Selenium to a .remote out to a site such as http://crossbrowsertesting.com/ -- feasible, but messy.
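A minimal sketch of what that .remote approach looks like: instead of launching a local browser (which App Engine won't allow), the script connects to a hosted Selenium hub. The hub URL, credentials handling, and capability values below are assumptions, not something from the original answer -- substitute whatever endpoint your provider gives you.

```python
def build_capabilities(browser="chrome", platform="LINUX"):
    """Build the desired-capabilities dict sent to a remote Selenium hub.

    The browser/platform values here are illustrative; a real provider
    documents the exact capability names and values it accepts.
    """
    return {"browserName": browser, "platform": platform}


if __name__ == "__main__":
    # Selenium is imported here (not at module level) so the capability
    # helper above can be reused without Selenium installed. Connecting
    # requires a reachable hub and, usually, account credentials.
    from selenium import webdriver

    # Hypothetical hub endpoint -- replace with your provider's URL.
    remote_hub = "http://hub.crossbrowsertesting.com:80/wd/hub"

    driver = webdriver.Remote(
        command_executor=remote_hub,
        desired_capabilities=build_capabilities(),
    )
    try:
        driver.get("https://example.com/")
        print(driver.title)
    finally:
        driver.quit()
```

The "messy" part the answer alludes to: you depend on a third-party service for every run, and the capabilities/authentication details vary by provider.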
I'd use Compute Engine -- cron is trivial to use on any Linux image; see e.g. http://www.thegeekstuff.com/2009/06/15-practical-crontab-examples/ !
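For the hourly schedule described in the question, a single crontab entry on the Compute Engine VM is enough (edit with `crontab -e`). The script path, interpreter path, and log file below are hypothetical -- adjust them to where your script actually lives:

```
# Run the scraper at the top of every hour, appending stdout/stderr to a log.
# All paths here are placeholders.
0 * * * * /usr/bin/python /home/me/scraper.py >> /home/me/scraper.log 2>&1
```

Redirecting output to a log file matters for an unattended job that runs for weeks: cron itself won't show you why a run failed.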