
Web Scraping with Google Compute Engine / App Engine

I've written a Python script that uses Selenium to scrape information from a website and store it in a CSV file. It works well on my local machine when I execute it manually, but I now want to run the script automatically once per hour for several weeks and save the data in a database. The script takes about 5-10 minutes to run.

I've just started with Google Cloud, and it looks like there are several ways of implementing this with either Compute Engine or App Engine. So far, I get stuck at some point with all three approaches I found (e.g. getting the scheduled task to call a URL of my backend instance and getting that instance to kick off the script). I've tried to:

  • Execute the script via Compute Engine and use Datastore or Cloud SQL. It is unclear to me whether crontab can easily be set up.
  • Use Task Queues and Scheduled Tasks on App Engine.
  • Use a backend instance and Scheduled Tasks on App Engine.

I'd be curious to hear what others would recommend as the easiest and most appropriate way, given that this is purely a backend script that does not need a user front end.

App Engine is feasible, but only if you limit your use of Selenium to a .remote call out to a site such as http://crossbrowsertesting.com/ (feasible but messy).
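To make the ".remote" idea concrete, here is a rough sketch of driving a browser that runs somewhere else. It assumes the Selenium 4 Python bindings; the `SELENIUM_HUB_URL` variable, the default address, and the `hub_url` and `fetch_title` helpers are made up for illustration, and a hosted service would give you its own hub URL and credentials.

```python
import os


def hub_url():
    """Return the remote hub address.

    The SELENIUM_HUB_URL variable and the localhost default are assumptions
    for this sketch, not something a particular provider requires.
    """
    return os.environ.get("SELENIUM_HUB_URL", "http://localhost:4444/wd/hub")


def fetch_title(page_url):
    """Open page_url in a remote browser and return the page title."""
    # Imported lazily so this module still loads where selenium is missing.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # The browser runs on the remote hub, not on the machine running this code.
    driver = webdriver.Remote(command_executor=hub_url(), options=options)
    try:
        driver.get(page_url)
        return driver.title
    finally:
        # Always release the remote session, even if the scrape fails.
        driver.quit()
```

Only the construction of the driver changes compared with a local run; the scraping calls on `driver` stay the same.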

I'd use Compute Engine, and cron is trivial to use on any Linux image; see e.g. http://www.thegeekstuff.com/2009/06/15-practical-crontab-examples/
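As a rough sketch of what the hourly job could look like on a Compute Engine VM: the crontab line in the comment and all paths are assumptions, `scrape_rows` is a hypothetical stand-in for your existing Selenium code, and `sqlite3` is used here only as a local stand-in for whichever database you pick (Cloud SQL, Datastore, ...).

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical crontab entry (add it with `crontab -e`); it runs the script
# at minute 0 of every hour. The paths are assumptions, adjust them to your VM:
# 0 * * * * /usr/bin/python3 /home/me/scraper.py >> /home/me/scraper.log 2>&1


def scrape_rows():
    """Placeholder for the existing Selenium logic; returns (name, value) pairs."""
    return [("example", "42")]


def save_rows(db_path, rows):
    """Append rows to a table, stamping each with the UTC time of this run."""
    run_at = datetime.now(timezone.utc).isoformat()
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS scraped "
                "(run_at TEXT, name TEXT, value TEXT)"
            )
            conn.executemany(
                "INSERT INTO scraped (run_at, name, value) VALUES (?, ?, ?)",
                [(run_at, name, value) for name, value in rows],
            )
    finally:
        conn.close()
    return len(rows)


if __name__ == "__main__":
    save_rows("scraped.db", scrape_rows())
```

Since a run takes 5-10 minutes and cron fires hourly, consecutive runs should not overlap; redirecting output to a log file, as in the commented crontab line, makes failed runs easier to spot.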


 