简体   繁体   中英

Setting up a python screen scraper that could work on Google App engine

I am looking to setup a automated screen scraper that will run on Google app engine using python. I want it to scrape the site and put the specified results into a Entity in app engine. I am looking for some directions on what to use. I have seen beautifulsoup but wonder if people could recommend anything else that could run on Google App engine.

Beautifulsoup runs fine on App Engine (just make sure to use 3.0.8, not the iffy 3.1.0). The main alternative, I think, would be html5lib -- I haven't tries it on App Engine but I believe it does run there (quite slowly -- if that's a problem I think you need to stick with BeautifulSoup), eg this service runs on App Engine and is based on html5lib.

I have had good (although slow) results using mechanize and BeautifulSoup. In fact, to save code space on Google App Engine, I use the (old) version of BeautifulSoup included in mechanize.

I have mechanize in a zip file, mechanize.zip . The index of this zip file looks like:

mechanize/
mechanize/__init__.py
mechanize/_auth.py
mechanize/_beautifulsoup.py
mechanize/_clientcookie.py
... etc

Then in my Python code,

import sys
sys.path.insert(0, 'mechanize.zip')

import mechanize
from mechanize._beautifulsoup import BeautifulSoup

另一种选择是lxml ,但它使用C代码,因此不适用于GAE。

I have used BeautifulSoup with great success parsing HTML. Problem is that's all BeautifulSoup does, is parse the HTML. I ended up writing all the http interactions using urlfetch.

To web-scrape my target I need a full fledged code driven browser that can execute javascript on my target site's pages. I think I'm having to dump the python app and go java so I can use HTMLUnit - prototyping underway. - mattb

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM