Setting up a Python screen scraper that can run on Google App Engine

Question

I am looking to set up an automated screen scraper that will run on Google App Engine using Python. I want it to scrape a site and put the specified results into an entity in App Engine. I am looking for some direction on what to use. I have seen BeautifulSoup, but I wonder if people could recommend anything else that runs on Google App Engine.

Answer 1

BeautifulSoup runs fine on App Engine (just make sure to use 3.0.8, not the iffy 3.1.0). The main alternative, I think, would be html5lib. I haven't tried it on App Engine, but I believe it does run there, though quite slowly; if that's a problem, I think you need to stick with BeautifulSoup. For example, this service runs on App Engine and is based on html5lib.

Answer 2

I have had good (although slow) results using mechanize and BeautifulSoup. In fact, to save code space on Google App Engine, I use the (old) version of BeautifulSoup included in mechanize.

I have mechanize in a zip file, mechanize.zip. The index of this zip file looks like:

mechanize/
mechanize/__init__.py
mechanize/_auth.py
mechanize/_beautifulsoup.py
mechanize/_clientcookie.py
... etc

Then in my Python code,

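import sys
# Put the zip archive first on the module search path.
sys.path.insert(0, 'mechanize.zip')

# Both imports are now resolved from inside mechanize.zip.
import mechanize
from mechanize._beautifulsoup import BeautifulSoup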
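
The zip trick above relies on Python's built-in ability to import packages from a zip archive listed on sys.path. Here is a minimal, self-contained sketch of that mechanism. The names demo_pkg, _helper, and demo_bundle.zip are hypothetical stand-ins for mechanize, its submodules, and mechanize.zip; the archive is built on the fly only so the example can run anywhere.

```python
import os
import sys
import tempfile
import zipfile

# Hypothetical names: stand-ins for mechanize.zip and the mechanize package.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "demo_bundle.zip")

# Build an archive whose index mirrors the layout shown above:
# the package directory sits at the top level of the zip.
with zipfile.ZipFile(zip_path, "w") as bundle:
    bundle.writestr("demo_pkg/__init__.py", "")
    bundle.writestr(
        "demo_pkg/_helper.py",
        "def greet():\n    return 'imported from zip'\n",
    )

# Put the archive first on sys.path so the import system searches it first.
sys.path.insert(0, zip_path)

# This import is served from inside the zip archive.
from demo_pkg._helper import greet

message = greet()
print(message)
```

Note that a relative entry such as 'mechanize.zip' is resolved against the current working directory, so the archive has to sit where the application is started from.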

Answer 3

Another option is lxml, but it uses C code, so it does not work on GAE.

Answer 4

I have used BeautifulSoup with great success for parsing HTML. The problem is that parsing HTML is all BeautifulSoup does, so I ended up writing all the HTTP interactions using urlfetch.
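
The split described here, one piece that fetches and one that parses, can be sketched roughly as follows. This is not the code from the answer: it assumes current Python 3, uses the standard library's html.parser as a stand-in for BeautifulSoup, and takes the fetch function as a parameter. The names LinkCollector, scrape_links, FakeResponse, and fake_fetch are hypothetical. The only thing assumed about the fetcher is that its result has status_code and content attributes, as the result of App Engine's urlfetch.fetch does.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for a BeautifulSoup query)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def scrape_links(url, fetch):
    """Fetch url with the supplied callable, then parse the response body.

    fetch must return an object with status_code and content attributes.
    """
    response = fetch(url)
    if response.status_code != 200:
        return []
    body = response.content
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(body)
    parser.close()
    return parser.links


class FakeResponse:
    """Hypothetical fake response so the sketch runs outside App Engine."""

    def __init__(self, status_code, content):
        self.status_code = status_code
        self.content = content


def fake_fetch(url):
    # Returns a fixed page instead of making a network request.
    html = b'<html><body><a href="/a">A</a> <a href="http://example.com/b">B</a></body></html>'
    return FakeResponse(200, html)


links = scrape_links("http://example.com/", fake_fetch)
print(links)
```

On App Engine, one would presumably pass urlfetch.fetch (from google.appengine.api) as the fetch argument in place of fake_fetch. Keeping the fetcher injectable also makes the parsing step easy to test without network access.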

To scrape my target, I need a full-fledged, code-driven browser that can execute JavaScript on the target site's pages. I think I will have to drop the Python app and switch to Java so I can use HTMLUnit (prototyping is underway). - mattb

solution1
4 ACCEPTED 2010-03-09 02:24:30

solution2
1 2010-10-16 01:22:22

solution3
0 2010-03-09 01:42:13

solution4
0 2010-04-17 22:41:36
