
What pure Python library should I use to scrape a website?

Original 2009-10-13 21:58:03 8 5 python/ google-app-engine/ xpath/ beautifulsoup/ mechanize

I currently have some Ruby code used to scrape some websites. I was using Ruby because at the time I was using Ruby on Rails for a site, and it just made sense.

Now I'm trying to port this over to Google App Engine, and keep getting stuck.

I've ported Python Mechanize to work with Google App Engine, but it doesn't support DOM inspection with XPath.

I've tried the built-in ElementTree, but it choked on the first HTML blob I gave it when it ran into '&mdash'.
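The failure described above can be reproduced with a minimal sketch. It assumes Python 3 and a made-up one-line snippet (`SNIPPET`), not the actual page from the question. ElementTree is an XML parser, so it rejects HTML named entities such as `&mdash;` that XML does not define. The standard-library `html.parser.HTMLParser`, shown here as one possible pure-Python alternative rather than anything the question used, decodes them.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical HTML fragment containing a named entity that XML does not define.
SNIPPET = "<p>Scraping &mdash; the hard way</p>"


def parses_as_xml(markup):
    """Return True if ElementTree accepts the markup, False on a parse error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class TextCollector(HTMLParser):
    """Collects text nodes; named entities are decoded by the parser itself."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any buffered text
    return "".join(collector.chunks)
```

Here `parses_as_xml(SNIPPET)` reports the ElementTree failure, while `extract_text(SNIPPET)` returns the text with the entity decoded to a dash character.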

Do I keep trying to hack ElementTree in there, or do I try to use something else?

thanks, Mark

5 answers

Beautiful Soup.

lxml - 100 times better than elementtree

There's also scrapy, which might be more up your alley.

There are a number of examples of web page scrapers written using pyparsing, such as this one (extracts all URL links from yahoo.com) and this one (for extracting the NIST NTP server addresses). Be sure to use the pyparsing helper method makeHTMLTags instead of hand coding "<" + Literal(tagname) + ">". makeHTMLTags creates a very robust parser, with accommodation for extra spaces, upper/lower case inconsistencies, unexpected attributes, attribute values with various quoting styles, and so on. Pyparsing will also give you more control over special syntax issues, such as custom entities. It is also pure Python, liberally licensed, and has a small footprint (a single source module), so it is easy to drop into your GAE app alongside your other application code.
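As a rough sketch of the makeHTMLTags approach, assuming pyparsing is importable and using a made-up HTML string (`PAGE`) rather than a real site, link extraction might look like this. The helper name `extract_links` is hypothetical.

```python
from pyparsing import makeHTMLTags

# Hypothetical page fragment with inconsistent case, quoting, and spacing.
PAGE = """
<a href="http://example.com/a">first</a>
<A HREF='/b'>second</A>
<a  class="nav"  href="/c" >third</a>
<p>no link here</p>
"""


def extract_links(html):
    """Return the href value of every anchor start tag found in html."""
    # makeHTMLTags returns expressions for the opening and closing tag.
    a_start, _a_end = makeHTMLTags("a")
    links = []
    # scanString yields (tokens, start, end) for each match in the input.
    for tokens, _start, _end in a_start.scanString(html):
        # Attribute names are lowercased, so HREF is available as href.
        if "href" in tokens:
            links.append(tokens["href"])
    return links
```

Calling `extract_links(PAGE)` collects the three href values even though the tags differ in case, quote style, and attribute order, which is the robustness a hand-coded "<" + Literal(tagname) + ">" expression would lack.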

BeautifulSoup is good, but its API is awkward. Try ElementSoup, which provides an ElementTree interface to BeautifulSoup.
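To give a feel for the ElementTree-style interface mentioned here, the sketch below uses only the standard-library `xml.etree.ElementTree` on a well-formed, made-up fragment (`XHTML`). It does not use ElementSoup itself. The assumption is that a tree built from messier HTML would be queried the same way, with `findall` and ElementTree's limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; real pages usually need a lenient parser first.
XHTML = """<html><body>
<div class="post"><a href="/first">First</a></div>
<div class="ad"><a href="/buy">Buy</a></div>
<div class="post"><a href="/second">Second</a></div>
</body></html>"""


def post_links(markup):
    """Return (href, text) pairs for anchors directly inside div.post elements."""
    root = ET.fromstring(markup)
    # ElementTree supports a subset of XPath, including attribute predicates.
    anchors = root.findall(".//div[@class='post']/a")
    return [(a.get("href"), a.text) for a in anchors]
```

`post_links(XHTML)` skips the anchor inside the `ad` div because the attribute predicate only matches divs whose class is `post`.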


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1
11 2009-10-13 22:01:06

solution2
6 2009-10-13 22:28:18

solution3
4 2009-10-13 22:29:49

solution4
0 2009-10-13 23:01:53

solution5
0 2009-11-25 00:18:51
