What pure Python library should I use to scrape a website?

I currently have some Ruby code that scrapes a few websites. I used Ruby because at the time I was building a site with Ruby on Rails, and it just made sense.

Now I'm trying to port this over to Google App Engine, and I keep getting stuck.

I've ported Python Mechanize to work with Google App Engine, but it doesn't support DOM inspection with XPath.

I've tried the built-in ElementTree, but it choked on the first HTML blob I gave it when it ran into '&mdash;'.
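The failure described above can be reproduced with the standard library alone. The sketch below is only an illustration under stated assumptions: it targets Python 3, whereas the App Engine runtime in the question was likely Python 2, and the `SAMPLE` markup, `parse_strict`, `LinkCollector`, and `parse_lenient` are hypothetical names that do not come from the original code. It shows ElementTree's XML parser rejecting an HTML-only entity, and the lenient `html.parser` module accepting the same markup.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample markup containing an HTML entity that plain XML does not define.
SAMPLE = "<p>Scraping &mdash; the hard way. <a href='/next'>Next</a></p>"


def parse_strict(markup):
    """Parse with ElementTree; return the error message if the XML parser rejects it."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)


class LinkCollector(HTMLParser):
    """Collect href values and text using the lenient stdlib HTML parser."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &mdash; into characters.
        super().__init__(convert_charrefs=True)
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def parse_lenient(markup):
    """Return (links, text) extracted from possibly non-XML HTML."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links, "".join(collector.text)


strict_result = parse_strict(SAMPLE)   # an error message, not an Element
links, text = parse_lenient(SAMPLE)    # the same markup parses fine here
```

The strict parser fails because `&mdash;` is defined by HTML, not by XML, so any XML-only parser needs the entity declared or the input cleaned first. An HTML-aware parser avoids that step, at the cost of giving up ElementTree's tree API and path queries.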

Do I keep trying to hack ElementTree in there, or do I try something else?

Thanks, Mark

Beautiful Soup.

lxml - 100x better than ElementTree

There's also scrapy, which might be more up your alley.

There are a number of examples of web page scrapers written using pyparsing, such as this one (extracts all URL links from yahoo.com) and this one (for extracting the NIST NTP server addresses). Be sure to use the pyparsing helper method makeHTMLTags instead of hand coding "<" + Literal(tagname) + ">". makeHTMLTags creates a very robust parser that accommodates extra spaces, upper/lower case inconsistencies, unexpected attributes, attribute values with various quoting styles, and so on. Pyparsing will also give you more control over special syntax issues, such as custom entities. It is also pure Python, liberally licensed, and has a small footprint (a single source module), so it is easy to drop into your GAE app right alongside your other application code.
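As a rough sketch of the makeHTMLTags approach described above, the snippet below pulls link targets and link text out of a small HTML string. It assumes pyparsing is installed. The `SAMPLE` markup, the `label` results name, and `extract_links` are made up for illustration and are not taken from the linked examples. Recent pyparsing releases also offer snake_case spellings (`make_html_tags`, `scan_string`); the older names used here match the answer.

```python
from pyparsing import SkipTo, makeHTMLTags

# Hypothetical markup with inconsistent case, spacing, quoting, and an extra attribute.
SAMPLE = """
<p>Links:
<A HREF="http://example.com/one" class="x">First</A>
<a   href='http://example.com/two'>Second</a>
</p>
"""

# makeHTMLTags returns a pair of expressions matching the opening and closing tag.
a_start, a_end = makeHTMLTags("a")

# An anchor is: opening tag, everything up to the closing tag, closing tag.
link = a_start + SkipTo(a_end)("label") + a_end


def extract_links(markup):
    """Return (href, label) pairs for every <a> element found in markup."""
    return [
        (tokens.href, tokens.label)
        for tokens, _start, _end in link.scanString(markup)
    ]


found = extract_links(SAMPLE)
for href, label in found:
    print(label, "->", href)
```

Attribute names are lowercased by the generated parser, which is why `tokens.href` works even for the upper-case `HREF` in the sample. Because scanString only looks for matches and skips everything else, the rest of the page does not need to be well formed.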

BeautifulSoup is good, but its API is awkward. Try ElementSoup, which provides an ElementTree interface to BeautifulSoup.
