
Which is best in Python: urllib2, PycURL or mechanize?

OK, so I need to download some web pages using Python and did a quick investigation of my options.

Included with Python:

urllib - it seems to me that I should use urllib2 instead; urllib has no cookie support and handles HTTP/FTP/local files only (no SSL)

urllib2 - a complete HTTP/FTP client; supports most things you need, like cookies, but does not support all HTTP verbs (only GET and POST, no TRACE, etc.) (a cookie-handling sketch follows below this list)

Full featured:

mechanize - can use/save Firefox/IE cookies, take actions like following the second link, and is actively maintained (0.2.5 released in March 2011)

PycURL - supports everything curl does (FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP); bad news: not updated since Sep 9, 2008 (7.19.0)

New possibilities:

urllib3 - supports connection re-using/pooling and file posting

Deprecated (aka use urllib/urllib2 instead):

httplib - HTTP/HTTPS only (no FTP)

httplib2 - HTTP/HTTPS only (no FTP)
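
For reference, a minimal sketch of what cookie handling with urllib2 looks like (assuming Python 2's standard library; the URL is just a placeholder):

# Sketch: urllib2 with cookie handling via cookielib (Python 2; placeholder URL).
import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open('http://example.com/login')  # cookies set here are kept in cj
html = response.read()
print(len(cj))  # number of cookies collected so far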

The first thing that strikes me is that urllib/urllib2/PycURL/mechanize are all pretty mature solutions that work well. mechanize and PycURL ship with a number of Linux distributions (e.g. Fedora 13) and the BSDs, so installation is typically a non-issue (so that's good).

urllib2 looks good, but I'm wondering why PycURL and mechanize both seem so popular. Is there something I'm missing (i.e. if I use urllib2, will I paint myself into a corner at some point)? I'd really like some feedback on the pros and cons of these options so I can make the best choice for myself.

Edit: added note on verb support in urllib2

I think this talk (at PyCon 2009) has the answers for what you're looking for (Asheesh Laroia has lots of experience in the matter), and he points out the good and the bad of most of the options on your list.

From the PyCon 2009 schedule:

Do you find yourself faced with websites that have data you need to extract? Would your life be simpler if you could programmatically input data into web applications, even those tuned to resist interaction by bots?

We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable.

You'll leave with an understanding of when to apply different tools, and learn about a "heavy hammer" for screen scraping that I picked up at a project for the Electronic Frontier Foundation.

Attendees should bring a laptop, if possible, to try the examples we discuss and optionally take notes.

Update: Asheesh Laroia has updated his presentation for PyCon 2010

  • PyCon 2010: Scrape the Web: Strategies for programming websites that don't expect it

     * My motto: "The website is the API."
     * Choosing a parser: BeautifulSoup, lxml, HTMLParse, and html5lib.
     * Extracting information, even in the face of bad HTML: regular expressions, BeautifulSoup, SAX, and XPath.
     * Automatic template reverse-engineering tools.
     * Submitting to forms.
     * Playing with XML-RPC.
     * DO NOT BECOME AN EVIL COMMENT SPAMMER.
     * Countermeasures, and circumventing them:
        o IP address limits
        o Hidden form fields
        o User-agent detection
        o JavaScript
        o CAPTCHAs
     * Plenty of full source code to working examples:
        o Submitting to forms for text-to-speech.
        o Downloading music from web stores.
        o Automating Firefox with Selenium RC to navigate a pure-JavaScript service.
     * Q&A and workshopping
     * Use your power for good, not evil.
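
To give a flavour of the parsing step that outline mentions (this is not code from the talk), here is a minimal extraction sketch using requests and BeautifulSoup 4; the URL is a placeholder:

# Sketch: pull all link targets out of a page with requests + BeautifulSoup 4.
import requests
from bs4 import BeautifulSoup

html = requests.get('http://example.com/').text      # placeholder URL
soup = BeautifulSoup(html, 'html.parser')
links = [a.get('href') for a in soup.find_all('a')]  # href of every <a> tag
print(links[:10])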

Update 2:

PyCon US 2012 - Web scraping: Reliably and efficiently pull data from pages that don't expect it

Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.
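
As a rough illustration of the parallel-downloading idea mentioned in that abstract (not code from the tutorial), this is the classic gevent pattern; the URLs are placeholders:

# Sketch: concurrent downloads with gevent (Python 2 era, matching urllib2).
from gevent import monkey
monkey.patch_all()  # make urllib2's blocking sockets cooperative

import gevent
import urllib2

urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']  # placeholders

def fetch(url):
    return urllib2.urlopen(url).read()

jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=10)
pages = [job.value for job in jobs]  # None for any job that failed or timed out
print(len(pages))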

Python requests is also a good candidate for HTTP work. It has a nicer API, IMHO; here is an example HTTP request from their official documentation:

>>> r = requests.get('https://api.github.com', auth=('user', 'pass'))
>>> r.status_code
204
>>> r.headers['content-type']
'application/json'
>>> r.content
...

  • urllib2 is found in every Python install everywhere, so it is a good base upon which to start.
  • PycURL is useful for people already used to using libcurl; it exposes more of the low-level details of HTTP, plus it gains any fixes or improvements applied to libcurl.
  • mechanize is used to persistently drive a connection much like a browser would (see the sketch after this list).
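
To make that concrete, here is a rough sketch of a browser-like mechanize session (Python 2 era API); the URL, form name and field names are hypothetical:

# Sketch of a mechanize session; URL and form/field names are placeholders.
import re
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)          # skip robots.txt for this example
br.open('http://example.com/login')  # placeholder URL

br.select_form(name='login')         # hypothetical form name
br['username'] = 'me'                # hypothetical field names
br['password'] = 'secret'
br.submit()

# Follow the second matching link, much as a user clicking through would.
br.follow_link(text_regex=re.compile('Next'), nr=1)
print(br.response().read()[:200])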

It's not a matter of one being better than the other; it's a matter of choosing the appropriate tool for the job.

To "get some webpages", use requests ! 要“获取一些网页”,请使用请求

From http://docs.python-requests.org/en/latest/:

Python's standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

Things shouldn't be this way. Not in Python.

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}

Don't worry about "last updated". HTTP hasn't changed much in the last few years ;)

urllib2 is best (as it's built in), then switch to mechanize if you need cookies from Firefox. mechanize can be used as a drop-in replacement for urllib2; they have similar methods, etc. Using Firefox cookies means you can get things from sites (like, say, Stack Overflow) using your personal login credentials. Just be responsible with your number of requests (or you'll get blocked).
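
A rough sketch of that idea, assuming the Firefox cookies have been exported to a Mozilla/Netscape-format cookies.txt file (mechanize does not necessarily read them straight out of the browser profile); the file path and URL are placeholders:

# Sketch: reuse exported Firefox cookies with mechanize (Python 2 era).
import mechanize

cj = mechanize.MozillaCookieJar()
cj.load('cookies.txt', ignore_discard=True, ignore_expires=True)  # placeholder path

br = mechanize.Browser()
br.set_cookiejar(cj)                          # requests now carry your logged-in session
resp = br.open('https://stackoverflow.com/')  # placeholder: any site you are logged in to
print(resp.read()[:200])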

PycURL is for people who need all the low-level stuff in libcurl. I would try the other libraries first.

urllib2 only supports HTTP GET and POST. There may be workarounds, but if your application depends on other HTTP verbs, you will probably prefer a different module.
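
One commonly cited workaround (a sketch, not part of this answer) is to subclass urllib2.Request and override get_method() so the opener sends a different verb; the URL is a placeholder:

# Sketch: forcing another HTTP verb with urllib2 (Python 2).
import urllib2

class RequestWithMethod(urllib2.Request):
    def __init__(self, url, method, *args, **kwargs):
        self._method = method
        urllib2.Request.__init__(self, url, *args, **kwargs)

    def get_method(self):
        # urllib2 normally returns GET or POST; return our chosen verb instead.
        return self._method

req = RequestWithMethod('http://example.com/resource/1', 'DELETE')
resp = urllib2.urlopen(req)
print(resp.code)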

Take a look at Grab (http://grablib.org). It is a network library which provides two main interfaces: 1) Grab, for creating network requests and parsing the retrieved data; 2) Spider, for creating bulk site scrapers.

Under the hood, Grab uses pycurl and lxml, but it is possible to use other network transports (for example, the requests library). The requests transport is not well tested yet.

Every Python library that speaks HTTP has its own advantages.

Use the one that has the minimal set of features needed for the particular task.

Your list is missing at least urllib3, a cool third-party HTTP library which can reuse an HTTP connection, thus greatly speeding up the process of retrieving multiple URLs from the same site.
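
A minimal sketch of what that connection reuse looks like (host and paths are placeholders):

# Sketch: urllib3's PoolManager keeps connections open and reuses them per host.
import urllib3

http = urllib3.PoolManager()
for path in ('/a', '/b', '/c'):  # placeholder paths on the same host
    r = http.request('GET', 'http://example.com' + path)
    print(r.status, len(r.data))  # subsequent requests reuse the same connection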
