
Downloading a protected file using python urllib

I am trying to download a PDF file located at http://elwatan.com/pdf/telecharger.php?dir=JOURNAL&file=20120524.pdf , but the site requires you to be logged in before you can download it. I was able to log in, but the server then redirects me to the home page, http://elwatan.com , and when I try to fetch the PDF's URL again, the download fails because it seems that I am not logged in. I think I need to use cookies, right?

If so, could you please explain how? I have never used them before.

Thanks :)

The mechanize library is very useful for situations like this. It simulates a browser, which includes filling in forms (such as login forms) and keeping state such as cookies. With it, you can log in to the site and then navigate to the PDF file. You would use something like the following code:

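import mechanize

br = mechanize.Browser()
br.open(login_url)  # load the page that contains the login form
# code to log in with br (select the form, fill in the fields, submit)
data = br.open(pdf_url).get_data()  # body of the response, as raw bytes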

You would then have to treat the returned data as the contents of a PDF file (for example, write the bytes to disk or pass them to a PDF parser), and then you can do whatever you need with it.
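One practical check before using the download: PDF files begin with the bytes %PDF-, so a response that starts with anything else is probably an HTML page (such as the login page or the home page) rather than the document. A minimal sketch of that check follows; looks_like_pdf and save_pdf are hypothetical helper names for illustration, not part of mechanize.

```python
def looks_like_pdf(data):
    """Return True if the bytes start with the PDF file signature."""
    return data.startswith(b"%PDF-")


def save_pdf(data, path):
    """Write the downloaded bytes to disk, refusing anything that is not a PDF."""
    if not looks_like_pdf(data):
        # Most likely the server sent an HTML page because the session is not logged in.
        raise ValueError("response does not look like a PDF file")
    with open(path, "wb") as out:
        out.write(data)
```

If save_pdf raises, the request probably went out without a valid session cookie.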

When you use that web application, a "session" is created for you. The session details are stored on your client in a cookie, and your client sends the cookie contents with each HTTP request. That is how the web application knows that your HTTP requests belong to the same session. Initially, you are just an unknown user within that session. After you log in, the web application knows that requests within that session come from an authorized user.
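The session mechanism described above can be reproduced with the Python 3 standard library alone, by giving urllib an opener that keeps a cookie jar. This is only a sketch under assumptions: it assumes the login is a plain form POST, and the function names are hypothetical. The login URL and the form field names have to be read from the site's actual login page.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session_opener():
    """Return an opener that stores cookies and sends them back on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_login_form(fields):
    """Encode form fields as the body of a POST request."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def login_and_fetch(opener, login_url, fields, protected_url):
    """Log in, then fetch a protected URL through the same opener."""
    # The login response sets the session cookie, which the jar keeps.
    opener.open(login_url, data=encode_login_form(fields)).close()
    # The same opener sends the cookie back, so the server sees the same session.
    with opener.open(protected_url) as response:
        return response.read()
```

The important point is that both requests go through the same opener. A fresh call to urllib.request.urlopen for the PDF would not send the cookie, and the server would treat it as a new, anonymous session.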

You have two options:

  • log in via the browser, copy the session cookie, and imitate the browser in subsequent requests made from Python
  • do everything in Python (the initial request, logging in, and retrieving the document)

Both can be a considerable amount of work (especially if you are new to these things), because you have to adjust your code to the specifics of the web application. A library like mechanize (as already mentioned by others) can save some work.
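For the first option, a rough sketch with the Python 3 standard library might look like the following. It assumes you have logged in with a normal browser and copied the session cookie's name and value from its developer tools. The helper names and the default User-Agent string are hypothetical, and the real cookie name depends on the site.

```python
import urllib.request


def build_request_with_cookie(url, cookie_name, cookie_value, user_agent="Mozilla/5.0"):
    """Build a request that carries a session cookie copied from the browser."""
    request = urllib.request.Request(url)
    # Send the cookie exactly as the browser would, as "name=value".
    request.add_header("Cookie", f"{cookie_name}={cookie_value}")
    # Some sites reject the default Python User-Agent, so present a browser-like one.
    request.add_header("User-Agent", user_agent)
    return request


def save_response(request, path):
    """Send the request and write the response body to a file."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

This only works while the copied session stays valid on the server. Once it expires, you have to log in again and copy a new cookie value, which is why the second option is usually better for anything that runs repeatedly.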


 