Scraping a website through a web proxy using Python
I am working on scraping databases that I have access to through the Duke library web proxy. I have run into the issue that, because the databases are accessed through a proxy server, I cannot scrape them directly the way I would if they did not require proxy authentication.
I tried several things:
I wrote a script that logs into the Duke network (https://shib.oit.duke.edu/idp/AuthnEngine). I then hardcode my login data:
login_data = urllib.urlencode({'j_username' : 'userxx',
'j_password' : 'passwordxx',
'Submit' : 'Enter'
})
I then log in:
resp = opener.open('https://shib.oit.duke.edu/idp/AuthnEngine', login_data)
I then create a cookie jar object to hold the cookies from the proxy website.
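The steps so far can be sketched in Python 3, where the Python 2 `urllib`/`urllib2` calls in the question were split into `urllib.request` and `urllib.parse`. The URL and form field names are taken from the question; the credentials are placeholders:

```python
# Sketch of the login flow described above, in Python 3.
import http.cookiejar
import urllib.parse
import urllib.request

# Cookie jar that the opener will use to store and resend session cookies.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

# Form fields, percent-encoded and converted to bytes as urlopen expects.
login_data = urllib.parse.urlencode({
    'j_username': 'userxx',      # placeholder username
    'j_password': 'passwordxx',  # placeholder password
    'Submit': 'Enter',
}).encode('ascii')

# The actual POST (commented out so this sketch has no side effects):
# resp = opener.open('https://shib.oit.duke.edu/idp/AuthnEngine', login_data)
```

Building the opener with `HTTPCookieProcessor` means any `Set-Cookie` headers from the login response are captured automatically and resent on later requests made through the same opener.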
Then I try to access the database with my script, and it still tells me authentication is required. I would like to know how I can get past the authentication required by the proxy server.
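If the library proxy behaves as an ordinary HTTP proxy, `urllib` can route requests through it with a `ProxyHandler`, and a `ProxyBasicAuthHandler` can answer the proxy's 407 authentication challenge. This is only a sketch: the proxy host/port and credentials below are placeholders, not the real Duke addresses.

```python
# Routing requests through an authenticating HTTP proxy with urllib.
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://proxy.lib.example.edu:8080',   # placeholder proxy address
    'https': 'http://proxy.lib.example.edu:8080',  # placeholder proxy address
})

# A default-realm password manager lets us register credentials without
# knowing the realm string the proxy advertises in its 407 response.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(
    None,                                  # None = match any realm
    'http://proxy.lib.example.edu:8080',   # placeholder proxy address
    'userxx', 'passwordxx',                # placeholder credentials
)
auth_handler = urllib.request.ProxyBasicAuthHandler(password_mgr)

opener = urllib.request.build_opener(proxy_handler, auth_handler)
# resp = opener.open('http://database.example.com/')  # no request made here
```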
If you have any suggestions, please let me know.
Thank you, Jan
A proxy login does not store cookies but instead uses the Proxy-Authorization header. This header needs to be sent with every request, much like cookies. The header has the same format as regular Basic authentication, although other schemes are possible (Digest, NTLM). I suggest you inspect the headers of a normal login and copy and paste the Proxy-Authorization header that was sent.
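Assuming the proxy accepts Basic authentication (check your browser's network inspector for the actual scheme, as the answer suggests), attaching the header by hand might look like this; the URL and credentials are placeholders:

```python
# Minimal sketch of setting a Proxy-Authorization header manually.
import base64
import urllib.request

# Basic auth encodes "user:password" in base64.
credentials = b'userxx:passwordxx'  # placeholder user:password pair
token = base64.b64encode(credentials).decode('ascii')

req = urllib.request.Request('http://database.example.com/')  # placeholder URL
req.add_header('Proxy-Authorization', 'Basic ' + token)

# resp = urllib.request.urlopen(req)  # not executed in this sketch
```

Note that base64 is an encoding, not encryption, so Basic proxy credentials are only safe over a TLS connection to the proxy.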