Python requests: how to authenticate via a proxy with POST and access a file

Question

I am trying to download a research article PDF via a university proxy that requires a login. I tried following [this answer][1], but the resulting download contains only the login page.

The article URL might look like this: https://iopscience.iop.org/article/10.3847/2041-8213/aaf743/pdf (this one happens to be open access, but others need to be accessed this way).
In the browser, I access this through a proxy: https://login.emedien.ub.my-university.edu/login?qurl=https%3a%2f%2fiopscience.iop.org%2farticle%2f10.3847%2f2041-8213%2faaf743%2fpdf . This URL is stored in the variable long_proxy in the code sample below.

In the browser, this brings up a login form:

<form action="/login" method="post">
  <input name="ezproxycsrftoken" type="hidden" value="aBcDeFgH12345"/>
  <input name="url" type="hidden" value="https://iopscience.iop.org/article/10.3847/2041-8213/aaf743/pdf">
    <table>
      <tr><td>University Username:</td><td><input name="user" style="width:250px" tabindex="1" type="text"/></td></tr>
      <tr><td>Password:</td><td><input name="pass" style="width:250px" tabindex="2" type="password"/></td></tr>
    </table>
  </input>
</form>

Upon entering the username/password, I get forwarded to
https://iopscience-iop-org.emedien.ub.uni-muenchen.de/article/10.3847/2041-8213/aaf743/pdf
which brings up the PDF in the browser. I call this URL short_proxy in the code sample below.

I try to do that with Python requests in the following way:

import bs4
import requests

user_name = 'myname'
passwd = 'mypassword'

with requests.Session() as session:

    session.headers.update({'User-Agent': 'Mozilla/5.0'})

    # Parse the input form for the hidden input
    r2      = requests.get(long_proxy)
    soup    = bs4.BeautifulSoup(r2.text, "html.parser")
    form    = soup.find('form')
    hidden  = form.find('input', attrs={'type':'hidden', 'name':'ezproxycsrftoken'}).attrs['value']
    url_res = form.find('input', attrs={'type':'hidden', 'name':'url'}).attrs['value']

    # set up the login

    payload = {
        'user': user_name,
        'pass': passwd,
        'ezproxycsrftoken': hidden,
        'url': url_res
    }

    # post login

    post = session.post(login, data=payload)

    # get data

    r3 = session.get(short_proxy)
    with open('file.pdf', 'wb') as fid:
        fid.write(r3.content)

However, the downloaded file is not actually a PDF; it turns out to be the HTML code of the login page.

Any ideas how to get the PDF?

  [1]: https://stackoverflow.com/questions/37816565/python-authentication-with-requests-library-via-post

Answer 1

You're using requests.Session() in order to keep the cookies/session that the website gives you, yet you're using requests.get() instead of session.get() for the initial request where you fetch long_proxy. The cookies set by that first response are therefore never stored in the session. Changing your

r2      = requests.get(long_proxy)

to

r2      = session.get(long_proxy)

should fix your issue. I cannot verify this, however.
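For reference, here is a minimal sketch of the whole flow with that change applied. The names `fetch_pdf` and `HiddenInputs` are made up for this illustration, the hidden fields are read with the standard library's `html.parser` rather than BeautifulSoup to keep the sketch dependency-free, and the `%PDF` check at the end is an added sanity test that is not part of the original code. It has not been run against a real proxy.

```python
from html.parser import HTMLParser


class HiddenInputs(HTMLParser):
    """Collect the name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""


def fetch_pdf(session, long_proxy, login_url, short_proxy, user, password):
    """Log in through the proxy and return the PDF bytes.

    `session` is assumed to behave like requests.Session.
    """
    # Use the session for the very first request too, so that cookies set
    # while serving the login form are sent back with the POST.
    form_page = session.get(long_proxy)

    parser = HiddenInputs()
    parser.feed(form_page.text)

    # Hidden fields from the form (ezproxycsrftoken, url) plus the credentials.
    payload = dict(parser.fields)
    payload["user"] = user
    payload["pass"] = password

    session.post(login_url, data=payload)

    response = session.get(short_proxy)
    # A PDF file starts with the bytes "%PDF"; a login page does not.
    if not response.content.startswith(b"%PDF"):
        raise ValueError("proxy did not return a PDF (possibly the login page again)")
    return response.content
```

With requests this would be called as `fetch_pdf(requests.Session(), long_proxy, login, short_proxy, user_name, passwd)`, assuming `login` holds the URL the form posts to. Raising on a non-PDF response makes a failed login visible immediately instead of silently writing HTML into file.pdf.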

Also note that your long_proxy

https://login.emedien.ub.uni-muenchen.de/login?qurl=https://iopscience.iop.org/article/10.3847/2041-8213/aaf743/

is simply the login URL followed by the PDF URL, so you don't really have to fetch it.
This could save you some extra requests and execution time.
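To illustrate that structure, here is a small sketch that builds the proxied login URL from its two parts. The helper name `build_long_proxy` is hypothetical. Whether the proxy accepts a login POST without the ezproxycsrftoken value taken from the fetched form is an open assumption that has not been checked here.

```python
from urllib.parse import quote


def build_long_proxy(login_url, target_url):
    """Return the login URL with the target URL appended as the qurl parameter."""
    # safe="" also percent-encodes ":" and "/", matching the browser URL
    # from the question (quote() emits uppercase hex digits, which is equivalent).
    return f"{login_url}?qurl={quote(target_url, safe='')}"


long_proxy = build_long_proxy(
    "https://login.emedien.ub.my-university.edu/login",
    "https://iopscience.iop.org/article/10.3847/2041-8213/aaf743/pdf",
)
print(long_proxy)
```

The same helper works for any other article URL, so only the target URL has to change between downloads.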

Answer 1: accepted, 2020-02-11 14:14:05