
Python: Issues with httplib & requests; https seems to cause a redirect then BadStatusLine exception

I'm trying to use BeautifulSoup to scrape some information from the Discogs website that isn't available through their API. Unfortunately, I can't connect to the site via urllib2, httplib or requests without running into a BadStatusLine exception.

I believe this is because any request to http://www.discogs.com is redirected to https://www.discogs.com. I have been able to establish that a redirect is taking place by using the following code:

import requests

r_link = "http://www.discogs.com"
print("Trying " + r_link)
# allow_redirects=False returns the 301 response itself instead of following it
r = requests.get(r_link, allow_redirects=False)
print(r.status_code, r.reason, r.history, r.headers['Location'])

This returns:

Trying http://www.discogs.com
(301, 'Moved Permanently', [], 'https://www.discogs.com/')

If I'm understanding this properly, any request to http://www.discogs.com will be redirected to https://www.discogs.com. So one would think the obvious solution is to send the request to https://www.discogs.com straight away. Unfortunately, doing so with the above code (i.e. adding the s to the r_link URL) results in the BadStatusLine error:

Trying https://www.discogs.com
Traceback (most recent call last):
  File "start.py", line 26, in <module>
    r = requests.get(r_link, allow_redirects=False)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 426, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))

From the examples in the requests documentation, I should have no problem dealing with an https link. Indeed, trying the above code with https://www.google.com results in a 302 response and a successful redirection when using the URL in r.headers['Location'].

So what's the issue? Why is this happening? Is this due to a mistake I'm making? Could this be something specific to my device or setup? Is this something specific to Discogs' server? I'm at a loss as to how to diagnose this problem.

Thanks.

Add a User-Agent header and the request will work fine:

import requests

# Send a browser-like User-Agent instead of the python-requests default
h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
r_link = "https://www.discogs.com"
print("Trying " + r_link)
r = requests.get(r_link, headers=h)
print(r.status_code, r.reason, r.history, r.headers)
print(r.content)
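
As for why the bare request failed: BadStatusLine("''") means the client read an empty status line, i.e. the server closed the connection without sending any response at all. That the server does this when it sees the default python-requests User-Agent is an inference from the fix above, not something the server documents. The sketch below reproduces the symptom locally in Python 3, where httplib is named http.client. The silent local server and the helper names start_silent_server and probe are hypothetical, for illustration only.

```python
import http.client
import socket
import threading


def start_silent_server():
    """Start a local server that reads one request and hangs up without replying."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve_once():
        conn, _ = listener.accept()
        conn.recv(65536)  # consume the request
        conn.close()      # hang up without sending a status line
        listener.close()

    threading.Thread(target=serve_once, daemon=True).start()
    return port


def probe(port):
    """Send a GET; return the status code, or the BadStatusLine exception raised."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    except http.client.BadStatusLine as exc:
        return exc
    finally:
        conn.close()


result = probe(start_silent_server())
print(type(result).__name__, result)
```

In Python 3 the exception raised here is a subclass of BadStatusLine. requests wraps the same condition in the ConnectionError shown in the traceback.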

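A working example below. The session as captured omits the line that builds soup; it is created from the response body, e.g. soup = BeautifulSoup(r.content, "html.parser"), before the soup.select calls: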

In [19]: h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}


In [20]: r_link = "https://www.discogs.com"

In [21]: r = requests.get(r_link, headers=h)

In [22]: print(r.status_code, r.reason, r.history, r.headers)
(200, 'OK', [], {'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'sid=fad997b268420522ac0242de41fc694c; Domain=www.discogs.com; Expires=Sun, 19-Apr-2026 17:04:09 GMT; Path=/, language2=en; Domain=www.discogs.com; Path=/, session="9H1LFLTWiCMSowA7nKbUYlHU4N8=?"; Domain=www.discogs.com; Secure; HttpOnly; Path=/', 'Server': 'nginx/1.8.1', 'Connection': 'keep-alive', 'Date': 'Thu, 21 Apr 2016 17:04:10 GMT', 'Content-Type': 'text/html; charset=utf-8'})

In [23]: from bs4 import  BeautifulSoup

In [24]: soup.select("#email")
Out[24]: [<input autocaptialize="off" autocomplete="off" id="email" name="email" placeholder="Enter your email address" type="text"/>]

In [25]: soup.select("#username")
Out[25]: [<input autocaptialize="off" autocomplete="off" id="username" name="username" placeholder="Choose a username" type="text"/>]
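
The question also mentions urllib2 and httplib. As a rough sketch, the same header can be set with the standard library alone. This assumes Python 3, where urllib2 became urllib.request; build_request and fetch are hypothetical helper names, and fetch needs network access, so it is defined but not called here.

```python
import urllib.request

# Same browser-like User-Agent string as in the answer above.
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36")


def build_request(url, user_agent=BROWSER_UA):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Send the request and return (status code, body bytes). Needs network access."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.status, resp.read()


req = build_request("https://www.discogs.com")
# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

Calling fetch("https://www.discogs.com") would send the request; whether the server still accepts this particular User-Agent string today is not something this sketch can guarantee.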

If you want to log in:

import requests

h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

login = "https://www.discogs.com/login?return_to=%2F"
# A session keeps the login cookies for any later requests
with requests.session() as s:
    r = s.post(login, data={"username":"your_user","password":"your_pass","Action.Login":""}, headers=h)
    print(r.content)

If we run it and print r.url instead of the body, we can see that we end up at https://www.discogs.com/my:

In [27]: h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

In [28]: login = "https://www.discogs.com/login?return_to=%2F"

In [29]: with requests.session() as s:
   ....:         r = s.post(login, data={"username":"xxxxxxxx","password":"xxxxxxxx","Action.Login":""}, headers=h)
   ....:         print(r.url)
   ....:     
https://www.discogs.com/my
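
The transcript above infers a successful login from the final URL. A minimal sketch of that check as a reusable helper follows. It assumes, without confirmation from the site, that a failed login does not end up at /my; landed_on is a hypothetical name.

```python
from urllib.parse import urlparse


def landed_on(final_url, expected_path="/my"):
    """Return True if the final URL's path matches the expected post-login path."""
    actual = urlparse(final_url).path.rstrip("/")
    return actual == expected_path.rstrip("/")


# Final URL from the transcript above, then the login URL itself for contrast.
print(landed_on("https://www.discogs.com/my"))
print(landed_on("https://www.discogs.com/login?return_to=%2F"))
```

In the login code this would be called as landed_on(r.url) after the s.post(...) call.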
