I am trying to unshorten a list of roughly 150,000 t.co links. My code works for the most part, but a number of the t.co links all redirect to the same site, and for those links requests fails with too many redirects.
import requests

def expand_url(url):
    s = requests.Session()
    try:
        r = s.head(url.rstrip(), allow_redirects=True, verify=False)
        return r.url.rstrip()
    except requests.exceptions.ConnectionError as e:
        print(e)
I tried setting a browser User-Agent with the line
s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
as suggested in another thread. I also tried increasing the maximum number of redirects, but that didn't really help.
Here are some of the t.co links that are causing the issue:
https://t dot co/5FXvHY1Rbx
https://t dot co/L3Ytnz2916
Any suggestions on what to do?
Thanks
Set max_redirects on the session to the largest number of redirects you are willing to follow.
http://docs.python-requests.org/en/master/api/#requests.Session.max_redirects
s = requests.Session()
s.max_redirects = 3
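When the limit is hit, requests raises TooManyRedirects rather than returning a response. As far as I know, the exception carries the last response that was received, so a sketch along these lines can still recover the URL at which the redirects stopped making progress. The function name expand_url_capped, the timeout value, and passing the session in as a parameter are my own choices for illustration, not something from the question.

```python
import requests


def expand_url_capped(session, url, timeout=10):
    """Expand a short URL; on a redirect loop, return the last URL seen."""
    try:
        r = session.head(url.strip(), allow_redirects=True, timeout=timeout)
        return r.url
    except requests.exceptions.TooManyRedirects as e:
        # The exception normally holds the last response received.
        # Guard against None in case a given requests version does not set it.
        return e.response.url if e.response is not None else None
    except requests.exceptions.RequestException as e:
        # Connection errors, timeouts, invalid URLs, and so on.
        print(e)
        return None
```

Create one session, set s.max_redirects on it, and reuse it for all links, so the connection pool is shared across calls.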
You end up in an endless loop because WH does not support the HEAD method: it keeps answering every HEAD request with 302 Moved Temporarily. The redirect you actually care about (from the short URL to WH) has already completed by then. Switch to GET and use r.history to see every intermediate response:
import requests

def expand_url(url):
    s = requests.Session()
    try:
        # GET instead of HEAD; allow_redirects is a boolean, so pass True
        r = s.get(url.rstrip(), allow_redirects=True, verify=False)
        # r.history holds the intermediate redirect responses, oldest first
        print([resp.url for resp in r.history])
        return r.url.rstrip()
    except requests.exceptions.ConnectionError as e:
        print(e)

print(expand_url("https://t<dot>co/5FXvHY1Rbx"))
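One drawback of GET is that it downloads the final page, which adds up over 150,000 links. A possible variation, sketched below, passes stream=True so the body of the final response is not fetched, and closes the response straight away. The name expand_url_get, the timeout, and the (final_url, hops) return shape are assumptions of mine.

```python
def expand_url_get(session, url, timeout=10):
    """Expand a URL with GET and return (final_url, intermediate_urls)."""
    # stream=True defers downloading the body of the final response.
    r = session.get(url.strip(), allow_redirects=True, stream=True, timeout=timeout)
    try:
        return r.url, [resp.url for resp in r.history]
    finally:
        # The body is never read, so release the connection explicitly.
        r.close()
```

Here session is expected to be a requests.Session() that you create once and reuse.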
You can also implement your own redirect limit and follow the redirects one hop at a time:
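import requests
from urllib.parse import urljoin

def expand_url(url, times):
    # Stop once the redirect budget is used up
    times -= 1
    if not times:
        return url
    s = requests.Session()
    try:
        # Session.head() does not follow redirects unless told to
        r = s.head(url.rstrip(), verify=False)
        location = r.headers.get("location")
        if not location:
            # No redirect: this is the final URL
            return url
        # Location may be relative, so resolve it against the current URL
        location = urljoin(url, location.strip())
        if location == url:
            # The server redirects to the same page
            return url
        return expand_url(location, times)
    except requests.exceptions.ConnectionError as e:
        print(e)

print(expand_url("https://t<dot>co/5FXvHY1Rbx", 4))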
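The same idea can also be written as a loop that remembers every URL it has visited, which catches longer cycles (A to B to A) and not only a page redirecting to itself. This is only a sketch: head_location and follow_redirects are hypothetical names, the timeout is arbitrary, and the lookup of a single hop is passed in as a function so the loop can be tried out without touching the network.

```python
from urllib.parse import urljoin


def head_location(session, url, timeout=10):
    """Return the absolute redirect target of one HEAD request, or None."""
    # allow_redirects=False keeps this to a single hop.
    r = session.head(url, allow_redirects=False, timeout=timeout)
    location = r.headers.get("Location")
    # Location may be relative; resolve it against the requested URL.
    return urljoin(url, location) if location else None


def follow_redirects(url, get_location, max_hops=10):
    """Follow redirects hop by hop and return the list of URLs visited."""
    chain = [url.strip()]
    seen = {chain[0]}
    for _ in range(max_hops):
        target = get_location(chain[-1])
        if not target or target in seen:
            # Either there is no redirect, or the server sent us back to
            # a URL we already visited (a redirect loop): stop here.
            break
        chain.append(target)
        seen.add(target)
    return chain
```

With requests this would be called as follow_redirects(url, lambda u: head_location(s, u)), where s is a requests.Session(); the last element of the returned list is the expanded URL.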