简体   繁体   中英

how can set headers (user-agent), retrieve a web page, capture redirects and accept cookies?

import urllib.request
url="http://espn.com"
f = urllib.request.urlopen(url)
contents = f.read().decode('latin-1')
q = f.geturl()
print(q)

This code will return http://espn.go.com/ , which is what I want -- a redirect web site URL. After looking at the Python documentation, googling, etc., I can't figure out how to also:

  1. Capture redirected web site URL (already working)
  2. Change the user-agent on the out going request
  3. Accept any cookies the web page might want to send back

How can I do this in Python 3? If there is a better module than urllib , I am OK with this.

There is a better module, it's called requests :

import requests

session = requests.Session()
session.headers['User-Agent'] = 'My-requests-agent/0.1'

resp = session.get(url)
contents = resp.text  # If the server said it's latin 1, this'll be unicode (ready decoded)
print(resp.url)       # final URL, after redirects.

requests follows redirects (check resp.history to see what redirects it followed). By using a session (optional), cookies are stored and passed on to subsequent requests. You can set headers per request or per session (so the same extra headers will be sent with every request sent out for that session).

A simple demo using urllib (python3):

#!/usr/bin/env python3
#-*- coding:utf-8 -*-

import os.path
import urllib.request
from urllib.parse import urlencode
from http.cookiejar import CookieJar,MozillaCookieJar

cj = MozillaCookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)

cookie_file=os.path.abspath('./cookies.txt')

def load_cookies(cj,cookie_file):
    cj.load(cookie_file)
def save_cookies(cj,cookie_file):
    cj.save(cookie_file,ignore_discard=True,ignore_expires=True)

def dorequest(url,cj=None,data=None,timeout=10,encoding='UTF-8'):
    data = urlencode(data).encode(encoding) if data else None

    request = urllib.request.Request(url)
    request.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
    f = urllib.request.urlopen(request,data,timeout=timeout)
    return f.read()

def dopost(url,cj=None,data=None,timeout=10,encoding='UTF-8'):
    body = dorequest(url,cj,data,timeout,encoding)
    return body.decode(encoding)

You should check the headers if redirecting happening(30x).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM