
urllib.request.urlopen succeeds for some URLs and times out for others

I'm trying my hand at some manual web scraping but am running into a problem very early on, just connecting to a host. Here's my code:

#!/usr/bin/env python3

import urllib.request

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0"

# url = "https://www.costco.ca/"
url = "https://httpbin.org/anything"

req = urllib.request.Request(
    url, None, headers={"User-Agent": USER_AGENT, "Accept-Encoding": "application/json"}
)

with urllib.request.urlopen(req, timeout=20) as r:
    raw = r.read()
    print(raw.decode())

If I run this with httpbin as the host, it works.

If I run it with the Costco URL, it times out with "socket.timeout: The read operation timed out".

If I try curl instead, with curl -s -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0" "https://www.costco.ca/", it works.

I'm puzzled about what's happening. Is urllib sending something that causes the Costco site to block the request? Why does the curl command work, and how do I get the Python script using urllib to behave as curl does?
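One way to approach the "what is urllib sending?" part is to look at the headers directly. The following is a minimal sketch, not a diagnosis: build_request, explicit_headers and echoed_headers are hypothetical helper names, and echoed_headers assumes an echo endpoint such as https://httpbin.org/anything that replies with a JSON description of the request it received. Note also that "application/json" in the script above is a media type, which normally goes in Accept rather than Accept-Encoding; whether that matters to the Costco server is unknown.

```python
import json
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) "
    "Gecko/20100101 Firefox/89.0"
)


def build_request(url, extra_headers=None):
    """Build a Request carrying the User-Agent plus any extra headers."""
    headers = {"User-Agent": USER_AGENT}
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


def explicit_headers(req):
    """Return the headers set on the Request object itself.

    urllib adds more headers (such as Host and Connection) only when the
    request is sent, so those do not appear here.
    """
    return dict(req.header_items())


def echoed_headers(req, timeout=20):
    """Send req to an echo endpoint and return the headers it reports.

    Assumption: the URL behaves like https://httpbin.org/anything and
    answers with JSON containing a "headers" object.
    """
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.load(r)["headers"]
```

Calling echoed_headers(build_request("https://httpbin.org/anything")) and comparing the result with the output of the same curl command pointed at https://httpbin.org/anything should show which headers differ between the two clients. urllib stores header names capitalized, so the explicit User-Agent appears under the key "User-agent".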

I've managed to get a result with the requests module, but not with urllib.request:

import requests

url = "https://www.costco.ca/"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}

t = requests.get(url, headers=headers).text
print(t)

Prints:

<!DOCTYPE html>
<html lang="en-ca">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">

...
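By default, requests adds an Accept: */* header and an Accept-Encoding header that includes gzip, while urllib does not add an Accept header on its own. That is one difference worth testing, not a confirmed cause of the timeout. A minimal sketch of a urllib call that sends similar headers and decompresses a gzip response follows; fetch and decode_body are hypothetical helper names.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress raw if the server marked it as gzip, then decode to text."""
    if (content_encoding or "").strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset, errors="replace")


def fetch(url, user_agent, timeout=20):
    """Fetch url with headers resembling the defaults that requests sends."""
    req = urllib.request.Request(
        url,
        headers={
            "User-Agent": user_agent,
            "Accept": "*/*",
            # Only gzip is advertised because only gzip is handled below.
            "Accept-Encoding": "gzip",
        },
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        charset = r.headers.get_content_charset() or "utf-8"
        return decode_body(r.read(), r.headers.get("Content-Encoding"), charset)
```

If fetch("https://www.costco.ca/", user_agent) still times out with these headers, the difference probably lies somewhere other than Accept and Accept-Encoding.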
