I'm trying my hand at some manual web scraping but am running into a problem very early on, just with connecting with a host. Here's my code:
#!/usr/bin/env python3
import urllib.request
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0"
# url = "https://www.costco.ca/"
url = "https://httpbin.org/anything"
req = urllib.request.Request(
    url, None, headers={"User-Agent": USER_AGENT, "Accept-Encoding": "application/json"}
)
with urllib.request.urlopen(req, timeout=20) as r:
    raw = r.read()
    print(raw.decode())
If I run this with httpbin as the host, it works. If I run it with the Costco URL, it times out with "socket.timeout: The read operation timed out". If I try curl instead, it works:

curl -s -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0" "https://www.costco.ca/"
I'm puzzled about what's happening. Is urllib sending something that causes the Costco site to block the request? Why does the curl command work, and how do I get the Python script using urllib to behave the way curl does?
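One way to see part of the difference without touching the network is to inspect the headers urllib will actually attach to the Request. urllib only sends what you set (plus Host and Connection, added at send time): in particular it sets no default `Accept` header, whereas curl sends `Accept: */*` by default. Note also that `"application/json"` is a media type, not an encoding token like `gzip`, so it is an unusual value for `Accept-Encoding`. A minimal offline sketch (the URL and User-Agent are taken from the question):

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0"

req = urllib.request.Request(
    "https://www.costco.ca/",
    headers={"User-Agent": USER_AGENT, "Accept-Encoding": "application/json"},
)

# header_items() lists only the explicitly set headers; urllib adds
# Host/Connection itself when the request actually goes out.
hdrs = {k.lower(): v for k, v in req.header_items()}
print(hdrs)

# Unlike curl, urllib sends no Accept header unless you add one:
assert "accept" not in hdrs
# And this Accept-Encoding value is a media type, not a codec name:
assert hdrs["accept-encoding"] == "application/json"
```

Comparing this dictionary against the headers curl sends (e.g. with `curl -v`) is a cheap first step before guessing which header the server is gating on.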
I've managed to get a result with the requests module, but not with urllib.request:
import requests
url = "https://www.costco.ca/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
t = requests.get(url, headers=headers).text
print(t)
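One behavioral difference worth knowing if you do send `Accept-Encoding: gzip` from urllib: requests transparently decompresses gzip-encoded bodies, while urllib.request hands back the raw bytes, so you must decompress them yourself before decoding. A minimal offline sketch (the HTML snippet is a stand-in for a real response body):

```python
import gzip

# Stand-in for a body a server returned with "Content-Encoding: gzip".
page = b'<!DOCTYPE html>\n<html lang="en-ca"></html>'
compressed = gzip.compress(page)

# urllib gives you `compressed` as-is; requests would already have
# decoded it. Undo the encoding manually before calling .decode():
decoded = gzip.decompress(compressed)
print(decoded.decode())
```

In a real urllib response you would check `r.headers.get("Content-Encoding")` and only decompress when it says `gzip`.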
Prints:
<!DOCTYPE html>
<html lang="en-ca">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
...