I'm trying my hand at some manual web scraping but am running into a problem very early on, just with connecting with a host. Here's my code:
#!/usr/bin/env python3
import urllib.request
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0"
# url = "https://www.costco.ca/"
url = "https://httpbin.org/anything"
req = urllib.request.Request(
    url, None, headers={"User-Agent": USER_AGENT, "Accept-Encoding": "application/json"}
)
with urllib.request.urlopen(req, timeout=20) as r:
    raw = r.read()
    print(raw.decode())
If I run this with httpbin as the host, it works. If I run it with the Costco URL, it times out with "socket.timeout: The read operation timed out". If I try curl instead, it works:

curl -s -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0" "https://www.costco.ca/"
I'm puzzled about what's happening. Is urllib sending something that causes the Costco site to block the request? Why does the curl command work, and how do I get the Python script using urllib to behave the way curl does?
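One way to see part of the difference without touching the network is to inspect the headers urllib will actually attach to the Request. urllib only sends what you set (plus Host and Connection, added at send time): in particular it sets no default `Accept` header, whereas curl sends `Accept: */*` by default. Note also that `"application/json"` is a media type, not an encoding token like `gzip`, so it is an unusual value for `Accept-Encoding`. A minimal offline sketch (the URL and User-Agent are taken from the question):

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0"

req = urllib.request.Request(
    "https://www.costco.ca/",
    headers={"User-Agent": USER_AGENT, "Accept-Encoding": "application/json"},
)

# header_items() lists only the explicitly set headers; urllib adds
# Host/Connection itself when the request actually goes out.
hdrs = {k.lower(): v for k, v in req.header_items()}
print(hdrs)

# Unlike curl, urllib sends no Accept header unless you add one:
assert "accept" not in hdrs
# And this Accept-Encoding value is a media type, not a codec name:
assert hdrs["accept-encoding"] == "application/json"
```

Comparing this dictionary against the headers curl sends (e.g. with `curl -v`) is a cheap first step before guessing which header the server is gating on.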
I've managed to get a result with the requests module, but not with urllib.request:
import requests
url = "https://www.costco.ca/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
t = requests.get(url, headers=headers).text
print(t)
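One behavioral difference worth knowing if you do send `Accept-Encoding: gzip` from urllib: requests transparently decompresses gzip-encoded bodies, while urllib.request hands back the raw bytes, so you must decompress them yourself before decoding. A minimal offline sketch (the HTML snippet is a stand-in for a real response body):

```python
import gzip

# Stand-in for a body a server returned with "Content-Encoding: gzip".
page = b'<!DOCTYPE html>\n<html lang="en-ca"></html>'
compressed = gzip.compress(page)

# urllib gives you `compressed` as-is; requests would already have
# decoded it. Undo the encoding manually before calling .decode():
decoded = gzip.decompress(compressed)
print(decoded.decode())
```

In a real urllib response you would check `r.headers.get("Content-Encoding")` and only decompress when it says `gzip`.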
Prints:
<!DOCTYPE html>
<html lang="en-ca">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
...