Python: urllib urlopen stuck, timeout error
As the title says, urlopen gets stuck opening the URL.
Code:
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq # Web client
page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
print(page_soup)
Problem: it hangs at uReq. However, if you replace page_url with the following link, everything works fine.
page_url= "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
Error: timeout error
How can I open the given URL for web-scraping purposes?
Answer
Some websites require a User-Agent header in order to serve a successful response. Import Request from urllib.request and modify your code as follows:
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq, Request # Web client
page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"
uClient = uReq(Request(page_url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'
}))
# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
print(page_soup)
and you should be fine.
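If a server still stalls even with the header set, you can also pass an explicit timeout to urlopen so the call raises an error instead of blocking indefinitely. A minimal sketch, assuming an illustrative 10-second timeout (the URL and User-Agent string are taken from the question; the network call is left commented out):

```python
from urllib.request import Request, urlopen

# Build the request with the User-Agent header attached up front.
req = Request(
    "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) "
                      "Gecko/20100101 Firefox/73.0"
    },
)

# urllib.request.Request stores header keys in capitalized form,
# so we look it up as "User-agent" to confirm it was set.
print(req.get_header("User-agent"))

# Network call (commented out here): urlopen accepts a timeout in
# seconds and raises URLError / socket.timeout instead of hanging.
# page_html = urlopen(req, timeout=10).read()
```

The timeout value is a judgment call: too short and slow-but-working servers fail, too long and a hung request still wastes time.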