Python: urllib.error.HTTPError: HTTP Error 404: Not Found

Question

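I wrote a script that finds spelling mistakes in the titles of SO questions. I used it for about a month and it worked fine.

But now, when I try to run it, I get this: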

Traceback (most recent call last):
  File "copyeditor.py", line 32, in <module>
    find_bad_qn(i)
  File "copyeditor.py", line 15, in find_bad_qn
    html = urlopen(url)
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Here is my code:

import json
from urllib.request import urlopen
from bs4 import BeautifulSoup
from enchant import DictWithPWL
from enchant.checker import SpellChecker

my_dict = DictWithPWL("en_US", pwl="terms.dict")
chkr = SpellChecker(lang=my_dict)
result = []


def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html5lib")
    que = bsObj.find_all("div", class_="question-summary")
    for div in que:
        link = div.a.get('href')
        name = div.a.text
        chkr.set_text(name.lower())
        list1 = []
        for err in chkr:
            list1.append(chkr.word)
        if (len(list1) > 1):
            str1 = ' '.join(list1)
            result.append({'link': link, 'name': name, 'words': str1})


print("Please Wait.. it will take some time")
for i in range(298314,298346):
    find_bad_qn(i)
for qn in result:
    qn['link'] = "https://stackoverflow.com" + qn['link']
for qn in result:
    print(qn['link'], " Error Words:", qn['words'])
    url = qn['link']

Update

This is the URL that causes the problem, even though the URL exists.

https://stackoverflow.com/questions?page=298314&sort=active

I tried changing the range to some lower values, and it works fine now.

Why does this happen with the URL above?

Answer 1

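Apparently the default number of questions shown per page is 50, so the range you defined in the loop goes beyond the number of pages available at 50 questions per page. The range should be adjusted to fall within the total number of pages at 50 questions per page.

The following code catches the 404 error, which is the error you are getting, and ignores it when you go out of range.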

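from urllib.error import HTTPError
from urllib.request import urlopen

def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    try:
        urlopen(url)
    except HTTPError as err:
        # Ignore only 404 (page out of range); re-raise anything else.
        if err.code != 404:
            raise

print("Please Wait.. it will take some time")
for i in range(298314, 298346):
    find_bad_qn(i)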
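
The page arithmetic behind this answer can be sketched roughly as follows. This is only an illustration: `last_page` and `clamp_page_range` are hypothetical helper names, and the total question count is a made-up number chosen to make the arithmetic easy to follow, not a figure taken from the site.

```python
import math

QUESTIONS_PER_PAGE = 50  # default page size mentioned in this answer


def last_page(total_questions, per_page=QUESTIONS_PER_PAGE):
    """Return the number of the last page that still contains questions."""
    return max(1, math.ceil(total_questions / per_page))


def clamp_page_range(start, stop, total_questions, per_page=QUESTIONS_PER_PAGE):
    """Trim range(start, stop) so that it never runs past the last page."""
    end = min(stop, last_page(total_questions, per_page) + 1)
    return range(start, max(start, end))


# Hypothetical total, chosen only to illustrate the arithmetic.
pages = clamp_page_range(298314, 298346, total_questions=14915800)
print(pages)  # range(298314, 298317)
```

With that assumed total, only pages 298314 to 298316 exist, so the loop would request three pages instead of running on into 404 responses.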

Answer 2

I had exactly the same problem. The URL I wanted to fetch with urllib existed and was accessible from a normal browser, but urllib reported a 404.

My solution was to not use urllib:

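import requests

# URL taken from the question, used here only as an example.
url = "https://stackoverflow.com/questions?page=298314&sort=active"
response = requests.get(url)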

This worked for me.

Answer 3

The default User-Agent does not seem to have as much access as Mozilla's.

Try importing Request and passing headers={'User-Agent': 'Mozilla/5.0'} as an extra argument after your URL.

For example:

from urllib.request import Request, urlopen

# "a" is the page number, as in find_bad_qn(a) from the question.
url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req)
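
To see what this changes without sending anything over the network, here is a small sketch. `make_request` is a hypothetical helper name. By default urllib identifies itself as "Python-urllib/<version>", and the sketch only builds the request and inspects its headers; it does not show how any particular site reacts to them.

```python
from urllib.request import Request, build_opener

# The default opener identifies itself as "Python-urllib/<version>".
default_agent = dict(build_opener().addheaders)["User-agent"]


def make_request(page, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request for one question-list page."""
    url = "https://stackoverflow.com/questions?page=" + str(page) + "&sort=active"
    return Request(url, headers={"User-Agent": user_agent})


req = make_request(1)
print(default_agent)                 # starts with "Python-urllib/"
print(req.get_header("User-agent"))  # Mozilla/5.0
```

Note that Request stores header names capitalized as "User-agent", which is why the lookup uses that spelling.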

Answer 4

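This happens because the URL does not exist, so re-check your URL. I ran into the same problem; when I re-checked, I found that my URL was incorrect, and I changed it.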

Answer 5

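Check by clicking the link. It may be present in your code, which means nothing is wrong with the code itself, but the link or site does not actually exist and cannot be found.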
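
If you would rather check the link from Python than by clicking it, something like the sketch below might help. `check_url` is a hypothetical helper, and the `opener` parameter is an assumption added only so the function can be tried without network access. It separates "the server answered with an error status" from "the site could not be reached at all".

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def check_url(url, opener=urlopen):
    """Return a short description of what happened when fetching url."""
    try:
        with opener(url) as response:
            return "OK ({})".format(response.status)
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        return "HTTP {}: {}".format(err.code, err.reason)
    except URLError as err:
        # No HTTP answer at all (DNS failure, refused connection, ...).
        return "unreachable: {}".format(err.reason)
```

HTTPError is a subclass of URLError, so it has to be caught first. Calling check_url with the failing URL from the question would then report the status the server actually sent.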

5 Solutions

Solution 1
4 Accepted 2017-02-24 14:41:13

Solution 2
4 2018-12-29 15:02:17

Solution 3
1 2020-04-13 04:10:50

Solution 4
1 2021-06-15 10:06:46

Solution 5
-1 2023-01-01 14:24:45
