
Access Denied 403 when web scraping: what to do?

Original: 2021-07-08 07:30:37 | Tags: python, web-scraping, server, http-headers, http-status-code-403

I was testing a scraping algorithm that I had built. I made a request to https://www2.hm.com/fi_fi/miesten.html but misspecified the User-Agent header. This seems to have triggered an immediate ban, though I am not sure. Scraping their site should be fine, since their robots.txt says:

User-agent: *
Disallow:
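To double-check how those two robots.txt lines are interpreted, here is a minimal sketch using Python's standard `urllib.robotparser`. It parses only the two lines quoted above rather than downloading the live file, so it assumes the quote is complete; the helper name `is_allowed` is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines quoted in the question (assumed to be the full rule set).
ROBOTS_LINES = ["User-agent: *", "Disallow:"]


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the quoted robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)  # parse the rules from memory, no network access
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://www2.hm.com/fi_fi/miesten.html"))
```

An empty `Disallow:` value places no restriction on crawlers. Note that robots.txt only states the site's crawl policy; it does not control whether the server blocks a particular request.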

Example of making a request to HM and the subsequent server response

I erased the user agent and proxy information due to privacy concerns. However, they are nothing out of the ordinary.

I receive the following response:

"b'\\nAccess Denied\\n\\n

\\n \\nYou don't have permission to access "http://www2.hm.com/fi_fi/miesten.html" on this server.

\\nReference #18.2796ef50.1625728417.f9aab80\\n\\n\\n'"
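If the site's operators are contacted about the block, the reference string in this response is presumably what they would need to locate the request in their logs. The following is a small sketch for pulling it out of a response body programmatically; the function name `extract_reference` and the regular expression are assumptions based only on the single sample above.

```python
import re
from typing import Optional


def extract_reference(body: bytes) -> Optional[str]:
    """Return the identifier after 'Reference #' in an Access Denied page, if any."""
    text = body.decode("utf-8", errors="replace")
    if "Access Denied" not in text:
        return None
    # Pattern inferred from the one sample response: hex digits separated by dots.
    match = re.search(r"Reference\s+#([0-9a-fA-F.]+)", text)
    return match.group(1) if match else None


# Sample body modeled on the response shown in the question.
sample = (
    b"\nAccess Denied\n\n\n \nYou don't have permission to access "
    b'"http://www2.hm.com/fi_fi/miesten.html" on this server.\n'
    b"\nReference #18.2796ef50.1625728417.f9aab80\n\n\n"
)
print(extract_reference(sample))
```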

So my question is: is there anything I can do to lift this ban? Can I contact someone on their end and ask them to lift it? If so, where can the contact information usually be found? Although this question concerns this site in particular, it is really a much broader one: in the case of a ban, can the user try to contact someone responsible for the server? I thought about contacting customer support, but I strongly suspect that they cannot help with this issue and won't even understand what it is about.

I have googled this issue but have not found anything helpful. The usual advice is to clear the cache, memory, etc., which is not the problem here. I can access the site via Chrome or other browsers, but the problem appears when I use requests from Python.
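Since a browser gets through while the script does not, one thing to compare is the set of headers each one sends. The sketch below uses the standard-library `urllib.request` rather than requests so that it has no third-party dependency; the header values are illustrative assumptions (ideally copied from the browser's developer tools), and `build_request` and `fetch_status` are hypothetical helper names. Sending browser-like headers is not guaranteed to lift a block, which may depend on other factors such as the IP address.

```python
import urllib.error
import urllib.request

URL = "https://www2.hm.com/fi_fi/miesten.html"

# Illustrative values only (assumption): replace them with the headers your own browser sends.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "fi-FI,fi;q=0.9,en;q=0.8",
}


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying the browser-like headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code, including errors such as 403."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is still available.
        return err.code
```

Calling `fetch_status(URL)` sends the request and returns the status code. With requests, the same dictionary can be passed as `requests.get(URL, headers=BROWSER_HEADERS)`.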