Access Denied 403 when webscraping; what to do?
I was testing a scraping algorithm that I had built. I made a request to https://www2.hm.com/fi_fi/miesten.html but misspecified the user-agent information. This seems to have triggered an immediate ban (I'm not sure). Scraping their site should be fine; their robots.txt says: User-agent: * Disallow:
Example of making a request to HM and the subsequent server response
I erased the user agent and proxy information due to privacy concerns. However, they are nothing out of the ordinary.
I receive the following as the response:
"b'\\nAccess Denied\\n\\n
\\nReference #18.2796ef50.1625728417.f9aab80\\n\\n\\n'"
So my question is: is there anything I can do to lift this ban? Can I contact someone on their end and ask them to lift it? If so, where can this contact information usually be found? Although this question concerns this site in particular, it is really a much broader one: in the case of a ban, can the user try to contact someone on the server side? I thought about contacting customer support, but I strongly suspect they cannot help with this issue and won't even understand what it is about.
I have googled this issue but have not found anything helpful. The usual advice is to clear the cache, memory, etc., which is not the problem here. I can access the site via Chrome or other browsers, but the problem appears when I use requests from Python.
Pretty sure you need to use a JavaScript scraping bot. You can try this tool: https://docs.python-requests.org/projects/requests-html/en/latest/
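Before switching tools, it may be worth ruling out the request headers themselves, since the site loads fine in Chrome. Below is a minimal sketch using only the standard library. It assumes the block is keyed on headers such as the user agent, which is not confirmed here. The header values and the helper names (build_request, looks_denied, fetch) are made up for illustration.

```python
import urllib.error
import urllib.request

# Hypothetical browser-like headers; the exact values are assumptions,
# not anything the site documents.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/91.0.4472.124 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "fi-FI,fi;q=0.9,en;q=0.8",
}


def build_request(url, headers=None):
    """Return a urllib Request carrying browser-like headers."""
    merged = dict(BROWSER_HEADERS)
    merged.update(headers or {})  # caller-supplied headers win
    return urllib.request.Request(url, headers=merged)


def looks_denied(status, body):
    """Heuristic: HTTP 403, or an 'Access Denied' page in the body."""
    return status == 403 or b"Access Denied" in body


def fetch(url, timeout=10):
    """Fetch url and return (status, body); HTTP errors are returned, not raised."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the error object still carries the body.
        return err.code, err.read()
```

Calling fetch("https://www2.hm.com/fi_fi/miesten.html") and passing the result to looks_denied shows whether the same denial comes back with these headers. If it does, the block is probably tied to something else (IP address, TLS fingerprint, missing JavaScript), and a rendering tool like the one above is the next thing to try.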
And to get contact information about the owner of a website, you can use the Unix whois command:
whois hm.com
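If you do end up writing to whoever whois turns up, it probably helps to quote the reference ID from the denied page. I'm assuming here that the operator can look the request up by that ID. A small sketch for pulling it out of a response body follows. The function name extract_reference is made up, and the pattern is only based on the response shown in the question.

```python
import html
import re

# Pattern guessed from the response in the question: "Reference #<id>".
_REFERENCE_RE = re.compile(r"Reference\s*#\s*([0-9a-fA-F.]+)")


def extract_reference(body):
    """Return the reference ID from an 'Access Denied' page, or None."""
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    # Decode HTML entities first, in case the '#' arrives as '&#35;'.
    match = _REFERENCE_RE.search(html.unescape(body))
    return match.group(1) if match else None
```

For the response in the question this gives 18.2796ef50.1625728417.f9aab80.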