
Access Denied 403 when webscraping; What to do?

I was testing a scraping algorithm that I had built. I made a request to https://www2.hm.com/fi_fi/miesten.html but misspecified the user-agent information. It seems that this triggered an immediate ban (I am not sure). Scraping their site should be fine - their robots.txt says: User-agent: * Disallow:

Example of making a request to HM and the subsequent server response

I erased the user agent and proxy information due to privacy concerns. However, they are nothing out of the ordinary.

I receive the following response:

"b'\\nAccess Denied\\n\\n

\\n \\nYou don't have permission to access "http://www2.hm.com/fi_fi/miesten.html" on this server.

\\nReference #18.2796ef50.1625728417.f9aab80\\n\\n\\n'"

So my question is: is there anything that I can do to lift this ban? Can I contact someone on their end and ask them to lift it? If so, where can their contact information usually be found? Although this question concerns this site in particular, it is really a much broader question: in the case of a ban, can the user try to contact someone responsible for the server? I thought about contacting customer support, but I strongly suspect that they cannot help with this issue and won't even understand what it is about.

I have googled this issue, but have not found anything helpful. The usual advice is to clear the cache, memory, etc. That is not the problem here. I can access the site via Chrome or other browsers, but when using requests via Python, this problem appears.

I am pretty sure you need a scraping bot that can execute JavaScript. You can try this tool: https://docs.python-requests.org/projects/requests-html/en/latest/
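Before switching tools, it may be worth checking whether the block depends on the request headers alone, since the question mentions a misspecified user agent. The following is only a minimal sketch of that check. It uses the standard library's urllib.request instead of requests so that it has no dependencies. The user-agent string, the extra headers, and the helper names build_request and fetch_status are illustrative assumptions, not values from the question. A site's bot detection may look at more than headers, so a 403 can persist even with browser-like values.

```python
import urllib.error
import urllib.request

# Assumed example value: copy the real string from your own browser instead.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/91.0.4472.124 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a GET request carrying browser-like headers (nothing is sent yet)."""
    headers = {
        "User-Agent": user_agent,
        # Assumed header values, chosen to resemble a desktop browser.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "fi-FI,fi;q=0.9,en;q=0.8",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_status(url, timeout=10.0):
    """Send the request and return (status_code, body), including for HTTP errors."""
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        # A 403 arrives here; the body usually holds the "Access Denied" page.
        return error.code, error.read()
```

Calling fetch_status("https://www2.hm.com/fi_fi/miesten.html") once with these headers and once with the original ones shows whether the headers make a difference. If both calls return 403, the block is probably tied to something else, such as the IP address or JavaScript-based checks.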

And to get contact information for the owner of a website, you can use the Unix whois command:

whois hm.com
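If the whois command is not installed, the same lookup can be sketched in Python, because the WHOIS protocol is just a line of text sent over TCP port 43. This is a rough sketch under assumptions: whois.verisign-grs.com is the registry server for .com domains, and the helper names build_query and whois are made up for illustration. Registry records often redact the registrant's contact details, so the output may only point to the registrar or an abuse address.

```python
import socket

WHOIS_PORT = 43  # standard WHOIS port


def build_query(domain):
    """Return the bytes of a WHOIS query: the domain followed by CRLF."""
    return domain.strip().lower().encode("ascii") + b"\r\n"


def whois(domain, server="whois.verisign-grs.com", timeout=10.0):
    """Query a WHOIS server and return its raw text reply.

    The default server is an assumption that only fits .com domains.
    """
    with socket.create_connection((server, WHOIS_PORT), timeout=timeout) as conn:
        conn.sendall(build_query(domain))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Printing whois("hm.com") should show roughly what the command-line tool prints for the registry record.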
