
python requests.get always gets 404

I would like to try sending requests.get to this website:

requests.get('https://rent.591.com.tw')

and I always get

<Response [404]>

I know this is a common problem and have tried different approaches, but I still get the 404. All other websites work fine.

Any suggestions?

Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client consistently gets a different response, try to figure out what the differences are between the request that Python sends and the request the other client sends.

That means you need to:

  • Record all aspects of the working request
  • Record all aspects of the failing request
  • Try out what changes you can make to make the failing request more like the working request, and minimise those changes.

I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
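
A minimal sketch of that workflow, sending the request to httpbin.org's echo endpoint and inspecting what requests actually sent:

import requests

# httpbin.org/anything echoes the request back as JSON; compare these
# headers against what your browser (or other working client) sends.
response = requests.get('https://httpbin.org/anything')
print(response.json()['headers'])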

For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:

  • Host: this must be set to the hostname you are contacting, so that the server can properly multi-host different sites. requests sets this one.
  • Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
  • Connection: leave this to the client to manage.
  • Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did); see the sketch after this list.
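
A minimal sketch of the Session approach for the cookie point above (the login URL and form field names here are hypothetical placeholders, not this site's actual API):

import requests

# A Session persists cookies across requests, the way a browser does.
session = requests.Session()

# Hypothetical login step; the real URL and form fields depend on the site.
session.post('https://example.com/login',
             data={'username': '...', 'password': '...'})

# Later requests automatically carry any cookies the server has set.
response = session.get('https://example.com/protected-page')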

Everything else is fair game, but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.

In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting the header to almost any other value already works:

>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>

Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
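
A minimal sketch with requests-html (assuming the package is installed; the first call to render() downloads a Chromium build):

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://rent.591.com.tw')

# render() runs the page's JavaScript in headless Chromium, so the HTML
# afterwards reflects any script-driven changes.
response.html.render()
print(response.html.find('title', first=True).text)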

The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
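
A hedged sketch of fetching that endpoint directly; the server may also require cookies or extra headers set by the main page, so this visits the main page first:

import requests

session = requests.Session()
session.headers['User-Agent'] = 'Custom'  # avoid the Python user-agent block shown above

# Visit the main page first so any cookies the site sets are captured.
session.get('https://rent.591.com.tw')

# The AJAX endpoint the page itself calls.
url = ('https://rent.591.com.tw/home/search/rsList'
       '?is_new_list=1&type=1&kind=0&searchtype=1&region=1')
response = session.get(url)
print(response.status_code)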

Next, well-built sites will use security best practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and to handle cookies, or otherwise extract the extra information a server expects to be passed from one request to another.
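
A sketch of that GET-then-POST ordering, assuming a hypothetical form whose hidden CSRF field is named csrf_token (the URL and field names are illustrative only):

import requests
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

session = requests.Session()

# GET the form first; the server sets cookies and embeds a CSRF token.
page = session.get('https://example.com/form')
token = BeautifulSoup(page.text, 'html.parser').find(
    'input', {'name': 'csrf_token'})['value']

# POST back with the token so the server accepts the submission.
session.post('https://example.com/form',
             data={'csrf_token': token, 'field': 'value'})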

Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.

In my case this was due to the fact that the website address had recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)

One thing to note: I was using requests.get() to do some web scraping off of links I was reading from a file. What I didn't realise was that the links had a trailing newline character (\n) when I read each line from the file.

If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used

with open("filepath", 'w') as file:
   links = file.read().splitlines()
   for link in links:
      response = requests.get(link)
