
How to stop the NodeJS "Request" module changing the request when using a proxy

Sorry if this comes off as confusing.

I have written a script using the NodeJS request module that performs a function on a website and then returns the data. This script works perfectly fine when I do not use a proxy, that is, when I set the proxy option to false. This is a task that cannot be done with Selenium/puppeteer.

proxy: false

However, when I set a (working) proxy, it fails to perform the same task and is detected by the website's firewall/antibot software.

proxy: http://xx.xxx.xx.xx:3128

Some things to note:

  • I have tried many (20+) different proxy providers (residential and datacenter) and they all have this issue
  • The issue does not occur if that proxy is set globally on my system
  • The issue does not occur if that proxy is set in a Chrome extension
  • The SSL cipher suites do not match Chrome's, but they also don't match when not using a proxy, so I assume that isn't the issue
  • It is very important to keep the header order consistent

The question basically is: does the request module change anything, such as the header order, when using a proxy?

Here is an image of what happens when it passes/fails.

The only difference that causes this to fail is the proxy: one request is made with it, the other without.

url    : url,
simple : false,
forever: true,
resolveWithFullResponse: true,
gzip: true,
headers: {
    'Host'             : 'www.sitename.com',
    'Connection'       : 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent'       : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
    'Accept'           : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-encoding'  : 'gzip, deflate, br',
    'Accept-Language'  : 'en-GB,en-US;q=0.9,en;q=0.8',
},
method : 'GET',
jar: globalJar,
followRedirect: false,
followAllRedirects: false,

According to the proxies documentation of the request module:

By default, when proxying http traffic, request will simply make a standard proxied http request. This is done by making the url section of the initial line of the request a fully qualified url to the endpoint.

Instead, you can use an http tunnel by setting:

tunnel : true

in the request module proxy settings.

It could be that in your case you are making a standard proxied http request, whereas an http tunnel is created when the proxy is set globally on your system or in a Chrome extension.
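The difference between the two modes shows up in the very first line written to the proxy. The sketch below is illustrative only: `initialRequestLine` is a hypothetical helper, not part of the request module, and the options object reuses the placeholder host and proxy address from the question.

```javascript
// Sketch: the first line a client writes to the proxy in each mode.
function initialRequestLine(targetUrl, tunnel) {
  const url = new URL(targetUrl);
  if (tunnel) {
    // Tunnel: ask the proxy to open a raw connection to host:port.
    const port = url.port || (url.protocol === 'https:' ? '443' : '80');
    return `CONNECT ${url.hostname}:${port} HTTP/1.1`;
  }
  // Standard proxied request: the fully qualified URL goes in the request line.
  return `GET ${url.href} HTTP/1.1`;
}

// Options as they might be passed to the request module; the host and proxy
// address are the placeholders from the question.
const tunnelledOptions = {
  url: 'http://www.sitename.com/',
  proxy: 'http://xx.xxx.xx.xx:3128',
  tunnel: true, // always tunnel, even for plain http destinations
};

console.log(initialRequestLine(tunnelledOptions.url, false));
// GET http://www.sitename.com/ HTTP/1.1
console.log(initialRequestLine(tunnelledOptions.url, true));
// CONNECT www.sitename.com:80 HTTP/1.1
```

With a tunnel, the proxy only relays bytes after the CONNECT succeeds, so the request the endpoint receives is the one the client wrote.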

From the documentation:

Note that, when using a tunneling proxy, the proxy-authorization header and any headers from custom proxyHeaderExclusiveList are never sent to the endpoint server, but only to the proxy server.

After deactivating my old account, I wanted to come back and give an actual answer to this question now that I fully understand it. What I was asking a year ago was not possible: the antibot was fingerprinting me through the TLS ClientHello (and even slightly at the TCP/frame level).
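To give a sense of how a ClientHello can be turned into a fingerprint, here is a minimal sketch of a JA3-style hash. JA3 is one publicly documented scheme; whether the antibot in question used it is not known. The field values are made up for illustration, and `ja3Hash` is a hypothetical helper name.

```javascript
const crypto = require('crypto');

// JA3 joins five ClientHello fields with commas; values inside a field are
// decimal numbers joined with dashes. The resulting string is hashed with MD5.
function ja3Hash({ version, ciphers, extensions, curves, pointFormats }) {
  const fields = [
    String(version),
    ciphers.join('-'),
    extensions.join('-'),
    curves.join('-'),
    pointFormats.join('-'),
  ];
  return crypto.createHash('md5').update(fields.join(',')).digest('hex');
}

// Made-up values for illustration, not a real browser's ClientHello.
const helloA = {
  version: 771,
  ciphers: [4865, 4866, 49195],
  extensions: [0, 10, 11],
  curves: [29, 23],
  pointFormats: [0],
};
// Same cipher suites in a different order: still a different fingerprint.
const helloB = { ...helloA, ciphers: [49195, 4865, 4866] };

console.log(ja3Hash(helloA) === ja3Hash(helloB)); // false
```

Nothing in the hash depends on HTTP headers, which is why matching Chrome's header order alone does not change this kind of fingerprint.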

To start, I wrote a wrapper called request-curl, which wrapped the libcurl/curl binaries into a single library with the same format as request-promise. This gave me much more control over the request (preventing encoding, http2/proxy support and further session/TLS control), but it still only let me reach a mediocre rank of the 687th most popular ClientHello (https://client.tlsfingerprint.io:8443/). It wasn't good enough.

I had to move language. NodeJS is too high-level a language to allow really deep control (I had to modify packets being sent from Layer 3). So, as the answer to my question:

This is not yet possible to do in NodeJS, let alone with the now unmaintained request.js library.

For anyone reading this, if you want to forge perfect requests to bypass antibot security you must move to a different language: I recommend utls in Golang or BouncyCastle in C#. Godspeed to you, as it took me a year to really know how to do this. Even then, these languages have more internal issues and features they do not yet support (Go doesn't support 'basic' header ordering, you need to monkey-patch/modify internals, utls doesn't easily support proxies, etc.). The list goes on and on.

If you're not already too deep into it, it's one hell of a rabbit hole and I recommend you do not enter it.

There are some scenarios that I can think of:

  • The proxy is actually adding some headers to the final request (in order to identify you to the server)
  • The website you're trying to reach has your proxy IPs blacklisted (public/paid ones?)

It really depends on why you need to use that proxy:

  • Is it because of network restrictions?
  • Is it because you want to hide the original request address?

Also, if you have control over the proxy server, can you log the requests being made to the final server?
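One way to check the first scenario is to point the script at a server you control and log the headers exactly as they arrive, once with the proxy and once without. This is a minimal sketch using only Node's built-in http module; the list of proxy headers is illustrative rather than exhaustive, and the helper names and port are assumptions.

```javascript
const http = require('http');

// Headers that forwarding proxies commonly add (illustrative, not exhaustive).
const PROXY_HEADERS = ['via', 'forwarded', 'x-forwarded-for', 'x-forwarded-proto', 'x-real-ip'];

// rawHeaders is Node's flat [name, value, name, value, ...] array, in wire order.
function inspectRawHeaders(rawHeaders) {
  const order = [];
  const addedByProxy = [];
  for (let i = 0; i < rawHeaders.length; i += 2) {
    const name = rawHeaders[i];
    order.push(name);
    if (PROXY_HEADERS.includes(name.toLowerCase())) addedByProxy.push(name);
  }
  return { order, addedByProxy };
}

// Hypothetical debug endpoint: logs and echoes what it received.
function createEchoServer() {
  return http.createServer((req, res) => {
    const report = inspectRawHeaders(req.rawHeaders);
    console.log(report);
    res.end(JSON.stringify(report));
  });
}

// createEchoServer().listen(8080); // uncomment to run; the port is arbitrary
```

Comparing the two `order` arrays shows whether the proxy reordered or added anything before the request reached the server.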

My suggestion

Try writing your own proxy (a reverse one) and host it somewhere. Instead of requesting https://target.com, make the request to your http[s]://proxy.com/ and let the reverse proxy do the work. Also, remember to disable the X- headers (such as X-Forwarded-For) in the implementation, as they will change the request headers.
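As a rough idea of what such a reverse proxy could look like, here is a sketch built on Node's http/https modules rather than on node-http-proxy. The target URL is the placeholder from above, `forwardHeaders` and `createReverseProxy` are hypothetical names, and the sketch skips details a real proxy needs (hop-by-hop headers, duplicate header names, timeouts).

```javascript
const http = require('http');
const https = require('https');

const TARGET = new URL('https://target.com'); // placeholder target

// Rebuild the headers in arrival order, pointing Host at the target and
// dropping any X-Forwarded-* headers so the target sees nothing extra.
function forwardHeaders(rawHeaders, targetHost) {
  const headers = {};
  for (let i = 0; i < rawHeaders.length; i += 2) {
    const name = rawHeaders[i];
    const lower = name.toLowerCase();
    if (lower.startsWith('x-forwarded-')) continue;
    headers[name] = lower === 'host' ? targetHost : rawHeaders[i + 1];
  }
  return headers;
}

function createReverseProxy() {
  return http.createServer((clientReq, clientRes) => {
    const upstream = https.request({
      hostname: TARGET.hostname,
      port: TARGET.port || 443,
      path: clientReq.url,
      method: clientReq.method,
      headers: forwardHeaders(clientReq.rawHeaders, TARGET.host),
    }, (upstreamRes) => {
      // Relay status, headers and body back to the caller.
      clientRes.writeHead(upstreamRes.statusCode, upstreamRes.headers);
      upstreamRes.pipe(clientRes);
    });
    upstream.on('error', () => {
      if (!clientRes.headersSent) clientRes.statusCode = 502;
      clientRes.end();
    });
    clientReq.pipe(upstream);
  });
}

// createReverseProxy().listen(8080); // uncomment to run; the port is arbitrary
```

Note that the TLS connection to the target is still made by Node here, so this only addresses header-level differences.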

Reference for a node.js implementation:

https://github.com/nodejitsu/node-http-proxy

Note: please answer the questions I asked above in the comments.

You're using the http scheme for your request, but if the web server redirects http to https and the proxy server is not configured to accept redirects (to https), then the problem might only be about the scheme, and hence the URL, you enter.

So either the proxy has to be configured to accept redirects, or the URL has to be checked manually when a request fails and then adjusted if there was a redirect.
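Checking the URL manually could look like the sketch below, which assumes redirects are not followed automatically (as with `followRedirect: false` in the question) and that the status code and headers of the response are available. `nextUrlAfterRedirect` is a hypothetical helper.

```javascript
// Returns the URL a 3xx response points to, or null if it is not a redirect.
// Relative Location values are resolved against the URL that was requested.
function nextUrlAfterRedirect(currentUrl, statusCode, headers) {
  const isRedirect = statusCode >= 300 && statusCode < 400;
  if (!isRedirect || !headers.location) return null;
  return new URL(headers.location, currentUrl).href;
}

// Example: an http URL that the server upgrades to https.
const next = nextUrlAfterRedirect(
  'http://www.sitename.com/page',
  301,
  { location: 'https://www.sitename.com/page' }
);
console.log(next); // https://www.sitename.com/page
```

If the result differs from the original URL only in its scheme, requesting the https URL directly avoids the redirect through the proxy altogether.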

Here you can read about redirects on one proxy server (Apache Traffic Server); the scenario there includes more redirects than I described above:
https://docs.trafficserver.apache.org/en/4.2.x/admin/reverse-proxy-http-redirects.en.html#handling-origin-server-redirect-responses

If you still encounter problems, the server logs of the proxy server would be helpful.

EDIT:
According to the page @Jannes Botis linked, there are still more proxy settings that might support or disrupt the desired functionality, so the whole issue is perhaps about configuring the proxy server correctly. Here are a few settings that are directly related to redirects:

followRedirect - follow HTTP 3xx responses as redirects (default: true). This property can also be implemented as function which gets response object as a single argument and should return true if redirects should continue or false otherwise.
followAllRedirects - follow non-GET HTTP 3xx responses as redirects (default: false)
followOriginalHttpMethod - by default we redirect to HTTP method GET. you can enable this property to redirect to the original HTTP method (default: false)
maxRedirects - the maximum number of redirects to follow (default: 10)
removeRefererHeader - removes the referer header when a redirect happens (default: false). Note: if true, referer header set in the initial request is preserved during redirect chain.
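As an illustration of the function form of followRedirect described above, the sketch below builds an options object that only follows redirects staying on the same host. The policy, the helper name `sameHostRedirect` and the `maxRedirects` value are assumptions chosen for the example; the URL is the placeholder from the question.

```javascript
// Hypothetical policy: only follow redirects that stay on the same host.
function sameHostRedirect(originalUrl) {
  const originalHost = new URL(originalUrl).hostname;
  return (response) => {
    const location = response.headers && response.headers.location;
    if (!location) return false;
    // Resolve relative Location values before comparing hosts.
    return new URL(location, originalUrl).hostname === originalHost;
  };
}

const startUrl = 'http://www.sitename.com/';
const redirectOptions = {
  url: startUrl,
  followRedirect: sameHostRedirect(startUrl), // called with the response object
  followAllRedirects: false,
  maxRedirects: 3,
};
```

An http to https redirect on the same host passes this check, while a redirect to a different host does not.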

It's quite possible that other settings of the proxy server also affect whether your scenario fails or succeeds.
