How to stop the NodeJS "request" module from changing the request when using a proxy

Sorry if this comes off as confusing.

I have written a script using the NodeJS request module that performs an action on a website and then returns the resulting data. The script works perfectly when I do not use a proxy, i.e. when the proxy option is set to false. Using Selenium/Puppeteer for this task is NOT an option.

proxy: false

However, when I set a (working) proxy, it fails to perform the same task and is detected by the website's firewall/antibot software.

proxy: http://xx.xxx.xx.xx:3128

Some things to note:

  • I have tried many (20+) different proxy providers (Residential and Datacenter) and they all have this issue
  • The issue does not occur if that proxy is set globally on my system
  • The issue does not occur if that proxy is set in a chrome extension
  • The SSL cipher suites do not match Chrome's, but they don't match when not using a proxy either, so I assume that isn't the issue
  • It is very important to keep consistency in the header order

Basically, the question is: does the request module change anything, such as the header order, when using a proxy?

Here is an image of what happens when it passes/fails. [image]

The only change that causes it to fail is the proxy setting: one request is made with a proxy, the other without.

url    : url,
method : 'GET',
simple : false,
forever: true,
resolveWithFullResponse: true,
gzip: true,
headers: {
    'Host'             : 'www.sitename.com',
    'Connection'       : 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent'       : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
    'Accept'           : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-encoding'  : 'gzip, deflate, br',
    'Accept-Language'  : 'en-GB,en-US;q=0.9,en;q=0.8',
},
jar: globalJar,
followRedirect: false,
followAllRedirects: false,

According to the proxies documentation of the request module:

By default, when proxying http traffic, request will simply make a standard proxied http request. This is done by making the url section of the initial line of the request a fully qualified url to the endpoint.

Instead, you can use an HTTP tunnel by setting:

tunnel : true

in the request module proxy settings.
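A minimal sketch of where that option goes, assuming the same style of options object as in the question. `buildOptions` is a hypothetical helper, and the URL and proxy address are the placeholders from the question. Note that, according to the request README, `tunnel` already defaults to true for https destinations and to false otherwise, so setting it explicitly mainly changes things for plain http targets.

```javascript
// Hypothetical helper: builds options for the `request` module with
// tunneling forced on. Only `proxy` and `tunnel` are the point here.
function buildOptions(url, proxy) {
  return {
    url,
    method: 'GET',
    proxy,        // e.g. 'http://xx.xxx.xx.xx:3128'
    tunnel: true, // always send CONNECT to the proxy first
  };
}

const options = buildOptions(
  'http://www.sitename.com/',
  'http://xx.xxx.xx.xx:3128'
);
console.log(options);

// Usage (needs the `request` package installed):
// const request = require('request');
// request(options, (err, res, body) => { /* ... */ });
```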

It could be that in your case you are making a standard proxied HTTP request, whereas when the proxy is set globally on your system or in a Chrome extension, an HTTP tunnel is created.
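To make the difference concrete, here is a rough illustration of the first line the client sends to the proxy in each mode. The two helper functions are made up for this sketch; they only build strings and send nothing over the network.

```javascript
// Standard proxied request: the request line carries the full target URL,
// and the proxy builds its own request to the endpoint from it.
function proxiedRequestLine(targetUrl) {
  return `GET ${new URL(targetUrl).href} HTTP/1.1`;
}

// Tunnel: the client first asks the proxy to open a connection to
// host:port, then talks to the endpoint through that connection.
function connectRequestLine(targetUrl) {
  const u = new URL(targetUrl);
  const port = u.port || (u.protocol === 'https:' ? '443' : '80');
  return `CONNECT ${u.hostname}:${port} HTTP/1.1`;
}

console.log(proxiedRequestLine('http://www.sitename.com/page'));
console.log(connectRequestLine('https://www.sitename.com/page'));
```

In the first mode the proxy parses and re-issues the request, so it has the opportunity to add or reorder headers. In tunnel mode a proxy normally just relays bytes once the CONNECT succeeds.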

From the documentation:

Note that, when using a tunneling proxy, the proxy-authorization header and any headers from custom proxyHeaderExclusiveList are never sent to the endpoint server, but only to the proxy server.

After deactivating my old account, I wanted to come back and give an actual answer to this question now that I fully understand it. What I was asking for a year ago was not possible: the antibot was fingerprinting me through the TLS ClientHello (and even slightly at the TCP/frame level).

To start, I wrote a wrapper called request-curl, which wrapped the libcurl/curl binaries into a single library with the same interface as request-promise. This gave me much more control over the request (preventing encoding, HTTP/2 and proxy support, and further session/TLS control), but it still only got me to a mediocre rank: the 687th most popular ClientHello ( https://client.tlsfingerprint.io:8443/ ). It wasn't good enough.

I had to switch languages. NodeJS is too high-level to allow really deep control (I had to modify packets being sent from Layer 3). So, as the answer to my question:

This is not yet possible to do in NodeJS, let alone with the now unmaintained request.js library.
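For reference, the sketch below shows roughly what Node's built-in tls/https modules do let you inspect and override: the cipher list. As far as I can tell there is no comparable option for the rest of the ClientHello, such as the order of extensions, which is the kind of detail these fingerprints pick up. The two cipher names passed to the agent are just an example selection.

```javascript
const tls = require('tls');
const https = require('https');

// Node's default cipher string, colon-separated, in preference order.
// It can also contain exclusion entries such as "!aNULL".
const defaultCiphers = tls.DEFAULT_CIPHERS.split(':');
console.log(defaultCiphers.length, 'entries in the default cipher string');
console.log(defaultCiphers.slice(0, 5));

// The `ciphers` option takes the same format. This changes which suites
// are offered, not the overall shape of the handshake.
const agent = new https.Agent({
  ciphers: 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384',
});
```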

For anyone reading this: if you want to forge perfect requests to bypass antibot security, you must move to a different language. I recommend utls in Golang or BouncyCastle in C#. Godspeed to you, as it took me a year to really learn how to do this. Even then, these languages have their own internal issues and features they do not yet support (Go doesn't support 'basic' header ordering, so you need to monkey-patch/modify internals, and utls doesn't easily support proxies). The list goes on and on.

If you're not already too deep into it: it's one hell of a rabbit hole, and I recommend you do not enter it.

There are some scenarios that I can think of:

  • The proxy is actually adding some headers to the final request (in order to identify you to the server)
  • The website you're trying to reach has your proxy IPs blacklisted (public/paid ones?)

It really depends on why you need to use that proxy

  • Is it because of network restrictions?
  • Is it because you want to hide the original request address?

Also, if you have control over the proxy server, can you log the requests being made to the final server?
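If you control neither the proxy nor the final server, one way to see what actually arrives is to point the script at a small server of your own, once directly and once through the proxy, and compare the output. A minimal sketch using only Node's built-in http module; `pairHeaders`, `createLoggingServer` and the port number are names and values chosen for illustration, and the server has to be reachable from the proxy for this to work.

```javascript
const http = require('http');

// req.rawHeaders is a flat [name, value, name, value, ...] array that keeps
// the order and casing as received, unlike the lowercased req.headers object.
function pairHeaders(rawHeaders) {
  const pairs = [];
  for (let i = 0; i < rawHeaders.length; i += 2) {
    pairs.push([rawHeaders[i], rawHeaders[i + 1]]);
  }
  return pairs;
}

// Prints the request line and every header exactly as it arrived.
function createLoggingServer() {
  return http.createServer((req, res) => {
    console.log(`${req.method} ${req.url} HTTP/${req.httpVersion}`);
    for (const [name, value] of pairHeaders(req.rawHeaders)) {
      console.log(`${name}: ${value}`);
    }
    res.end('ok');
  });
}

// To run it:
// createLoggingServer().listen(8080);
```

Any header that shows up only in the proxied run, or any change in order, was introduced somewhere between the script and the server.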

My suggestion

Try writing your own proxy (a reverse one) and host it somewhere. Instead of requesting https://target.com directly, make the request to your http[s]://proxy.com/ and let the reverse proxy do the work. Also, remember to disable the X headers (e.g. X-Forwarded-For) in the implementation, as they will change the request headers.

Reference for node.js implementation:

https://github.com/nodejitsu/node-http-proxy
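A bare-bones sketch of that idea using only Node's built-in modules rather than node-http-proxy, so none of its options are assumed here. `upstreamHeaders` and `createReverseProxy` are hypothetical names, and `target.com` is the placeholder from the text above. Headers are copied in the order they arrived, with any X-Forwarded-* header dropped and Host pointed at the target.

```javascript
const http = require('http');
const https = require('https');

const TARGET_HOST = 'target.com'; // placeholder

// Rebuild the header object from rawHeaders so order and casing survive.
// Repeated header names keep only the last value, which is good enough
// for a sketch but not for production use.
function upstreamHeaders(rawHeaders, targetHost) {
  const out = {};
  for (let i = 0; i < rawHeaders.length; i += 2) {
    const name = rawHeaders[i];
    const lower = name.toLowerCase();
    if (lower.startsWith('x-forwarded-')) continue;
    out[name] = lower === 'host' ? targetHost : rawHeaders[i + 1];
  }
  return out;
}

function createReverseProxy() {
  return http.createServer((clientReq, clientRes) => {
    const upstream = https.request(
      {
        host: TARGET_HOST,
        path: clientReq.url,
        method: clientReq.method,
        headers: upstreamHeaders(clientReq.rawHeaders, TARGET_HOST),
      },
      (upstreamRes) => {
        // Pass status, headers and body back unchanged.
        clientRes.writeHead(upstreamRes.statusCode, upstreamRes.headers);
        upstreamRes.pipe(clientRes);
      }
    );
    upstream.on('error', () => {
      clientRes.statusCode = 502;
      clientRes.end();
    });
    clientReq.pipe(upstream);
  });
}

// To run it:
// createReverseProxy().listen(8080);
```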

Note: please answer the questions I asked above in the comments.

You're using the http scheme for your request. If the web server redirects http to https and the proxy server is not configured to accept redirects (to https), then the problem might simply lie in the scheme, i.e. in the URL you enter.

So either the proxy has to be configured to accept redirects, or the URL has to be checked manually when a request fails and adjusted if there was a redirect.

Here you can read about redirects on one proxy server (Apache Traffic Server); the scenario there includes more redirects than I described above:
https://docs.trafficserver.apache.org/en/4.2.x/admin/reverse-proxy-http-redirects.en.html#handling-origin-server-redirect-responses
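Since the options in the question set followRedirect: false and resolveWithFullResponse: true, the manual check could look roughly like this. `nextUrl` is a hypothetical helper; it only looks at the status code and the Location header of a response.

```javascript
// Returns the absolute URL to retry with, or null when the response
// is not a redirect (or carries no Location header).
function nextUrl(currentUrl, statusCode, headers) {
  const isRedirect = statusCode >= 300 && statusCode < 400;
  if (!isRedirect || !headers.location) return null;
  // Location may be relative, so resolve it against the current URL.
  return new URL(headers.location, currentUrl).href;
}

console.log(nextUrl('http://www.sitename.com/a', 301, { location: 'https://www.sitename.com/a' }));
console.log(nextUrl('http://www.sitename.com/a', 302, { location: '/login' }));
console.log(nextUrl('http://www.sitename.com/a', 200, {}));
```

If the first call to the site returns a value whose scheme is https, the script can simply use the https URL from the start.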

If you still encounter problems, the server logs of the proxy server would be helpful.

EDIT:
According to the page @Jannes Botis linked, there are still more proxy settings that might support or disrupt the desired functionality, so the whole issue is perhaps about configuring the proxy server correctly. Here are a few settings that are directly related to redirects:

followRedirect - follow HTTP 3xx responses as redirects (default: true). This property can also be implemented as function which gets response object as a single argument and should return true if redirects should continue or false otherwise.
followAllRedirects - follow non-GET HTTP 3xx responses as redirects (default: false)
followOriginalHttpMethod - by default we redirect to HTTP method GET. you can enable this property to redirect to the original HTTP method (default: false)
maxRedirects - the maximum number of redirects to follow (default: 10)
removeRefererHeader - removes the referer header when a redirect happens (default: false). Note: if true, referer header set in the initial request is preserved during redirect chain.
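As a small illustration of the function form of followRedirect described above, here is a sketch that follows a redirect only when it points at an https URL. `followHttpsOnly` is a made-up name, and the check is an assumption about what you might want, not something the documentation prescribes.

```javascript
// Receives the response object and returns true to keep following.
// Relative Location values are not followed by this simple check.
function followHttpsOnly(response) {
  const location = response.headers && response.headers.location;
  return typeof location === 'string' && location.startsWith('https://');
}

// These keys would be merged into the options passed to `request`.
const redirectOptions = {
  followRedirect: followHttpsOnly,
  followAllRedirects: false,
  maxRedirects: 10,
};

console.log(followHttpsOnly({ headers: { location: 'https://www.sitename.com/' } }));
console.log(followHttpsOnly({ headers: { location: 'http://www.sitename.com/' } }));
```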

It's quite possible that other settings of the proxy server also affect whether your scenario fails or succeeds.
