
HttpClient with multiple proxies while handling socket exhaustion and DNS recycling

We are working on a fun project with a friend and we have to execute hundreds of HTTP requests, all using different proxies. Imagine that it is something like the following:

for (int i = 0; i < 20; i++)
{
    HttpClientHandler handler = new HttpClientHandler { Proxy = new WebProxy(randomProxy, true) };

    using (var client = new HttpClient(handler))
    {
        using (var request = new HttpRequestMessage(HttpMethod.Get, "http://x.com"))
        {
            var response = await client.SendAsync(request);

            if (response.IsSuccessStatusCode)
            {
                string content = await response.Content.ReadAsStringAsync();
            }
        }

        using (var request2 = new HttpRequestMessage(HttpMethod.Get, "http://x.com/news"))
        {
            var response = await client.SendAsync(request2);

            if (response.IsSuccessStatusCode)
            {
                string content = await response.Content.ReadAsStringAsync();
            }
        }
    }
}

By the way, we are using .NET Core (Console Application for now). I know there are many threads about socket exhaustion and handling DNS recycling, but this particular one is different, because of the multiple proxy usage.

If we use a singleton instance of HttpClient, just like everyone suggests:

  • We can't set more than one proxy, because it is being set during HttpClient's instantiation and cannot be changed afterwards.
  • It doesn't respect DNS changes. A reused HttpClient instance holds on to its socket until the socket is closed, so if a DNS record is updated on the server, the client won't notice until that socket is closed. One workaround is to set the keep-alive header to false, so that the socket is closed after each request, but that leads to sub-optimal performance. Another is to use ServicePoint:
// FindServicePoint has no single-string overload, so pass a Uri.
ServicePointManager.FindServicePoint(new Uri("http://x.com"))
    .ConnectionLeaseTimeout = Convert.ToInt32(TimeSpan.FromSeconds(15).TotalMilliseconds);

ServicePointManager.DnsRefreshTimeout = Convert.ToInt32(TimeSpan.FromSeconds(5).TotalMilliseconds);

On the other hand, disposing HttpClient (as in my example above), in other words creating multiple instances of HttpClient, leaves multiple sockets in the TIME_WAIT state. TIME_WAIT indicates that the local endpoint (this side) has closed the connection.

I'm aware of SocketsHttpHandler and IHttpClientFactory, but they don't solve the problem of using different proxies.

var socketsHandler = new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromMinutes(10),
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(5),
    MaxConnectionsPerServer = 10
};

// Cannot set a different proxy for each request
var client = new HttpClient(socketsHandler);

What is the most sensible decision that can be made?

The point of reusing HttpClient instances (or more specifically, reusing the last HttpMessageHandler) is to reuse the socket connections. Different proxies mean different socket connections, so it doesn't make sense to try to reuse an HttpClient / HttpMessageHandler on a different proxy, because it would have to be a different connection.

"we have to execute hundreds of HTTP requests, all using different proxies"

If every request truly uses a unique proxy, and no proxy is shared with any other request, then you may as well just keep the individual HttpClient instances and live with the TIME_WAIT.

However, if multiple requests may go through the same proxy, and you want to re-use those connections, then that is certainly possible.

I would recommend using IHttpClientFactory. It allows you to define named HttpClient instances (again, technically the last HttpMessageHandler instances) that can be pooled and reused. Just make one for each proxy:

var proxies = new Dictionary<string, IWebProxy>(); // TODO: populate with proxies.
foreach (var proxy in proxies)
{
  services.AddHttpClient(proxy.Key)
      .ConfigurePrimaryHttpMessageHandler(() => new HttpClientHandler { Proxy = proxy.Value });
}

ConfigurePrimaryHttpMessageHandler controls how the IHttpClientFactory creates the primary HttpMessageHandler instances that are pooled. I copied HttpClientHandler from the code in your question, but most modern apps use SocketsHttpHandler, which also has Proxy / UseProxy properties.

Then, when you want to use one, call IHttpClientFactory.CreateClient and pass the name of the HttpClient you want:

for (int i = 0; i < 20; i++)
{
  var client = _httpClientFactory.CreateClient(randomProxyName);
  ...
}
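
For readers who prefer SocketsHttpHandler, the registration and lookup steps above might be sketched as follows. This is only a minimal sketch: it assumes the Microsoft.Extensions.Http and Microsoft.Extensions.DependencyInjection packages are referenced, and the ProxyClientRegistration class, the BuildFactory method, and the 10-minute connection lifetime are hypothetical choices for illustration, not something taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical helper: registers one named client per proxy and returns the factory.
public static class ProxyClientRegistration
{
    public static IHttpClientFactory BuildFactory(IReadOnlyDictionary<string, IWebProxy> proxies)
    {
        var services = new ServiceCollection();

        foreach (var proxy in proxies)
        {
            // The client name is the dictionary key; the handler carries the proxy.
            services.AddHttpClient(proxy.Key)
                .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
                {
                    Proxy = proxy.Value,
                    UseProxy = true,
                    // Assumed value: recycle pooled connections so DNS changes are picked up.
                    PooledConnectionLifetime = TimeSpan.FromMinutes(10)
                });
        }

        return services.BuildServiceProvider().GetRequiredService<IHttpClientFactory>();
    }
}
```

A caller would build the factory once at startup, for example with a dictionary containing ["proxy-a"] = new WebProxy("http://127.0.0.1:8080"), and then call factory.CreateClient("proxy-a") for each request. The placeholder name and address are assumptions as well.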

First of all, I want to mention that @Stephen Cleary's example works fine if the proxies are known at compile-time, but in my case they are known at runtime. I forgot to mention that in the question, so it's my fault.

Thanks to @aepot for pointing those things out.

That's the solution I came up with (credits @mcont):

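using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;
using Flurl.Http;

/// <summary>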
/// A wrapper class for <see cref="FlurlClient"/>, which solves socket exhaustion and DNS recycling.
/// </summary>
public class FlurlClientManager
{
    /// <summary>
    /// Static collection, which stores the clients that are going to be reused.
    /// </summary>
    private static readonly ConcurrentDictionary<string, IFlurlClient> _clients = new ConcurrentDictionary<string, IFlurlClient>();

    /// <summary>
    /// Gets the available clients.
    /// </summary>
    /// <returns></returns>
    public ConcurrentDictionary<string, IFlurlClient> GetClients()
        => _clients;

    /// <summary>
    /// Creates a new client or gets an existing one.
    /// </summary>
    /// <param name="clientName">The client name.</param>
    /// <param name="proxy">The proxy URL.</param>
    /// <returns>The <see cref="FlurlClient"/>.</returns>
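    public IFlurlClient CreateOrGetClient(string clientName, string proxy = null)
    {
        // Use the factory overload so that a new client (and its handler) is only
        // created when the name is missing or the cached client has been disposed.
        return _clients.AddOrUpdate(
            clientName,
            _ => CreateClient(proxy),
            (_, client) => client.IsDisposed ? CreateClient(proxy) : client);
    }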

    /// <summary>
    /// Disposes a client. This leaves a socket in TIME_WAIT state for 240 seconds but it's necessary in case a client has to be removed from the list.
    /// </summary>
    /// <param name="clientName">The client name.</param>
    /// <returns>Returns true if the operation is successful.</returns>
    public bool DeleteClient(string clientName)
    {
        // TryRemove avoids a KeyNotFoundException when the name is unknown.
        if (!_clients.TryRemove(clientName, out var client))
        {
            return false;
        }

        client.Dispose();
        return true;
    }

    private IFlurlClient CreateClient(string proxy = null)
    {
        var handler = new SocketsHttpHandler()
        {
            Proxy = proxy != null ? new WebProxy(proxy, true) : null,
            PooledConnectionLifetime = TimeSpan.FromMinutes(10)
        };

        var client = new HttpClient(handler);

        return new FlurlClient(client);
    }
}

A proxy per request means an additional socket for each request (another HttpClient instance).

In the solution above, a ConcurrentDictionary stores the HttpClients so that I can reuse them, which is the whole point of HttpClient. I can use the same proxy for 5 requests before it gets blocked by API limits. I forgot to mention that in the question as well.
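
The same reuse idea can be sketched without Flurl. The following is a minimal sketch, not the code I actually run: ProxyHttpClientPool and its method names are hypothetical, the pool is keyed by proxy URL rather than by client name, and the 10-minute lifetime is simply carried over from the class above. It wraps each client in Lazy<T> so that two threads asking for the same proxy at the same time do not each construct a handler.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;

// Hypothetical helper, not part of Flurl or the BCL: one HttpClient per proxy URL.
public sealed class ProxyHttpClientPool
{
    // Lazy<T> ensures that only one client is constructed per key,
    // even when several threads request the same proxy concurrently.
    private readonly ConcurrentDictionary<string, Lazy<HttpClient>> _clients =
        new ConcurrentDictionary<string, Lazy<HttpClient>>();

    // Returns the cached client for this proxy, creating it on first use.
    public HttpClient GetOrCreate(string proxyUrl)
    {
        return _clients.GetOrAdd(proxyUrl, url => new Lazy<HttpClient>(() => Create(url))).Value;
    }

    // Removes the client for a proxy (for example, once the proxy is blocked) and disposes it.
    public bool Remove(string proxyUrl)
    {
        if (!_clients.TryRemove(proxyUrl, out var lazy))
        {
            return false;
        }

        if (lazy.IsValueCreated)
        {
            lazy.Value.Dispose();
        }

        return true;
    }

    private static HttpClient Create(string proxyUrl)
    {
        var handler = new SocketsHttpHandler
        {
            Proxy = new WebProxy(proxyUrl, true),
            UseProxy = true,
            // Recycle pooled connections periodically so DNS changes are picked up.
            PooledConnectionLifetime = TimeSpan.FromMinutes(10)
        };

        return new HttpClient(handler);
    }
}
```

Calling GetOrCreate twice with the same proxy URL returns the same HttpClient instance, and Remove can be called once a proxy has used up its request quota.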

As you've seen, there are two solutions to socket exhaustion and DNS recycling: IHttpClientFactory and SocketsHttpHandler. The first one doesn't suit my case, because the proxies I'm using are known at runtime, not at compile-time. The solution above uses the second one.

For those who have the same issue, you can read the following issue on GitHub. It explains everything.

I'm open to improvements, so let me know if you have any.

I've collected my comments into this answer. These are improvement suggestions rather than a solution, because your question is strongly context-dependent: how many proxies, how many requests per minute, the average duration of each request, and so on.

Disclaimer: I'm not familiar with IHttpClientFactory, but as far as I know it's the only way to solve the socket exhaustion and DNS problems.

Note: ServicePointManager doesn't affect HttpClient in .NET Core, because it is intended for use with HttpWebRequest, which HttpClient in .NET Core does not use.

As suggested by @GuruStron, an HttpClient instance per proxy looks like a reasonable solution.

HttpResponseMessage is IDisposable. Wrap it in a using statement; this affects how the underlying sockets are used.

You can pass HttpCompletionOption.ResponseHeadersRead to SendAsync so that the whole response body isn't read when the request is sent. You can then skip reading the body if the server returns an unsuccessful status code.

To improve internal performance, you can also append .ConfigureAwait(false) to the SendAsync() and ReadAsStringAsync() calls. This is mostly useful when the current SynchronizationContext is not null (e.g. when it's not a console app).

Here's a somewhat optimized version (C# 8.0):

private static async Task<string> GetHttpResponseAsync(HttpClient client, string url)
{
    using HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead).ConfigureAwait(false);
    if (response.IsSuccessStatusCode)
    {
        return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
    }
    return null;
}

Pass a pooled HttpClient and the URL to the method.

