
Large delay between receiving the response from the server and writing it to the log when reading files from Google Cloud Storage using HttpClient

I need to download multiple files from GCS. For this I have used the following code:

public class GCSStorage
{
    // googleStorageApi (the URL format string) and log are defined elsewhere and not shown here.
    static HttpClient httpClient;
    static GoogleCredential credential = GoogleCredential.FromFile(ConfigurationManager.AppSettings["GCPCredentials"]);

    // One-time setup must live in a static constructor: statements cannot sit directly in the class body.
    static GCSStorage()
    {
        if (credential.IsCreateScopedRequired)
        {
            credential = credential.CreateScoped(new[]
            {
                "https://www.googleapis.com/auth/devstorage.read_only"
            });
        }

        // Create the client outside the if block so it is never left null when no scoping is required.
        httpClient = new Google.Apis.Http.HttpClientFactory()
                        .CreateHttpClient(
                        new Google.Apis.Http.CreateHttpClientArgs()
                        {
                            ApplicationName = "",
                            GZipEnabled = true,
                            Initializers = { credential },
                        });
        httpClient.Timeout = new TimeSpan(0, 0, 5);
    }

    public string ReadObjectData(string bucketName, string location)
    {
        string responseBody = "";
        bool isFetched = false;
        try
        {
            Stopwatch sw = new Stopwatch();
            string pathcode = System.Web.HttpUtility.UrlEncode(location);
            UriBuilder uri = new UriBuilder(string.Format(googleStorageApi, bucketName, pathcode));
            sw.Start();
            // Blocks the calling thread until the response headers and body have arrived.
            var httpResponseMessage = httpClient.GetAsync(uri.Uri).Result;
            var t = sw.ElapsedMilliseconds;
            if (httpResponseMessage.StatusCode == HttpStatusCode.OK)
            {
                responseBody = httpResponseMessage.Content.ReadAsStringAsync().Result;
                log.Info($"Read file from location : {location} in Get() time : {t} ms , ReadAsString time :  {sw.ElapsedMilliseconds - t} ms, Total time : {sw.ElapsedMilliseconds} ms");
            }
            isFetched = true;
        }
        catch (Exception)
        {
            // Rethrow with "throw;" so the original stack trace is preserved ("throw ex;" resets it).
            throw;
        }
        return responseBody;
    }
}

I call it for multiple files like this:

GCSStorage gcs = new GCSStorage();
ParallelOptions option = new ParallelOptions { MaxDegreeOfParallelism = options };
Parallel.ForEach(myFiles, option, ri =>
{
    string text = gcs.ReadObjectData(bucket, ri);
});

I am recording the time taken to download each individual file in ReadObjectData(). When I download the files with MaxDegreeOfParallelism set to 1, each file is downloaded in about 100-150 ms. But when I change MaxDegreeOfParallelism to 50, the time per file varies between 1 and 3 s. I am downloading a batch of 50 files.

I have no idea why this is happening. Can anyone help me understand this behavior?

I have also tried the same thing with Amazon S3. S3 gives a constant download time of 50-100 ms in both scenarios.

I profiled the GCS responses using Fiddler. For the requests that take longer (more than about 200 ms), Overall Elapsed is around 100-200 ms, but the log entry is written much later. For the other requests, the log entry is written at exactly the time the response completes. Why would there be such a large time difference for some of the requests?

Fiddler Statistics

Request Count:   1
Bytes Sent:      439        (headers:439; body:0)
Bytes Received:  7,759      (headers:609; body:7,150)

ACTUAL PERFORMANCE
--------------
ClientConnected:    18:03:35.137
ClientBeginRequest: 18:04:13.606
GotRequestHeaders:  18:04:13.606
ClientDoneRequest:  18:04:13.606
Determine Gateway:  0ms
DNS Lookup:         0ms
TCP/IP Connect: 0ms
HTTPS Handshake:    0ms
ServerConnected:    18:03:35.152
FiddlerBeginRequest:    18:04:13.606
ServerGotRequest:   18:04:13.606
ServerBeginResponse:    18:04:13.700
GotResponseHeaders: 18:04:13.700
ServerDoneResponse: 18:04:13.700
ClientBeginResponse:    18:04:13.700
ClientDoneResponse: 18:04:13.700

    Overall Elapsed:    0:00:00.093

Log file

INFO  2018-08-25 18:04:13,606 41781ms GCSStorage ReadObjectData -  Get() time : 114 ms 
INFO  2018-08-25 18:04:14,512 42688ms GCSStorage ReadObjectData -  Get() time : 902 ms 

I can see that

LogTime - ClientDoneResponse + Overall Elapsed is approximately equal to Total Time
18:04:14.512 - 18:04:13.700 + 0:00:00.093 = 905 ms

Why is there so much time difference between receiving the response from the server and writing it to the log?

When you do parallel programming with multiple threads, there are a few things to keep in mind. Parallelism can improve performance, but more parallelism is not always better than running sequentially. There are several reasons for this. One is that you are limited by the number of physical cores, and by whether hyper-threading is enabled. For example, with 8 cores you will typically get the best performance with 8 threads; if hyper-threading is also active, 16 threads may still perform well.

In your example, going from 1 thread to 50 is too big a jump. Try it in steps (2, 4, 6, 8, 10) and see where you get the best performance (record the time as you have done so far).

That number is most likely the best degree of parallelism for your workload.
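A minimal sketch of that step-by-step measurement could look like the following. It is only an illustration under assumptions: SimulatedDownload is a hypothetical stand-in that blocks with Thread.Sleep instead of calling gcs.ReadObjectData(bucket, ri), and the class name, the 50-file count and the 20 ms delay are made up for the example. Swap in the real call and your own file list to get meaningful numbers.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelismSweep
{
    // Hypothetical stand-in for gcs.ReadObjectData(bucket, ri):
    // blocks the calling thread for a fixed time, like a synchronous download.
    static void SimulatedDownload(int delayMs) => Thread.Sleep(delayMs);

    // Runs fileCount simulated downloads at the given degree of parallelism
    // and returns the total wall-clock time in milliseconds.
    public static long Measure(int degree, int fileCount, int delayMs)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = degree };
        var sw = Stopwatch.StartNew();
        Parallel.ForEach(Enumerable.Range(0, fileCount), options, _ => SimulatedDownload(delayMs));
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Step through the suggested degrees and record the total time for each.
        foreach (int degree in new[] { 1, 2, 4, 6, 8, 10 })
        {
            long totalMs = Measure(degree, fileCount: 50, delayMs: 20);
            Console.WriteLine($"MaxDegreeOfParallelism = {degree}: {totalMs} ms total");
        }
    }
}
```

Compare the total time for the whole batch across the steps, not just the per-file time, and stop increasing the degree once the total no longer improves.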
