Performance issue with C# threading task and multiple web page downloads

I'm running code to download a large number of documents from county websites, usually tax statements. The code seems fast and efficient at first and works well until the file count reaches about 200; that is when performance begins to plummet. If I let it keep running it still works, but it eventually becomes painfully slow. I usually have to stop it, figure out which files haven't been downloaded, and start over.

Any help making this faster, more efficient, and smoother (regardless of file count) would be greatly appreciated.

I've been convinced the performance issue has to do with immediately writing the results to an HTML file. I've tried storing the results in a StringBuilder until the downloads finish, but of course I run out of memory.

I've also tried adjusting MaxDegreeOfParallelism; lowering it to 5 seemed to make a small difference, but the performance problem related to file count still exists.

    private void Run_Mass_TaxBillDownload()
    {
        string strTag = null;
        string county = countyName.SelectedItem.ToString() + "-";

        //Converting urlList to uriList...
        List<Uri> uriList = new List<Uri>();
        foreach (string url in TextViewer.Lines) // TextViewer is a textbox where the urls to be downloaded are stored...
        {
            if (url.Length > 5)
            {
                Uri myUri = new Uri(url.Trim(), UriKind.RelativeOrAbsolute);
                uriList.Add(myUri);
            }
        }

        Parallel.ForEach(uriList, new ParallelOptions { MaxDegreeOfParallelism = 5 }, str =>
        {
            using (WebClient client = new WebClient())
            {
                //Extracting taxbill numbers from the url to use as file names in the saved file...
                string FirstString = null;
                string LastString = null;
                if (str.ToString().ToLower().Contains("&tptick")) { FirstString = "&TPTICK="; LastString = "&TPSX="; }
                if (str.ToString().ToLower().Contains("&ticket=")) { FirstString = "&ticket="; LastString = "&ticketsuff="; }
                if (str.ToString().ToLower().Contains("demandbilling")) { FirstString = "&ticketNumber="; LastString = "&ticketSuffix="; }

                //Start downloading...
                client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
                client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(clientTaxBill_DownloadStringCompleted);
                client.DownloadStringAsync(str, county + (Between(str.ToString(), FirstString, LastString)));
            }
        });
    }
    private static void clientTaxBill_DownloadStringCompleted(Object sender, DownloadStringCompletedEventArgs e)
    {
        //Creating Output file....
        string deskTopPath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
        string outputPath = deskTopPath + "\\Downloaded Tax Bills";
        string errOutputFile = outputPath + "\\errorReport.txt";
        string results = null;
        string taxBillNum = e.UserState as string;

        try
        {
            File.WriteAllText(outputPath + "\\" + taxBillNum + ".html", e.Result.ToString());
        }
        catch
        {
            results = Environment.NewLine + "<<{ERROR}>> NOTHING FOUND FOR" + taxBillNum;
            File.AppendAllText(errOutputFile, results);
        }
    }

DownloadStringAsync returns as soon as the download has been started, so the loop body only registers the DownloadStringCompleted callback, kicks off the request, and moves on to the next URL. As a result, more than 5 downloads can be running at once.

In other words, Parallel.ForEach is not waiting for each download to complete, so MaxDegreeOfParallelism does not limit the number of downloads in flight.
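To see the effect without touching the network, here is a minimal sketch that replaces the download with a `Task.Delay`. The class and method names (`FireAndForgetDemo`, `MeasurePeakAsync`, `SimulatedDownloadAsync`) are made up for illustration and are not part of the original code. The loop body starts the simulated download and returns immediately, just as `DownloadStringAsync` does.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class FireAndForgetDemo
{
    private static int inFlight;
    private static int peak;

    // Hypothetical stand-in for a download: counts itself as in flight for 200 ms.
    private static async Task SimulatedDownloadAsync()
    {
        int now = Interlocked.Increment(ref inFlight);
        int seen;
        // Record the highest concurrency observed so far.
        while (now > (seen = Volatile.Read(ref peak)))
        {
            Interlocked.CompareExchange(ref peak, now, seen);
        }
        await Task.Delay(200);
        Interlocked.Decrement(ref inFlight);
    }

    // Returns the largest number of simulated downloads that overlapped.
    public static async Task<int> MeasurePeakAsync(int itemCount, int maxDegree)
    {
        inFlight = 0;
        peak = 0;
        var started = new ConcurrentBag<Task>();

        Parallel.ForEach(
            Enumerable.Range(0, itemCount),
            new ParallelOptions { MaxDegreeOfParallelism = maxDegree },
            _ =>
            {
                // Start the work and return at once; nothing here waits for completion.
                started.Add(SimulatedDownloadAsync());
            });

        await Task.WhenAll(started);
        return peak;
    }
}
```

Calling `await FireAndForgetDemo.MeasurePeakAsync(50, 5)` should report a peak well above 5, because the loop finishes handing out work long before the first simulated download completes.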

ActionBlock is your friend here, as it just works better with async code. Couple it with HttpClient instead of WebClient.

Try something like this:

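using System;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // System.Threading.Tasks.Dataflow NuGet package on .NET Framework

// Share one HttpClient; creating a new one per request can exhaust sockets.
private static readonly HttpClient httpClient = new HttpClient();

public static async Task Downloader()
{
    var urls = new string[] { "https://www.google.co.uk/", "https://www.microsoft.com/" };

    // Processes at most 5 urls concurrently; each delegate is awaited to completion.
    var ab = new ActionBlock<string>(async (url) =>
    {
        using (var httpResponse = await httpClient.GetAsync(url))
        {
            var text = await httpResponse.Content.ReadAsStringAsync();

            // In real code, write the text to a file here
            Console.WriteLine(text);
        }
    }, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = 5 });

    foreach (var url in urls)
    {
        await ab.SendAsync(url);
    }

    // Signal that no more urls are coming, then wait for the queued ones to finish.
    ab.Complete();
    await ab.Completion;
    Console.WriteLine("Done");
    Console.ReadKey();
}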
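For the file-saving part of the question, a rough sketch along the same lines might look like the following. The names `TaxBillSaver`, `SaveAllAsync`, and the `fetch` parameter are assumptions for illustration only. The HTTP call is passed in as a delegate (in real use it could be `url => httpClient.GetStringAsync(url)` on a shared `HttpClient`), and the `BoundedCapacity = 20` value is arbitrary. The error file name follows the one in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class TaxBillSaver
{
    // jobs maps a file name (without extension) to the URL to download.
    // Returns the number of downloads that failed.
    public static async Task<int> SaveAllAsync(
        IEnumerable<KeyValuePair<string, Uri>> jobs,
        Func<Uri, Task<string>> fetch,
        string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        string errorLog = Path.Combine(outputDir, "errorReport.txt");
        var logLock = new object();
        int failures = 0;

        var block = new ActionBlock<KeyValuePair<string, Uri>>(async job =>
        {
            try
            {
                string html = await fetch(job.Value);
                File.WriteAllText(Path.Combine(outputDir, job.Key + ".html"), html);
            }
            catch (Exception ex)
            {
                // Several workers may fail at once, so serialize access to the log.
                lock (logLock)
                {
                    failures++;
                    File.AppendAllText(errorLog,
                        "<<{ERROR}>> " + job.Key + ": " + ex.Message + Environment.NewLine);
                }
            }
        }, new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 5, // at most 5 downloads in flight
            BoundedCapacity = 20        // assumed value; caps queued plus running jobs
        });

        foreach (var job in jobs)
        {
            // Waits when the block is full instead of queueing every URL up front.
            await block.SendAsync(job);
        }

        block.Complete();
        await block.Completion;
        return failures;
    }
}
```

Because each page is written to disk as soon as it arrives, nothing accumulates in memory, and the failed entries end up in one report that can be used to retry only the missing files.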

MaxDegreeOfParallelism = 5 means at most 5 urls are processed at the same time. The await ab.SendAsync(url); is important: if you want to restrict the buffer size with BoundedCapacity = n, SendAsync will wait until the block has room, whereas the ab.Post() method will not wait; it just returns false if there is no room.
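The difference between Post and SendAsync on a bounded block can be sketched as follows. This is an illustrative example, not code from the question: `BoundedCapacityDemo`, `RunAsync`, and the `gate` used to hold the first item in progress are all made up for the demonstration.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BoundedCapacityDemo
{
    public static async Task<Tuple<bool, bool, bool>> RunAsync()
    {
        // The gate keeps the first item "in progress" so the block stays full.
        var gate = new TaskCompletionSource<bool>();

        var block = new ActionBlock<int>(
            async item => await gate.Task,
            new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

        bool firstPost = block.Post(1);   // there is room, so the item is accepted
        bool secondPost = block.Post(2);  // block is full, so Post declines immediately

        // SendAsync does not give up; its task completes once the block accepts the item.
        Task<bool> pendingSend = block.SendAsync(3);

        gate.SetResult(true);             // let the first item finish, freeing a slot
        bool sent = await pendingSend;

        block.Complete();
        await block.Completion;
        return Tuple.Create(firstPost, secondPost, sent);
    }
}
```

With Post, the caller has to check the return value and decide what to do with a rejected item (item 2 above is simply lost). With an awaited SendAsync, the producer loop slows down to the pace of the consumers.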
