

ConcurrentBag skipping some items C#

I am using a ConcurrentBag for scraping URLs. Right now it works fine for 500/100 URLs, but when I try to scrape 8000 URLs, not all of them are processed and some items remain pending in inputQueue.

But I am using while (!inputQueue.IsEmpty), so the loop should keep running as long as any items exist in inputQueue.

I want to run at most 100 threads. So I first create 100 threads and call the Run() method, and inside that method I run a loop that takes items until none remain in inputQueue, scraping each URL and adding the result to the output queue.

public ConcurrentBag<Data> inputQueue = new ConcurrentBag<Data>();
    public ConcurrentBag<Data> outPutQueue = new ConcurrentBag<Data>();

    public List<Data> Scrapes(List<Data> scrapeRequests)
    {
        ServicePointManager.ServerCertificateValidationCallback += (sender, cert, chain, sslPolicyErrors) => true;
        string proxy_session_id = new Random().Next().ToString();

        numberOfRequestSent = 0;

        watch.Start();

        foreach (var sRequest in scrapeRequests)
        {
            inputQueue.Add(sRequest);
        }
        //inputQueue.CompleteAdding();

        var taskList = new List<Task>();
        for (var i = 0; i < n_parallel_exit_nodes; i++) //create 100 threads only
        {
            taskList.Add(Task.Factory.StartNew(async () =>
            {
               await Run();
            }, TaskCreationOptions.RunContinuationsAsynchronously));
        }

        Task.WaitAll(taskList.ToArray());   //Waiting

        //print result
        Console.WriteLine("Number Of URLs Found - {0}", scrapeRequests.Count);
        Console.WriteLine("Number Of Request Sent - {0}", numberOfRequestSent);

        Console.WriteLine("Input Queue - {0}", inputQueue.Count);

        Console.WriteLine("OutPut Queue - {0}", outPutQueue.Count);
        Console.WriteLine("Success - {0}", outPutQueue.Count(x => x.IsProxySuccess));
        Console.WriteLine("Failed - {0}", outPutQueue.Count(x => !x.IsProxySuccess));
        Console.WriteLine("Process Time In - {0}", watch.Elapsed);

        return outPutQueue.ToList();
    }


    async Task<string> Run()
    {
        while (!inputQueue.IsEmpty)
        {
            var client = new Client(super_proxy_ip, "US");

            if (!client.have_good_super_proxy())
                client.switch_session_id();
            if (client.n_req_for_exit_node == switch_ip_every_n_req)
                client.switch_session_id();

            var scrapeRequest = new ProductResearch_ProData();
            inputQueue.TryTake(out scrapeRequest);

            try
            {
                numberOfRequestSent++;

                // Console.WriteLine("Sending request for - {0}", scrapeRequest.URL);
                scrapeRequest.HTML = client.DownloadString((string)scrapeRequest.URL);
                //Console.WriteLine("Response done for - {0}", scrapeRequest.URL);

                scrapeRequest.IsProxySuccess = true;

                outPutQueue.Add(scrapeRequest); //add object to output queue

                //lumanti code
                client.handle_response();
            }
            catch (WebException e)
            {
                Console.WriteLine("Failed");

                scrapeRequest.IsProxySuccess = false;
                Console.WriteLine(e.Message);
                outPutQueue.Add(scrapeRequest); //add object to output queue

                //lumanti code
                client.handle_response(e);
            }

            client.clean_connection_pool();
            client.Dispose();
        }

        return await Task.Run(() => "Done");
    }

There are multiple problems here, but none of them seems to be the cause of inputQueue.Count having a non-zero value at the end. In any case I would like to point out the problems I can see.

var taskList = new List<Task>();
for (var i = 0; i < n_parallel_exit_nodes; i++) // create 100 threads only
{
    taskList.Add(Task.Factory.StartNew(async () =>
    {
        await Run();
    }, TaskCreationOptions.RunContinuationsAsynchronously));
}

The method Task.Factory.StartNew doesn't understand async delegates, so when it is called with an async lambda as argument it returns a nested task. In this case it returns a Task<Task<string>>. You store this nested task in a List<Task> collection, which compiles because the type Task<TResult> inherits from the type Task, but by doing so you lose the ability to await the completion (and get the result) of the inner task; you only hold a reference to the outer task. Miraculously this is not a problem in this case (it usually is), since the outer task does all the work, and the inner task does essentially nothing other than using a thread-pool thread to return a "Done" string that is not really needed anywhere.
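A minimal, self-contained sketch of the nesting and the two usual ways around it (the Run method here is a hypothetical stand-in for the one in the question):

```csharp
using System;
using System.Threading.Tasks;

class NestedTaskDemo
{
    // Stand-in for the question's Run() method.
    static async Task<string> Run()
    {
        await Task.Delay(10);
        return "Done";
    }

    static void Main()
    {
        // StartNew with an async lambda returns Task<Task<string>>: the outer
        // task completes as soon as the lambda hands back its inner task.
        Task<Task<string>> nested = Task.Factory.StartNew(() => Run());

        // Unwrap() gives a proxy task that represents the inner task instead.
        Task<string> inner = nested.Unwrap();

        // Task.Run understands async delegates and unwraps automatically.
        Task<string> simpler = Task.Run(() => Run());

        Task.WaitAll(inner, simpler);
        Console.WriteLine(inner.Result);
        Console.WriteLine(simpler.Result);
    }
}
```

In other words, either call Unwrap() on the result of StartNew, or simply use Task.Run, which does the unwrapping for you.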

You also don't attach any continuations to the outer tasks, so the flag TaskCreationOptions.RunContinuationsAsynchronously seems redundant.

// create 100 threads only

You don't create 100 threads, you create 100 tasks. These tasks are scheduled on the ThreadPool, which will immediately be starved because the tasks are long-running, and it will then inject one new thread every 500 msec until all scheduled tasks have been assigned a thread.
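A common alternative that avoids starving the pool is to keep the tasks short and throttle concurrency with a SemaphoreSlim instead of dedicating one long-running loop per worker. The sketch below is an assumption-laden illustration (ScrapeAsync is a hypothetical placeholder for the per-URL work), not the question's actual scraping logic:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottleDemo
{
    // Hypothetical stand-in for scraping a single URL.
    static async Task<string> ScrapeAsync(string url)
    {
        await Task.Delay(10); // simulate network I/O
        return "HTML for " + url;
    }

    static async Task Main()
    {
        var urls = Enumerable.Range(1, 50)
                             .Select(i => $"https://example.com/{i}");

        // At most 100 scrapes run concurrently; the rest wait at the gate.
        using var throttler = new SemaphoreSlim(100);

        var tasks = urls.Select(async url =>
        {
            await throttler.WaitAsync();
            try { return await ScrapeAsync(url); }
            finally { throttler.Release(); }
        }).ToList();

        string[] results = await Task.WhenAll(tasks);
        Console.WriteLine(results.Length);
    }
}
```

This way each URL is its own short task, so no thread is pinned for the lifetime of a whole worker loop.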

var scrapeRequest = new ProductResearch_ProData();
inputQueue.TryTake(out scrapeRequest);

Here you instantiate an object of type ProductResearch_ProData that is immediately discarded and becomes eligible for garbage collection on the very next line. The TryTake method will either set its out parameter to an object removed from the bag and return true, or set it to null and return false if the bag is empty. You ignore the return value of TryTake, which can very well be false because the bag may meanwhile have been emptied by another worker, and then proceed with a scrapeRequest that is possibly null, resulting in that case in a NullReferenceException.
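The idiomatic pattern is to let TryTake's boolean result drive the loop, so an empty bag ends the worker instead of yielding a null item. A minimal sketch:

```csharp
using System;
using System.Collections.Concurrent;

class TryTakeDemo
{
    static void Main()
    {
        var bag = new ConcurrentBag<string>(new[] { "a", "b", "c" });

        // TryTake returns false once the bag is empty, ending the loop,
        // so 'item' is never null inside the body.
        while (bag.TryTake(out string item))
        {
            Console.WriteLine(item);
        }

        Console.WriteLine(bag.Count);
    }
}
```

This also removes the pointless allocation, since the out parameter is assigned by TryTake itself.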

It is worth noting that you extract an object of type ProductResearch_ProData from a ConcurrentBag<Data>, so either the class Data inherits from the base class ProductResearch_ProData, or there is a transcription error in the code.

