
Add items to a ConcurrentBag used by Parallel.ForEach (C#)

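I am trying to crawl multiple URLs concurrently. Each request can add more URLs to a ConcurrentBag to be crawled. At the moment I have a nasty while (true) loop that starts a new Parallel.ForEach to process any new URLs.

Is there any way to add to the ConcurrentBag so that the running Parallel.ForEach sees the new items and keeps iterating over them?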

ConcurrentBag<LinkObject> URLSToCheck = new ConcurrentBag<LinkObject>();

while (true)
{
    Parallel.ForEach(URLSToCheck, new ParallelOptions { MaxDegreeOfParallelism = 5 }, URL =>
    {
        Checker Checker = new Checker();

        URLDownloadResult result = Checker.downloadFullURL(URL.destinationURL);

        List<LinkObject> URLsToAdd = Checker.findInternalUrls(URL.sourceURL, result.html);

        foreach (var URLToAdd in URLsToAdd)
        {
            URLSToCheck.Add(new LinkObject { sourceURL = URLToAdd.sourceURL, destinationURL = URLToAdd.destinationURL });
        }
    });

    if(URLSToCheck.Count == 0)break;
}

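You could take a look at BlockingCollection.

BlockingCollection provides an implementation of the producer/consumer pattern: your producers add to the blocking collection, and your Parallel.ForEach consumes from it.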
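To make the pattern concrete, here is a minimal sketch of a single producer and a single consumer, independent of Parallel.ForEach. The class name ProducerConsumerSketch and the integer items are made up for illustration; they are not part of the question's code.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ProducerConsumerSketch
{
    // Hypothetical helper: produces itemCount integers on one task
    // and consumes them on the calling thread.
    public static List<int> Run(int itemCount)
    {
        var queue = new BlockingCollection<int>();
        var consumed = new List<int>();

        // Producer: adds items, then signals that no more will arrive.
        var producer = Task.Run(() =>
        {
            for (int i = 0; i < itemCount; i++)
                queue.Add(i);
            queue.CompleteAdding();
        });

        // Consumer: blocks while the collection is empty and ends
        // once CompleteAdding has been called and the queue is drained.
        foreach (var item in queue.GetConsumingEnumerable())
            consumed.Add(item);

        producer.Wait();
        return consumed;
    }
}
```

The point to notice is that GetConsumingEnumerable() keeps waiting for new items instead of ending when the collection is momentarily empty, which is the behaviour a ConcurrentBag enumeration lacks.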

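To feed Parallel.ForEach from a BlockingCollection, you have to implement a custom partitioner for it (the reason is explained here: https://blogs.msdn.microsoft.com/pfxteam/2010/04/06/parallelextensionsextras-tour-4-blockingcollectionextensions/).

The partitioner: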

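using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

class BlockingCollectionPartitioner<T> : Partitioner<T>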
{
    private BlockingCollection<T> _collection;

    internal BlockingCollectionPartitioner(BlockingCollection<T> collection)
    {
        if (collection == null)
            throw new ArgumentNullException("collection");
        _collection = collection;
    }

    public override bool SupportsDynamicPartitions 
    {
        get { return true; }
    }

    public override IList<IEnumerator<T>> GetPartitions(int partitionCount)
    {
        if (partitionCount < 1)
            throw new ArgumentOutOfRangeException("partitionCount");

        var dynamicPartitioner = GetDynamicPartitions();
        return Enumerable.Range(0, partitionCount).Select(_ => dynamicPartitioner.GetEnumerator()).ToArray();
    }

    public override IEnumerable<T> GetDynamicPartitions()
    {
        return _collection.GetConsumingEnumerable();
    }
}

You would then use it like this:

BlockingCollection<LinkObject> URLSToCheck = new BlockingCollection<LinkObject>();

Parallel.ForEach(
    new BlockingCollectionPartitioner<LinkObject>(URLSToCheck), 
    new ParallelOptions { MaxDegreeOfParallelism = 5 }, URL =>
       {
            //....
       });

From another thread, you add to the URLSToCheck collection:

URLSToCheck.Add(...)

When there are no more URLs to process, call URLSToCheck.CompleteAdding() and the Parallel.ForEach should stop automatically.
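In the crawler case the consumers are also the producers, so something has to decide when to call CompleteAdding(). One possible approach, sketched below, is to count the items that are queued or in flight and complete the collection when that count drops to zero. This is only a sketch under assumptions: the linkGraph dictionary is a hypothetical stand-in for downloading a page and extracting its links, the visited set is added to avoid crawling a URL twice, and the built-in Partitioner.Create(..., EnumerablePartitionerOptions.NoBuffering) is used in place of the custom partitioner above to keep the example self-contained.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class BlockingCrawlSketch
{
    // linkGraph maps a URL to the links found on it (stand-in for real downloads).
    public static ICollection<string> Crawl(
        IReadOnlyDictionary<string, string[]> linkGraph, IEnumerable<string> seeds)
    {
        var queue = new BlockingCollection<string>();
        var visited = new ConcurrentDictionary<string, byte>();
        int pending = 0; // items queued or currently being processed

        foreach (var seed in seeds)
        {
            if (visited.TryAdd(seed, 0))
            {
                Interlocked.Increment(ref pending);
                queue.Add(seed);
            }
        }
        if (pending == 0)
            queue.CompleteAdding(); // no seeds, so nothing will ever arrive

        // NoBuffering hands items to workers one at a time as they arrive.
        var source = Partitioner.Create(
            queue.GetConsumingEnumerable(), EnumerablePartitionerOptions.NoBuffering);

        Parallel.ForEach(source, new ParallelOptions { MaxDegreeOfParallelism = 5 }, url =>
        {
            try
            {
                if (linkGraph.TryGetValue(url, out var links))
                {
                    foreach (var link in links)
                    {
                        // Count the new item before queueing it, so pending
                        // cannot reach zero while work is still outstanding.
                        if (visited.TryAdd(link, 0))
                        {
                            Interlocked.Increment(ref pending);
                            queue.Add(link);
                        }
                    }
                }
            }
            finally
            {
                // The last item to finish closes the collection,
                // which lets Parallel.ForEach return.
                if (Interlocked.Decrement(ref pending) == 0)
                    queue.CompleteAdding();
            }
        });

        return visited.Keys;
    }
}
```

Because each item increments the counter for its children before decrementing for itself, the counter only reaches zero when the queue is empty and no worker is still processing. Error handling is deliberately minimal here; a real crawler would need to decide what to do with the remaining queue when a download throws.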

TPL Dataflow is handy here. This can be done nicely with an ActionBlock:

// Declare the variable first so the lambda below can refer to the block itself
ActionBlock<LinkObject> actionBlock = null;

actionBlock = new ActionBlock<LinkObject>(URL =>
{
    Checker Checker = new Checker();
    URLDownloadResult result = Checker.downloadFullURL(URL.destinationURL);
    List<LinkObject> URLsToAdd = Checker.findInternalUrls(URL.sourceURL, result.html);
    // Post returns bool, so wrap it in a lambda to match the Action<T> that List<T>.ForEach expects
    URLsToAdd.ForEach(link => actionBlock.Post(link));
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

Then post your initial URLs to the actionBlock:

actionBlock.Post(url1);
actionBlock.Post(url2);
...
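The snippet above never tells the block when the crawl is finished. A possible way to handle that, sketched here under assumptions, is to track outstanding items and call Complete() when the count reaches zero, then wait on the block's Completion task. The linkGraph dictionary and the DataflowCrawlSketch class are hypothetical stand-ins for the Checker calls, and the visited set is an addition to avoid posting the same URL twice. The sketch assumes the System.Threading.Tasks.Dataflow library is available.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks.Dataflow;

public static class DataflowCrawlSketch
{
    // linkGraph maps a URL to the links found on it (stand-in for real downloads).
    public static ICollection<string> Crawl(
        IReadOnlyDictionary<string, string[]> linkGraph, string seed)
    {
        var visited = new ConcurrentDictionary<string, byte>();
        int pending = 0; // items posted but not yet fully processed
        ActionBlock<string> block = null;

        block = new ActionBlock<string>(url =>
        {
            try
            {
                if (linkGraph.TryGetValue(url, out var links))
                {
                    foreach (var link in links)
                    {
                        if (visited.TryAdd(link, 0))
                        {
                            Interlocked.Increment(ref pending);
                            block.Post(link);
                        }
                    }
                }
            }
            finally
            {
                // The last outstanding item tells the block to stop accepting input.
                if (Interlocked.Decrement(ref pending) == 0)
                    block.Complete();
            }
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

        visited.TryAdd(seed, 0);
        Interlocked.Increment(ref pending);
        block.Post(seed);

        // Blocks until every posted item has been processed.
        // In async code, "await block.Completion" would be the usual choice.
        block.Completion.Wait();
        return visited.Keys;
    }
}
```

If the delegate throws, the block faults and Completion.Wait() rethrows the failure as an AggregateException, so a real crawler would probably catch download errors inside the delegate.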

