Dynamically subscribe/unsubscribe in TPL Dataflow

I have a stream of messages, and based on some criteria I want each consumer to be able to process some of them in parallel. Each consumer should be able to subscribe and unsubscribe dynamically.

I have the following input data constraints:

  • Around 500 messages per second
  • Around 15,000 consumers
  • Around 500 categories
  • In most cases, each consumer is subscribed to 1-3 categories.

So far this is what I have:

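using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public class Test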
{
    static void Main()
    {
        var consumer1 = new Consumer("Consumer1");
        consumer1.SubscribeForCategory(1);
        consumer1.SubscribeForCategory(2);

        var consumer2 = new Consumer("Consumer2");
        consumer2.SubscribeForCategory(2);
        consumer2.SubscribeForCategory(3);
        consumer2.SubscribeForCategory(4);

        var consumer3 = new Consumer("Consumer3");
        consumer3.SubscribeForCategory(3);
        consumer3.SubscribeForCategory(4);

        var consumers = new[] {consumer1, consumer2, consumer3};
        var publisher = new Publisher(consumers);

        var message1 = new Message(1, "message1 test");
        var message2 = new Message(2, "message2");
        var message3 = new Message(1, "message3");
        var message4 = new Message(3, "message4 test");
        var message5 = new Message(4, "message5");
        var message6 = new Message(3, "message6");

        var messages = new[] {message1, message2, message3, message4, message5, message6};

        foreach (var message in messages)
        {
            publisher.Publish(message);
        }

        Console.ReadLine();
    }
}

public class Message
{
    public Message(int categoryId, string data)
    {
        CategoryId = categoryId;
        Data = data;
    }

    public int CategoryId { get; }

    public string Data { get; }
}

public class Publisher
{
    private readonly IEnumerable<Consumer> _consumers;

    public Publisher(IEnumerable<Consumer> consumers)
    {
        _consumers = consumers;
    }

    public void Publish(Message message)
    {
        IEnumerable<Consumer> consumers = _consumers.Where(c => c.CategoryIds.Contains(message.CategoryId));
        foreach (Consumer consumer in consumers)
        {
            consumer.AddMessage(message);
        }
    }
}

public class Consumer
{
    private readonly HashSet<int> _categoryIds;
    private readonly ActionBlock<Message> _queue;

    public Consumer(string name)
    {
        Name = name;
        _categoryIds = new HashSet<int>();

        _queue = new ActionBlock<Message>(async m => { await Foo(m); }, 
                                          new ExecutionDataflowBlockOptions 
                                          {
                                              MaxDegreeOfParallelism = 1, 
                                              SingleProducerConstrained = true
                                          });
    }

    public string Name { get; }

    public IReadOnlyCollection<int> CategoryIds => _categoryIds;

    public void AddMessage(Message message)
    {
        bool accepted = _queue.Post(message);
        if (!accepted)
        {
            Console.WriteLine("Message has been rejected!");
        }
    }

    public void SubscribeForCategory(int categoryId)
    {
        _categoryIds.Add(categoryId);
    }

    private async Task Foo(Message message)
    {
        // process message
        await Task.Delay(10);

        if (message.Data.Contains("test"))
        {
            _categoryIds.Remove(message.CategoryId);
        }

        Console.WriteLine($"[{Name}] - category id: [{message.CategoryId}] data: [{message.Data}]");
    }
}

Unfortunately, there are several issues with that solution:

  1. While a consumer is processing a message, it can unsubscribe from a category whose messages have already been added to its ActionBlock input queue.
  2. In Publisher.Publish() I iterate over each consumer's category collection, and later the consumer's Foo method may remove a category from that collection, which leads to the following exception: System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
  3. I am also not sure whether it is a good idea to put the "dispatching logic" into publisher.Publish().

One possible solution is to forward all messages to each consumer (and let each consumer decide whether or not to process them), but I am afraid that this is going to be much slower.

I am aware of actor model-based frameworks like Akka.Net and Microsoft Orleans, but I want all of this to happen in-process (if it's achievable, of course).

Does anyone have a more elegant solution? Do you have any suggestions on how I can improve the current approach?

I think that the entity Category is missing from your model, and adding it will improve your model not only conceptually but also performance-wise. Each category can hold a list of the consumers that are subscribed to it, making it trivial to send a message only to the subscribed consumers.

To solve the thread-safety issue, my suggestion is to use immutable collections instead of mutable HashSet<T>s or List<T>s. Immutable collections offer the advantage that they can be updated safely and atomically with low-lock techniques (the ImmutableInterlocked.Update method), and can provide at any time a snapshot of their contents that is unaffected by future modifications. If you are asking how it is possible to mutate an immutable collection, the answer is that you are not mutating it; instead you are replacing the reference with a different immutable collection. These structures are implemented in a way that allows high reusability of their internal bits and pieces. For example, adding an item to an ImmutableHashSet<T> that already holds 1,000,000 items does not require the allocation of a new memory block that contains all the old items plus the new one. Only a handful of tiny objects (nodes in the internal binary tree) will be allocated.

This convenience comes at a price: most operations on immutable collections are at least 10 times slower than the same operations on their mutable counterparts. Most probably this overhead will be negligible in the grand scheme of things, but you may want to profile and measure it yourself, and judge whether it is impactful or not.
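The snapshot behaviour described above can be illustrated with a minimal sketch. The class name ImmutableSnapshotDemo and the use of plain int IDs are assumptions made for illustration only; ImmutableInterlocked.Update, Volatile.Read and Parallel.For are the real library calls.

```csharp
using System.Collections.Immutable;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical demo class, not part of the original answer.
public static class ImmutableSnapshotDemo
{
    private static ImmutableHashSet<int> _ids = ImmutableHashSet<int>.Empty;

    public static (int SnapshotCount, int FinalCount) Run()
    {
        // Start from an empty set so the method can be called repeatedly.
        Volatile.Write(ref _ids, ImmutableHashSet<int>.Empty);

        ImmutableInterlocked.Update(ref _ids, set => set.Add(1));
        ImmutableInterlocked.Update(ref _ids, set => set.Add(2));

        // A snapshot is just the current reference; later updates replace
        // the field and never touch the object we are holding.
        ImmutableHashSet<int> snapshot = Volatile.Read(ref _ids);

        // 1000 concurrent additions; Update retries internally on contention.
        Parallel.For(3, 1003, i =>
            ImmutableInterlocked.Update(ref _ids, set => set.Add(i)));

        // Enumerating the snapshot is safe and still sees only two items.
        int seen = 0;
        foreach (int id in snapshot) seen++;

        return (seen, Volatile.Read(ref _ids).Count);
    }
}
```

Calling ImmutableSnapshotDemo.Run() should return (2, 1002): the snapshot keeps its two items while the field ends up with all 1002, and no "Collection was modified" exception can occur because nothing is ever modified in place.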

The Category class:

public class Category
{
    private ImmutableHashSet<Consumer> _consumers;

    public int Id { get; }
    public ImmutableHashSet<Consumer> Consumers => Volatile.Read(ref _consumers);

    public Category(int id)
    {
        this.Id = id;
        _consumers = ImmutableHashSet.Create<Consumer>();
    }

    public void SubscribeConsumer(Consumer consumer) =>
        ImmutableInterlocked.Update(ref _consumers, col => col.Add(consumer));

    public void UnsubscribeConsumer(Consumer consumer) =>
        ImmutableInterlocked.Update(ref _consumers, col => col.Remove(consumer));
}

Notice the Volatile.Read, which ensures that the most recent reference stored in the _consumers field will be immediately visible to all threads accessing the Consumers property.

The Consumer class:

public class Consumer
{
    private readonly ActionBlock<Message> _block;
    private IImmutableList<Category> _categories;

    public string Name { get; }
    public IImmutableList<Category> Categories => Volatile.Read(ref _categories);

    public Consumer(string name)
    {
        this.Name = name;
        _categories = ImmutableArray.Create<Category>();
        _block = new ActionBlock<Message>(async message =>
        {
            if (!Categories.Any(cat => cat.Id == message.CategoryId)) return;
            // Process message...
        });
    }

    public void SendMessage(Message message)
    {
        bool accepted = _block.Post(message);
        Debug.Assert(accepted);
    }

    public void SubscribeForCategory(Category category)
    {
        ImmutableInterlocked.Update(ref _categories, col => col.Add(category));
        category.SubscribeConsumer(this);
    }

    public void UnsubscribeForCategory(Category category)
    {
        ImmutableInterlocked.Update(ref _categories, col => col.Remove(category));
        category.UnsubscribeConsumer(this);
    }
}

Notice that the SubscribeForCategory method also has the responsibility of adding the reverse relation (category -> consumer). In the above implementation these two relations are not added atomically with respect to each other, meaning that an observer could see a consumer subscribed to a category while the category does not yet list that consumer. From your description it seems that no such observer exists in your app, so this inconsistency probably doesn't matter too much.

The Publisher class needs to hold a list of categories, instead of consumers:

public class Publisher
{
    private readonly Dictionary<int, Category> _categories;

    public Publisher(IEnumerable<Category> categories)
    {
        _categories = categories.ToDictionary(cat => cat.Id);
    }

    public void Publish(Message message)
    {
        var category = _categories[message.CategoryId];
        foreach (Consumer consumer in category.Consumers)
            consumer.SendMessage(message);
    }
}

Notice how much simpler the Publish method is.

The TPL Dataflow library already provides what you want. Its blocks aren't queues, they're the actual producers and consumers. You could remove almost all of the code you added. You could even use a LINQ query to create and link the "publisher" and "consumers":

// publisher, options and blockOptions are defined in the Explanation section
var n=10;
var consumers=( from i in Enumerable.Range(0,n)
                let categories=new ConcurrentDictionary<int,int>()
                select new { 
                             Block=new ActionBlock<Message>(msg=>Consume(msg,categories)
                                                        ,blockOptions),
                             Categories=categories
                }).ToArray();

foreach(var pair in consumers)
{
    publisher.LinkTo(pair.Block,options,msg=>IsAllowed(msg,pair.Categories));
}

bool IsAllowed(Message msg,ConcurrentDictionary<int,int> categories)
{
    return categories.ContainsKey(msg.CategoryId);
}

async Task Consume(Message message,ConcurrentDictionary<int,int> categories)
{
    if (message.Data.Contains("test"))
    {
        // TryRemove requires an out parameter; the removed value is not needed
        categories.TryRemove(message.CategoryId, out _);
    }
    // ...
}

It's no accident that the blocks work with functions. The Dataflow library and the CSP paradigm it's based on are very different from OOP, and much closer to functional programming.

By the way, TPL Dataflow grew out of the Microsoft Robotics Framework and the Concurrency Runtime. In robotics and automation there are a lot of microprocessors exchanging messages. Dataflow is specifically built to create complex processing meshes and handle lots of messages.

Explanation

Dataflow isn't a set of queues; it contains active blocks that are meant to be linked in a pipeline. An ActionBlock isn't a queue, it has a queue. In reality it's a consumer, typically found at the tail of a pipeline. A TransformBlock receives incoming messages, processes them one by one, then sends them to any linked blocks.

Blocks are linked, so you don't need to manually take messages from one block and pass them to another. A link can contain a predicate, used to filter the messages accepted by target blocks. It's possible to cut a link by calling Dispose on it.

Assuming this is the "consumer" method:

async Task Consume(Message message)
{
    await Task.Delay(100);
    Console.WriteLine($"Category id: [{message.CategoryId}] data: [{message.Data}]");
}

You can create a few ActionBlocks, perhaps in an array:

var consumers=new[]{
     new ActionBlock<Message>(Consume),
     new ActionBlock<Message>(Consume),
     new ActionBlock<Message>(Consume)
};

Each action block could use a different delegate of course.

The "head" of the pipeline should probably be a TransformBlock. In this case, the Publisher doesn't do anything except get linked to the target blocks. At least we can print something:

Message PassThrough(Message message)
{
    Console.WriteLine("Incoming");
    return message;
}

var publisher=new TransformBlock<Message,Message>(PassThrough);

You can link the "publisher" to the "consumers" with LinkTo:

var options=new DataflowLinkOptions { PropagateCompletion=true};

var link1=publisher.LinkTo(consumers[0],options, msg=>msg.CategoryId % 3==0);
var link2=publisher.LinkTo(consumers[1],options, msg=>msg.CategoryId % 3==1);
var link3=publisher.LinkTo(consumers[2],options, msg=>msg.CategoryId % 3==2);

Messages produced by the "publisher" block will be sent to the first target whose link predicate accepts them. Messages are offered to links in the order the links were created. If no link accepts a message, it stays at the head of the output queue and blocks every message behind it.

In real scenarios one should always ensure that all messages are handled, or that there is a block that can handle anything that doesn't match:

publisher.LinkTo(theOtherBlock,options);
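If unmatched messages can simply be dropped, one option is to link DataflowBlock.NullTarget<T>() last, so it swallows whatever every earlier link declined. The sketch below is only an illustration under that assumption: the class name DiscardUnmatchedDemo is made up, and plain int category IDs stand in for Message objects.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class, not part of the original answer.
public static class DiscardUnmatchedDemo
{
    public static async Task<int[]> RunAsync()
    {
        var seen = new ConcurrentQueue<int>();
        var publisher = new BufferBlock<int>();
        var consumer = new ActionBlock<int>(id => seen.Enqueue(id));
        var options = new DataflowLinkOptions { PropagateCompletion = true };

        // The consumer only accepts category 1.
        publisher.LinkTo(consumer, options, id => id == 1);

        // Linked last: receives (and discards) anything no earlier link took.
        // Without it, the 7 below would sit at the head of the output queue
        // and the second 1 would never be delivered.
        publisher.LinkTo(DataflowBlock.NullTarget<int>());

        foreach (int id in new[] { 1, 7, 1 })
            publisher.Post(id);

        publisher.Complete();
        await consumer.Completion;
        return seen.ToArray();
    }
}
```

With the catch-all link in place the consumer should receive both category 1 messages and then complete normally.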

The link1, link2, link3 objects are just IDisposables. They can be used to break a link:

link2.Dispose();

Links can be created and broken at any time, changing the shape of the pipeline (or mesh in more complex designs) as needed. Any messages already posted to a target block's queue won't be discarded if a link is broken or modified, though.
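As a rough sketch of unsubscribing at runtime, the following shows how routing changes once a link is disposed. Everything here is illustrative: UnlinkDemo, the subscriber/fallback names and the int category IDs are assumptions, and the TaskCompletionSource is only used to make the timing of the demo predictable.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class, not part of the original answer.
public static class UnlinkDemo
{
    public static async Task<(int[] Subscriber, int[] Fallback)> RunAsync()
    {
        var subscriberSeen = new ConcurrentQueue<int>();
        var fallbackSeen = new ConcurrentQueue<int>();
        var handled = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        var publisher = new BufferBlock<int>();
        var subscriber = new ActionBlock<int>(id =>
        {
            subscriberSeen.Enqueue(id);
            handled.TrySetResult(true);
        });
        var fallback = new ActionBlock<int>(id => fallbackSeen.Enqueue(id));

        // Subscribe: even IDs go to the subscriber, the rest fall through.
        IDisposable link = publisher.LinkTo(subscriber, id => id % 2 == 0);
        publisher.LinkTo(fallback,
            new DataflowLinkOptions { PropagateCompletion = true });

        publisher.Post(1);   // declined by the predicate, taken by fallback
        publisher.Post(2);   // taken by the subscriber
        await handled.Task;  // wait until the subscriber has handled 2

        // Unsubscribe: from now on the subscriber is never offered anything.
        link.Dispose();
        publisher.Post(4);   // would have matched, now goes to fallback

        publisher.Complete();
        subscriber.Complete(); // no longer linked, so complete it manually
        await Task.WhenAll(subscriber.Completion, fallback.Completion);

        return (subscriberSeen.ToArray(), fallbackSeen.ToArray());
    }
}
```

In this sketch the subscriber should end up with only message 2, while the fallback block receives 1 and then 4.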

To reduce the number of unwanted messages we can add a bound to each block's input queue:

// ActionBlock takes ExecutionDataflowBlockOptions, which derives from DataflowBlockOptions
var blockOptions=new ExecutionDataflowBlockOptions { BoundedCapacity=1 };

var consumers=new[]{
     new ActionBlock<Message>(Consume,blockOptions),
     new ActionBlock<Message>(Consume,blockOptions),
     new ActionBlock<Message>(Consume,blockOptions)
};
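A bounded block changes how producers must talk to it: Post returns false immediately when the block is full, while SendAsync waits for room. The sketch below tries to make that visible; BoundedCapacityDemo and the gate used to hold the block busy are assumptions for illustration, not part of the original answer.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class, not part of the original answer.
public static class BoundedCapacityDemo
{
    public static async Task<(bool First, bool Second, bool ThirdPending,
                              bool Third, int[] Processed)> RunAsync()
    {
        var processed = new ConcurrentQueue<int>();
        var gate = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        var block = new ActionBlock<int>(async id =>
        {
            await gate.Task;        // keep the block busy until released
            processed.Enqueue(id);
        }, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

        bool first = block.Post(1);   // accepted: the block was empty
        bool second = block.Post(2);  // rejected: the single slot is taken
        Task<bool> third = block.SendAsync(3); // postponed until there is room
        bool thirdPending = !third.IsCompleted;

        gate.SetResult(true);         // let message 1 finish
        bool thirdAccepted = await third;

        block.Complete();
        await block.Completion;
        return (first, second, thirdPending, thirdAccepted, processed.ToArray());
    }
}
```

Here message 2 is lost because Post does not wait, whereas message 3 is delivered once the slot frees up, which is why SendAsync is usually the safer choice with bounded blocks.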

To change the accepted messages dynamically, we can store the values in e.g. a ConcurrentDictionary, because a predicate may be checking a message at the same time a consumer modifies the permitted values:

ConcurrentDictionary<int,int>[] categories=new[] {
    new ConcurrentDictionary<int,int>(),
    new ConcurrentDictionary<int,int>(),
    new ConcurrentDictionary<int,int>(),
};

async Task Consume(Message message,ConcurrentDictionary<int,int> categories)
{
    if (message.Data.Contains("test"))
    {
        // TryRemove requires an out parameter; the removed value is not needed
        categories.TryRemove(message.CategoryId, out _);
    }
    // ...
}

And the "consumers" change to:

var consumers=new[]{
     new ActionBlock<Message>(msg=>Consume(msg,categories[0])),
     new ActionBlock<Message>(msg=>Consume(msg,categories[1])),
     new ActionBlock<Message>(msg=>Consume(msg,categories[2]))
};

It's better to create a separate method for the link predicate:

bool IsAllowed(Message msg,ConcurrentDictionary<int,int> categories)
{
    return categories.ContainsKey(msg.CategoryId);
}

var link1=publisher.LinkTo(consumers[0],options, msg=>IsAllowed(msg,categories[0]));
var link2=publisher.LinkTo(consumers[1],options, msg=>IsAllowed(msg,categories[1]));

One could create all these with LINQ and Enumerable.Range. Whether that's a good idea is another matter:

var n=10;
var consumers=( from i in Enumerable.Range(0,n)
                         let categories=new ConcurrentDictionary<int,int>()
                         select new { 
                             Block=new ActionBlock<Message>(msg=>Consume(msg,categories)
                                                        ,blockOptions),
                             Categories=categories
                         }).ToArray();

foreach(var pair in consumers)
{
    publisher.LinkTo(pair.Block,options,msg=>IsAllowed(msg,pair.Categories));
}

No matter how the mesh is built, publishing to it is the same. Use SendAsync on the head block:

for(int i=0;i<1000;i++)
{
    var msg=new Message(...);
    await publisher.SendAsync(msg);
}
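Once publishing is finished, the mesh can be shut down by completing the head block and awaiting the tails, provided the links were created with PropagateCompletion=true as shown earlier. The following is a minimal sketch under assumptions: CompletionDemo is a made-up name, int IDs replace Message, and two consumers split the traffic by parity.

```csharp
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class, not part of the original answer.
public static class CompletionDemo
{
    public static async Task<int> RunAsync()
    {
        var handled = new ConcurrentQueue<int>();
        var publisher = new TransformBlock<int, int>(id => id);
        var consumers = new[]
        {
            new ActionBlock<int>(id => handled.Enqueue(id)),
            new ActionBlock<int>(id => handled.Enqueue(id))
        };

        var options = new DataflowLinkOptions { PropagateCompletion = true };
        publisher.LinkTo(consumers[0], options, id => id % 2 == 0);
        publisher.LinkTo(consumers[1], options, id => id % 2 == 1);

        for (int i = 0; i < 1000; i++)
            await publisher.SendAsync(i);

        // Signal that no more messages will arrive, then wait for every
        // consumer to drain its queue; completion flows through the links.
        publisher.Complete();
        await Task.WhenAll(consumers.Select(c => c.Completion));

        return handled.Count;
    }
}
```

Because every ID matches one of the two predicates, nothing is left stuck in the publisher and all 1000 messages should be handled before RunAsync returns.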
