
A MailboxProcessor that operates with a LIFO logic

I am learning about F# agents (MailboxProcessor).

I am dealing with a rather unconventional problem.

  • I have one agent (dataSource) which is a source of streaming data. The data has to be processed by an array of agents (dataProcessor). We can consider dataProcessor as some sort of tracking device.
  • Data may flow in faster than dataProcessor can process its input.
  • It is OK to have some delay. However, I have to ensure that the agent stays on top of its work and does not get buried under obsolete observations.

I am exploring ways to deal with this problem.

The first idea is to implement a stack (LIFO) in dataSource. dataSource would send over the latest observation available when dataProcessor becomes available to receive and process the data. This solution may work, but it may get complicated because dataProcessor may need to be blocked and re-activated, and to communicate its status to dataSource, leading to a two-way communication problem. This problem may boil down to a blocking queue in the producer-consumer problem, but I am not sure.

The second idea is to have dataProcessor take care of message sorting. In this architecture, dataSource simply posts updates to dataProcessor's queue, and dataProcessor uses Scan to fetch the latest data available in its queue. This may be the way to go. However, I am not sure whether the current design of MailboxProcessor makes it possible to clear a queue of messages, deleting the older obsolete ones. Furthermore, here, it is written that:

Unfortunately, the TryScan function in the current version of F# is broken in two ways. Firstly, the whole point is to specify a timeout but the implementation does not actually honor it. Specifically, irrelevant messages reset the timer. Secondly, as with the other Scan function, the message queue is examined under a lock that prevents any other threads from posting for the duration of the scan, which can be an arbitrarily long time. Consequently, the TryScan function itself tends to lock up concurrent systems and can even introduce deadlocks because the caller's code is evaluated inside the lock (e.g. posting from the function argument to Scan or TryScan can deadlock the agent when the code under the lock blocks waiting to acquire the lock it is already under).

Having the latest observation bounced back may be a problem. The author of this post, @Jon Harrop, suggests that

I managed to architect around it and the resulting architecture was actually better. In essence, I eagerly Receive all messages and filter using my own local queue.

This idea is surely worth exploring but, before starting to play around with code, I would welcome some input on how I could structure my solution.

Thank you.

It sounds like you might need a destructive-scan version of the mailbox processor. I implemented this with TPL Dataflow in a blog series that you might be interested in.

My blog is currently down for maintenance but I can point you to the posts in markdown format.

Part 1
Part 2
Part 3

You can also check out the code on GitHub.

I also wrote about the issues with Scan in my lurking horror post.

Hope that helps...

tl;dr I would try this: take the Mailbox implementation from FSharp.Actor or Zach Bray's blog post, replace ConcurrentQueue with ConcurrentStack (plus add some bounded capacity logic) and use this changed agent as a dispatcher to pass messages from dataSource to an army of dataProcessors implemented as ordinary MBPs or Actors.

tl;dr2 If workers are a scarce and slow resource and we need to process the message that is the latest at the moment a worker is ready, then it all boils down to an agent with a stack instead of a queue (with some bounded capacity logic) plus a BlockingQueue of workers. The dispatcher dequeues a ready worker, then pops a message from the stack and sends this message to the worker. After the job is done, the worker enqueues itself to the queue when it becomes ready (e.g. before let! msg = inbox.Receive()). The dispatcher's consumer thread then blocks until any worker is ready, while the producer thread keeps the bounded stack updated. (A bounded stack could be done with an array + offset + size inside a lock; the approach described below is more complex than that.)
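The "array + offset + size inside a lock" idea could look roughly like the sketch below. It is only an illustration: the type name BoundedStack and the choice to overwrite the oldest item when full are my assumptions, not something taken from FSharp.Actor or any other library.

```fsharp
/// Hypothetical fixed-capacity LIFO buffer.
/// Push overwrites the oldest item when the buffer is full;
/// TryPop returns the most recently pushed item.
type BoundedStack<'T>(capacity: int) =
    let items : 'T[] = Array.zeroCreate capacity
    let mutable start = 0   // index of the oldest stored item
    let mutable count = 0   // number of items currently stored
    let gate = obj ()

    member _.Push(item: 'T) =
        lock gate (fun () ->
            if count = capacity then
                // Full: overwrite the oldest slot and advance the offset.
                items.[start] <- item
                start <- (start + 1) % capacity
            else
                items.[(start + count) % capacity] <- item
                count <- count + 1)

    member _.TryPop() : 'T option =
        lock gate (fun () ->
            if count = 0 then
                None
            else
                count <- count - 1
                Some(items.[(start + count) % capacity]))

    member _.Count = lock gate (fun () -> count)
```

With a capacity of 3, pushing 1 through 5 and then popping yields 5, 4, 3: the two oldest observations have been silently dropped, which is the behaviour the question asks for.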

Details

MailboxProcessor is designed to have only one consumer. This is even commented on in the source code of MBP here (search for the word 'DRAGONS' :) )

If you post your data to an MBP then only one thread can take it from the internal queue or stack. In your particular use case I would use ConcurrentStack directly or, better, wrapped in a BlockingCollection:

  • It will allow many concurrent consumers
  • It is very fast and thread safe
  • BlockingCollection has a BoundedCapacity property that lets you limit the size of a collection. Add blocks while the collection is full (it throws only after adding has been marked complete), so use TryAdd if you need a non-blocking attempt. If A is the main stack and B is a standby, then TryAdd to A; on false, Add to B and swap the two with Interlocked.Exchange, then process the needed messages in A, clear it, and make a new standby. Use three stacks if processing A could take longer than it takes B to fill up again. In this way you do not block and do not lose any messages, but can discard unneeded ones in a controlled way.
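As a small illustration of the points above, here is a sketch of a bounded BlockingCollection backed by a ConcurrentStack. The capacity of 3 and the name latestFirst are arbitrary choices for the example.

```fsharp
open System.Collections.Concurrent

// A BlockingCollection backed by a ConcurrentStack hands items out newest-first.
// The second constructor argument is the bounded capacity.
let latestFirst = new BlockingCollection<int>(ConcurrentStack<int>(), 3)

// TryAdd returns false immediately instead of blocking once the capacity is reached.
let added = [ for i in 1 .. 4 -> latestFirst.TryAdd i ]

// Take removes the most recently added item (the top of the stack).
let newest = latestFirst.Take()
```

Here the fourth TryAdd is rejected because the collection already holds three items, and Take hands back 3, the last item that was accepted. What to do on a rejected TryAdd (drop the new item, or swap in a standby stack as described above) is a separate design decision.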

BlockingCollection has methods like AddToAny/TakeFromAny, which work on arrays of BlockingCollections. This could help, e.g.:

  • dataSource produces messages to a BlockingCollection with a ConcurrentStack implementation (BCCS)
  • another thread consumes messages from the BCCS and sends them to an array of processing BCCSs. You said that there is a lot of data, so you may sacrifice one thread to block and dispatch your messages indefinitely
  • each processing agent has its own BCCS or is implemented as an Agent/Actor/MBP to which the dispatcher posts messages. In your case you need to send each message to only one processing agent, so you may store the processing agents in a circular buffer and always dispatch a message to the least recently used processor.

Something like this:

            (data stream produces 'T)
                |
            [dispatcher's BCCS]
                |
            (a dispatcher thread consumes 'T  and pushes to processors, manages capacity of BCCS and LRU queue)
                 |                               |
            [processor1's BCCS/Actor/MBP] ... [processorN's BCCS/Actor/MBP]
                 |                               |
               (process)                         (process)
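A minimal sketch of this pipeline follows, under several assumptions of mine: messages are plain ints, the workers are ordinary MailboxProcessors that only record what they received, the buffer capacity of 100 is arbitrary, and the dispatcher rotates through the workers in round-robin order rather than tracking which worker is actually idle. All names (processed, makeWorker, buffer, dispatchLoop) are hypothetical.

```fsharp
open System.Collections.Concurrent
open System.Threading

// Records (worker id, message) pairs so the effect of dispatching is observable.
let processed = ConcurrentQueue<int * int>()

// Hypothetical worker: an ordinary MailboxProcessor with a single consumer loop.
let makeWorker (workerId: int) =
    MailboxProcessor<int>.Start(fun inbox ->
        let rec loop () = async {
            let! msg = inbox.Receive()
            processed.Enqueue((workerId, msg))
            return! loop () }
        loop ())

let workers = Array.init 3 makeWorker

// Dispatcher buffer: newest-first, bounded to 100 items.
let buffer = new BlockingCollection<int>(ConcurrentStack<int>(), 100)

// Dispatcher loop: blocks on the buffer and hands each item to the next worker in turn.
// It ends once CompleteAdding has been called and the buffer is empty.
let dispatchLoop () =
    let mutable next = 0
    for msg in buffer.GetConsumingEnumerable() do
        workers.[next].Post msg
        next <- (next + 1) % workers.Length

let dispatcher = Thread(ThreadStart(dispatchLoop), IsBackground = true)
dispatcher.Start()
```

dataSource would call buffer.TryAdd (or Add) for each observation. Note that posting to a worker's MailboxProcessor never blocks, so with slow workers the backlog simply moves into their inboxes; a queue of ready workers, as in tl;dr2, would be needed to keep stale data out of them.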

Instead of ConcurrentStack, you may want to read about the heap data structure. If "latest" is defined by some property of the messages, e.g. a timestamp, rather than by the order in which they arrive at the stack (e.g. if there could be delays in transit, so that arrival order <> creation order), you can get the latest message by using a heap.
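For instance, on .NET 6 or later the built-in PriorityQueue could serve as the heap. The sketch below assumes a hypothetical Observation record; PriorityQueue dequeues the smallest priority first, so the timestamp ticks are negated to get the most recent observation out first. PriorityQueue is not thread safe, so a real dispatcher would need to wrap it in a lock.

```fsharp
open System
open System.Collections.Generic

// Hypothetical message type carrying its own creation time.
type Observation = { Timestamp: DateTime; Value: float }

// Min-heap keyed on negated ticks, i.e. newest timestamp first.
let heap = PriorityQueue<Observation, int64>()
let push (o: Observation) = heap.Enqueue(o, -o.Timestamp.Ticks)

let t0 = DateTime(2024, 1, 1)
push { Timestamp = t0.AddSeconds 2.0; Value = 2.0 }
push { Timestamp = t0.AddSeconds 5.0; Value = 5.0 }   // created last, arrives second
push { Timestamp = t0.AddSeconds 3.0; Value = 3.0 }   // delayed in transit

// Dequeue returns the observation with the latest timestamp, not the last arrival.
let latest = heap.Dequeue()
```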

If you still need Agent semantics/API, you could read several sources in addition to Dave's links, and somehow adapt an implementation to multiple concurrent consumers:

  • An interesting article by Zach Bray on efficient Actor implementation. There you do need to replace (under the comment // Might want to schedule this call on another thread.) the line execute true with async { execute true } |> Async.Start or similar, because otherwise the producing thread will also be the consuming thread, which is not good for a single fast producer. However, for a dispatcher like the one described above this is exactly what is needed.

  • The FSharp.Actor (aka Fakka) development branch and the FSharp MBP source code (first link above) could be very useful for implementation details. The FSharp.Actor library has been frozen for several months but there is some activity in the dev branch.

  • You should not miss the discussion about Fakka in Google Groups in this context.

I have a somewhat similar use case and for the last two days I have researched everything I could find on F# Agents/Actors. This answer is a kind of TODO for myself to try these ideas, half of which were born while writing it.

The simplest solution is to greedily eat all messages in the inbox when one arrives and discard all but the most recent. This is easily done using TryReceive:

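// Keep draining the inbox without waiting; return the last message seen.
let rec readLatestLoop oldMsg =
  async { let! newMsg = inbox.TryReceive 0
          match newMsg with
          | None -> return oldMsg
          | Some newMsg -> return! readLatestLoop newMsg }
// Wait for one message, then skip ahead to the newest one already queued.
let readLatest() =
  async { let! msg = inbox.Receive()
          return! readLatestLoop msg }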
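The fragment above assumes an inbox that is already in scope. One way it might be wired into a complete agent is sketched below; latestOnlyAgent and handle are names I made up for the example, and the handler is assumed to be an ordinary synchronous function.

```fsharp
// Hypothetical wrapper: an agent that only ever handles the newest queued message.
let latestOnlyAgent (handle: 'T -> unit) =
    MailboxProcessor<'T>.Start(fun inbox ->
        // Drain whatever is already queued, keeping only the newest message.
        let rec readLatestLoop oldMsg = async {
            let! newMsg = inbox.TryReceive(0)
            match newMsg with
            | None -> return oldMsg
            | Some newMsg -> return! readLatestLoop newMsg }
        // Wait for at least one message, skip to the newest, handle it, repeat.
        let rec loop () = async {
            let! first = inbox.Receive()
            let! latest = readLatestLoop first
            handle latest
            return! loop () }
        loop ())
```

Something like latestOnlyAgent (printfn "processing %d") would then be the target of dataSource's Post calls. Anything posted while handle is busy piles up in the inbox, and all but the newest of those messages are dropped on the next iteration.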

When faced with the same problem I architected a more sophisticated and efficient solution that I called cancellable streaming and described in an F# Journal article here. The idea is to start processing messages and then cancel that processing if they are superseded. This significantly improves concurrency if significant processing is being done.
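The article itself is not reproduced here, so the following is only my guess at the general shape of the idea, not the article's implementation. It assumes the processing step is expressed as an Async workflow, so that cancellation takes effect at its next let!/do! point; cancellableAgent and work are hypothetical names.

```fsharp
open System.Threading

// Hypothetical agent: each new message cancels the job started for the previous one.
let cancellableAgent (work: 'T -> Async<unit>) =
    MailboxProcessor<'T>.Start(fun inbox ->
        let rec loop (current: CancellationTokenSource option) = async {
            let! msg = inbox.Receive()
            // A newer message supersedes the running job, so cancel it.
            current |> Option.iter (fun cts -> cts.Cancel())
            let cts = new CancellationTokenSource()
            // Start processing in the background; the agent goes straight back to Receive.
            Async.Start(work msg, cts.Token)
            return! loop (Some cts) }
        loop None)
```

If three messages are posted in quick succession to an agent whose work sleeps before producing a result, only the job for the third message should run to completion. The sketch does not dispose the CancellationTokenSource instances, and a job with no asynchronous wait points would not notice the cancellation.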
