
Nested Iteratees

I am working with a particular database where, after a successful query, you can access a group of chunks of the resulting data using a specific command:

getResultData :: IO (ResponseCode, ByteString)

Now getResultData will return a response code and some data, where the response codes look like this:

response = GET_DATA_FAILED | OPERATION_SUCCEEDED | NO_MORE_DATA

The ByteString is one, some, or all of the chunks:

Data http://desmond.imageshack.us/Himg189/scaled.php?server=189&filename=chunksjpeg.png&res=medium

The story does not end here. There exists a stream of groups:

Stream http://desmond.imageshack.us/Himg695/scaled.php?server=695&filename=chunkgroupsjpeg.png&res=medium

Upon receiving a NO_MORE_DATA response from getResultData, a call to getNextItem advances the stream, allowing me to start calling getResultData again. Once getNextItem returns STREAM_FINISHED, that's all she wrote; I have my data.
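For readers who want to see the protocol spelled out, here is a minimal sketch of the plain driver loop, without any iteratee library. Everything database-related is an assumption: the mock functions, the String stand-in for ByteString, the NextCode type and its NEXT_ITEM_READY constructor are hypothetical, since the post only names STREAM_FINISHED. It also collects every group in memory, which is exactly the naive behaviour a streaming solution would avoid.

```haskell
import Data.IORef

-- Response codes named in the question.
data ResponseCode = GET_DATA_FAILED | OPERATION_SUCCEEDED | NO_MORE_DATA
  deriving (Eq, Show)

-- Hypothetical result type for getNextItem; only STREAM_FINISHED is
-- mentioned in the question, NEXT_ITEM_READY is invented for this sketch.
data NextCode = NEXT_ITEM_READY | STREAM_FINISHED
  deriving (Eq, Show)

type Chunk = String  -- stand-in for ByteString

-- Mock getResultData: the IORef holds the groups not yet fully consumed.
-- The last chunk of a group is returned together with NO_MORE_DATA.
mockGetResultData :: IORef [[Chunk]] -> IO (ResponseCode, Chunk)
mockGetResultData ref = do
  groups <- readIORef ref
  case groups of
    ((c : cs) : rest) -> do
      writeIORef ref (cs : rest)
      return (if null cs then NO_MORE_DATA else OPERATION_SUCCEEDED, c)
    ([] : _) -> return (NO_MORE_DATA, "")
    []       -> return (GET_DATA_FAILED, "")

-- Mock getNextItem: drop the exhausted group and report what is left.
mockGetNextItem :: IORef [[Chunk]] -> IO NextCode
mockGetNextItem ref = do
  modifyIORef ref (drop 1)
  groups <- readIORef ref
  return (if null groups then STREAM_FINISHED else NEXT_ITEM_READY)

-- Read every chunk of the current group.
readGroup :: IO (ResponseCode, Chunk) -> IO [Chunk]
readGroup getData = do
  (code, chunk) <- getData
  case code of
    OPERATION_SUCCEEDED -> (chunk :) <$> readGroup getData
    NO_MORE_DATA        -> return [chunk]
    GET_DATA_FAILED     -> ioError (userError "getResultData failed")

-- Read group after group until the stream is finished.
readStream :: IO (ResponseCode, Chunk) -> IO NextCode -> IO [[Chunk]]
readStream getData next = do
  grp  <- readGroup getData
  code <- next
  case code of
    NEXT_ITEM_READY -> (grp :) <$> readStream getData next
    STREAM_FINISHED -> return [grp]

main :: IO ()
main = do
  ref <- newIORef [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]
  result <- readStream (mockGetResultData ref) (mockGetNextItem ref)
  print result  -- [["a1","a2"],["b1"],["c1","c2","c3"]]
```

The two nested loops (chunks within a group, groups within the stream) are the structure that the rest of this thread tries to express with iteratees.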

Now, I wish to remodel this with either Data.Iteratee or Data.Enumerator. Although my existing Data.Iteratee solution works, it seems very naive, and I feel I should be modeling this with nested iteratees rather than one big iteratee blob, which is how my solution is currently implemented.

I have been looking at the code of Data.Iteratee 0.8.6.2 and I am a bit confused when it comes to the nested stuff.

Are nested iteratees the proper course of action? If so, how would one model this with nested iteratees?

Regards

I think nested iteratees are the correct approach, but this case has some unique problems which make it slightly different from most common examples.

Chunks and groups

The first problem is to get the data source right. Basically, the logical divisions you've described would give you a stream equivalent to [[ByteString]]. If you create an enumerator to produce this directly, each element within the stream would be a full group of chunks, which presumably you wish to avoid (for memory reasons). You could flatten everything into a single [ByteString], but then you'd need to re-introduce boundaries, which would be pretty wasteful since the db is doing it for you.

Ignoring the stream of groups for now, it appears that you need to divide the data into chunks yourself. I would model this as:

enumGroup :: Enumerator ByteString IO a
enumGroup = enumFromCallback cb ()
 where
  cb () = do
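    (code, bytes) <- getResultData
    case code of
        -- True: more data may follow; False: this was the last piece
        OPERATION_SUCCEEDED -> return $ Right ((True, ()), bytes)
        NO_MORE_DATA        -> return $ Right ((False, ()), bytes)
        -- MyException is a placeholder; the callback expects a SomeException
        -- here, so a custom exception would be wrapped with toException
        GET_DATA_FAILED     -> return $ Left MyException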

Since chunks are of a fixed size, you can easily chunk this with Data.Iteratee.group.

enumGroupChunked :: Iteratee [ByteString] IO a -> IO (Iteratee ByteString IO a)
enumGroupChunked = enumGroup . joinI . group groupSize

Compare the type of this to Enumerator:

type Enumerator s m a = Iteratee s m a -> m (Iteratee s m a)

So enumGroupChunked is basically a fancy enumerator which changes the stream type. This means that it takes a [ByteString] iteratee consumer and returns an iteratee which consumes plain bytestrings. Often the return type of an enumerator doesn't matter; it's simply an iteratee which you evaluate with run (or tryRun) to get at the output, so you could do the same here:

evalGroupChunked :: Iteratee [ByteString] IO a -> IO a
evalGroupChunked i = enumGroupChunked i >>= run

If you have more complicated processing to do on each group, the easiest place to do so would be in the enumGroupChunked function.

Stream of groups

Now that this is out of the way, what should you do about the stream of groups? The answer depends on how you want to consume them. If you want to treat each group in the stream essentially independently, I would do something similar to this:

foldStream :: Iteratee [ByteString] IO a -> (b -> a -> b) -> b -> IO b
foldStream iter f acc0 = do
  val <- evalGroupChunked iter
  res <- getNextItem
  case res of 
        OPERATION_SUCCEEDED -> foldStream iter f $! f acc0 val
        NO_MORE_DATA        -> return $ f acc0 val
        GET_DATA_FAILED     -> error "had a problem"

However, let's say you want to do some sort of stream processing of the entire dataset, not just individual groups. That is, you have a

bigProc :: Iteratee [ByteString] IO a

that you want to run over the entire dataset. This is where the return iteratee of an enumerator is useful. Some earlier code will be slightly different now:

enumGroupChunked' :: Iteratee [ByteString] IO a
  -> IO (Iteratee ByteString IO (Iteratee [ByteString] IO a))
enumGroupChunked' = enumGroup . group groupSize

procStream :: Iteratee [ByteString] IO a -> IO a
procStream iter = do
  i' <- enumGroupChunked' iter >>= run
  res <- getNextItem
  case res of 
        OPERATION_SUCCEEDED -> procStream i'
        NO_MORE_DATA        -> run i'
        GET_DATA_FAILED     -> error "had a problem"

This usage of nested iteratees (i.e. Iteratee s1 m (Iteratee s2 m a)) is slightly uncommon, but it's particularly helpful when you want to sequentially process data from multiple Enumerators. The key is to recognize that running the outer iteratee will give you an iteratee which is ready to receive more data. It's a model that works well in this case, because you can enumerate each group independently but process them as a single stream.
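To make the "ready to receive more data" idea concrete, here is a toy model of a resumable consumer. It is deliberately not the Data.Iteratee type: Consumer, counter, feed and finish are all names invented for this sketch, and error handling is left out. It only shows that a consumer fed from one source without an end-of-stream signal can be handed on to a second source.

```haskell
-- A toy resumable consumer; NOT the Data.Iteratee type, just the same idea.
data Consumer s a
  = Done a                            -- finished with a result
  | More (Maybe [s] -> Consumer s a)  -- wants input; Nothing = end of stream

-- Count every element seen until end of stream.
counter :: Int -> Consumer s Int
counter n = More step
  where
    step Nothing   = Done n
    step (Just xs) = counter (n + length xs)

-- Feed the chunks of one source WITHOUT signalling end of stream, so the
-- consumer that comes back is still ready for more input.
feed :: [[s]] -> Consumer s a -> Consumer s a
feed (x : xs) (More k) = feed xs (k (Just x))
feed _        c        = c

-- Signal end of stream and extract the result, if the consumer cooperates.
finish :: Consumer s a -> Maybe a
finish (Done a) = Just a
finish (More k) = case k Nothing of
  Done a -> Just a
  More _ -> Nothing  -- a consumer that ignores end of stream

main :: IO ()
main = do
  let groupA = [[1, 2, 3], [4, 5]]  :: [[Int]]
      groupB = [[6], [7, 8, 9, 10]] :: [[Int]]
      afterA = feed groupA (counter 0)  -- still a live consumer
      afterB = feed groupB afterA       -- same consumer, second source
  print (finish afterB)  -- Just 10
```

In this picture, afterA plays the role of the inner iteratee i' returned by running the outer one: it carries its state from group A into group B, and only finish ends the stream.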

One caution: the inner iteratee will be in whatever state it was left in. Suppose that the last chunk of a group may be smaller than a full chunk, e.g.

   Group A               Group B               Group C
   1024, 1024, 512       1024, 1024, 1024      1024, 1024, 1024

What will happen in this case is that, because group is combining data into chunks of size 1024, it will combine the last chunk of Group A with the first 512 bytes of Group B. This isn't a problem with the foldStream example because that code terminates the inner iteratee (with joinI). That means the groups are truly independent, so you have to treat them as such. If you want to combine the groups as in procStream, you have to think of the entire stream. If this is your case, then you'll need to use something more sophisticated than just group.
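The boundary effect can be checked with plain lists. The sketch below does not use Data.Iteratee at all: it models every byte as the name of its group (an assumption made purely for illustration), re-chunks at 1024, and reports which chunks mix bytes from two groups. The helper names are invented for this example.

```haskell
import qualified Data.List as L

-- Sizes of the chunks in each group, taken from the example above.
groups :: [(Char, [Int])]
groups =
  [ ('A', [1024, 1024, 512])
  , ('B', [1024, 1024, 1024])
  , ('C', [1024, 1024, 1024])
  ]

-- Represent each byte by the name of the group it belongs to.
bytesOf :: (Char, [Int]) -> String
bytesOf (name, sizes) = replicate (sum sizes) name

-- Split a list into pieces of n elements; the last piece may be shorter.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- For one chunk, list how many bytes each group contributed.
summary :: String -> [(Char, Int)]
summary chunk = [(c, length grp) | grp@(c : _) <- L.group chunk]

-- Re-chunk the whole stream at once and keep only chunks mixing groups.
mixedChunks :: [[(Char, Int)]]
mixedChunks =
  filter ((> 1) . length) (map summary (chunksOf 1024 (concatMap bytesOf groups)))

-- Re-chunk each group separately: no chunk can mix groups.
mixedWhenIndependent :: [[(Char, Int)]]
mixedWhenIndependent =
  filter ((> 1) . length) (map summary (concatMap (chunksOf 1024 . bytesOf) groups))

main :: IO ()
main = do
  print mixedChunks           -- [[('A',512),('B',512)],[('B',512),('C',512)]]
  print mixedWhenIndependent  -- []
```

Note that the misalignment propagates: once Group A ends on a short chunk, the B/C boundary is split as well, which is why the whole stream has to be considered when groups are combined.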

Data.Iteratee vs Data.Enumerator

Without getting into a debate on the merits of either package, not to mention IterIO (I'm admittedly biased), I would like to point out what I consider the most significant difference between the two: the abstraction of the stream.

In Data.Iteratee, a consumer Iteratee ByteString m a operates on a notional ByteString of some length, with access to a single chunk of ByteString at one time.

In Data.Enumerator, a consumer Iteratee ByteString m a operates on a notional [ByteString], with access to one or more elements (bytestrings) at one time.

This means that most Data.Iteratee operations are element-focused: with an Iteratee ByteString they operate on a single Word8, whereas Data.Enumerator operations are chunk-focused, operating on a ByteString.

You can think of Data.Iteratee.Iteratee [s] m a === Data.Enumerator.Iteratee s m a.
