Nested Iteratees
I am working with a particular database where, upon a successful query, you are able to access a group of chunks of the resulting data using a specific command:
getResultData :: IO (ResponseCode, ByteString)
Now getResultData will return a response code and some data, where the response codes look like this:
data ResponseCode = GET_DATA_FAILED | OPERATION_SUCCEEDED | NO_MORE_DATA
The ByteString is one, some, or all of the chunks:
Data http://desmond.imageshack.us/Himg189/scaled.php?server=189&filename=chunksjpeg.png&res=medium
The story does not end here. There exists a stream of groups:
Stream http://desmond.imageshack.us/Himg695/scaled.php?server=695&filename=chunkgroupsjpeg.png&res=medium
Upon receiving a NO_MORE_DATA response from getResultData, a call to getNextItem will advance the stream, allowing me to start calling getResultData again. Once getNextItem returns STREAM_FINISHED, that's all she wrote; I have my data.
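The two-level protocol described above can be sketched as a plain pair of loops. This is only an illustration under stated assumptions: the `MockDb` type and the `mock*` functions are hypothetical in-memory stand-ins for the real database calls, `String` replaces `ByteString` to stay within base, and the last chunk of a group is assumed to arrive together with NO_MORE_DATA.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Response codes from the question. STREAM_FINISHED is added here because
-- getNextItem is said to return it; the real type may differ.
data ResponseCode
  = GET_DATA_FAILED
  | OPERATION_SUCCEEDED
  | NO_MORE_DATA
  | STREAM_FINISHED
  deriving (Eq, Show)

-- Hypothetical stand-in for the database: the groups still to be read,
-- each a list of pending chunks.
type MockDb = IORef [[String]]

newMockDb :: [[String]] -> IO MockDb
newMockDb = newIORef

-- Return the next chunk of the current group (assumed behaviour: the
-- final chunk is delivered together with NO_MORE_DATA).
mockGetResultData :: MockDb -> IO (ResponseCode, String)
mockGetResultData db = do
  groups <- readIORef db
  case groups of
    ((c : cs) : rest) -> do
      writeIORef db (cs : rest)
      return (if null cs then NO_MORE_DATA else OPERATION_SUCCEEDED, c)
    _ -> return (NO_MORE_DATA, "")

-- Advance to the next group, or report that the stream is finished.
mockGetNextItem :: MockDb -> IO ResponseCode
mockGetNextItem db = do
  groups <- readIORef db
  case drop 1 groups of
    []   -> writeIORef db [] >> return STREAM_FINISHED
    rest -> writeIORef db rest >> return OPERATION_SUCCEEDED

-- Inner loop: read chunks until NO_MORE_DATA.
drainGroup :: MockDb -> IO [String]
drainGroup db = do
  (code, chunk) <- mockGetResultData db
  case code of
    OPERATION_SUCCEEDED -> (chunk :) <$> drainGroup db
    NO_MORE_DATA        -> return [chunk]
    _                   -> ioError (userError "getResultData failed")

-- Outer loop: drain a group, advance, repeat until STREAM_FINISHED.
drainStream :: MockDb -> IO [[String]]
drainStream db = do
  g <- drainGroup db
  code <- mockGetNextItem db
  case code of
    OPERATION_SUCCEEDED -> (g :) <$> drainStream db
    STREAM_FINISHED     -> return [g]
    _                   -> ioError (userError "getNextItem failed")
```

For example, `newMockDb [["a","b"],["c"]] >>= drainStream` yields `[["a","b"],["c"]]`. This version materialises every group in memory, which is the kind of thing a streaming design would try to avoid.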
Now, I wish to remodel this phenomenon with either Data.Iteratee or Data.Enumerator. Although my existing Data.Iteratee solution works, it seems very naive, and I feel I should be modeling this with nested iteratees rather than one big iteratee blob, which is how my solution is currently implemented.
I have been looking at the code of Data.Iteratee 0.8.6.2 and I am a bit confused when it comes to the nested stuff.
Are nested iteratees the proper course of action? If so, how would one model this with nested iteratees?
Regards
I think nested iteratees are the correct approach, but this case has some unique problems which make it slightly different from most common examples.
Chunks and groups
The first problem is to get the data source right. Basically the logical divisions you've described would give you a stream equivalent to [[ByteString]]. If you create an enumerator to produce this directly, each element within the stream would be a full group of chunks, which presumably you wish to avoid (for memory reasons). You could flatten everything into a single [ByteString], but then you'd need to re-introduce boundaries, which would be pretty wasteful since the db is doing it for you.
Ignoring the stream of groups for now, it appears that you need to divide the data into chunks yourself. I would model this as:
enumGroup :: Enumerator ByteString IO a
enumGroup = enumFromCallback cb ()
  where
    cb () = do
      -- named `bytes` because `data` is a reserved word in Haskell
      (code, bytes) <- getResultData
      case code of
        OPERATION_SUCCEEDED -> return $ Right ((True, ()), bytes)
        NO_MORE_DATA        -> return $ Right ((False, ()), bytes)
        GET_DATA_FAILED     -> return $ Left MyException
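To make the role of the Bool flag in the callback concrete, here is a minimal base-only model of a callback-driven enumerator. It is a sketch, not the iteratee library's implementation: `Callback`, `enumFromCallbackModel`, `listCallback` and `totalLength` are hypothetical names, errors are plain strings, and the consumer is reduced to a left fold. It assumes, as the callback above does, that the data returned alongside a False flag is still delivered before enumeration stops.

```haskell
import Data.Functor.Identity (Identity, runIdentity)

-- Shape of the callback result: an error, or ((call again?, new state), chunk).
type Callback st m s = st -> m (Either String ((Bool, st), s))

-- Keep calling the callback and feed each chunk into a left fold.
-- Assumption: the chunk that comes with a False flag is fed too.
enumFromCallbackModel
  :: Monad m
  => Callback st m s -> st -> (acc -> s -> acc) -> acc -> m (Either String acc)
enumFromCallbackModel cb = go
  where
    go st step acc = do
      r <- cb st
      case r of
        Left err -> return (Left err)
        Right ((more, st'), chunk) ->
          let acc' = step acc chunk
          in acc' `seq` (if more then go st' step acc' else return (Right acc'))

-- Example callback: the state is the list of chunks not yet delivered.
listCallback :: Callback [String] Identity String
listCallback []       = return (Right ((False, []), ""))
listCallback [c]      = return (Right ((False, []), c))
listCallback (c : cs) = return (Right ((True, cs), c))

-- Total number of characters across all chunks.
totalLength :: [String] -> Either String Int
totalLength chunks =
  runIdentity (enumFromCallbackModel listCallback chunks (\n c -> n + length c) 0)
```

Here `totalLength ["ab", "cde", "f"]` evaluates to `Right 6`: the final chunk `"f"` arrives with a False flag and is still counted.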
Since chunks are of a fixed size, you can easily chunk this with Data.Iteratee.group.
enumGroupChunked :: Iteratee [ByteString] IO a -> IO (Iteratee ByteString IO a)
enumGroupChunked = enumGroup . joinI . group groupSize
Compare the type of this to Enumerator:
type Enumerator s m a = Iteratee s m a -> m (Iteratee s m a)
So enumGroupChunked is basically a fancy enumerator which changes the stream type. This means that it takes a [ByteString] iteratee consumer, and returns an iteratee which consumes plain bytestrings. Often the return type of an enumerator doesn't matter; it's simply an iteratee which you evaluate with run (or tryRun) to get at the output, so you could do the same here:
evalGroupChunked :: Iteratee [ByteString] IO a -> IO a
evalGroupChunked i = enumGroupChunked i >>= run
If you have more complicated processing to do on each group, the easiest place to do so would be in the enumGroupChunked function.
Stream of groups
Now that this is out of the way, what to do about the stream of groups? The answer depends on how you want to consume them. If you want to essentially treat each group in the stream independently, I would do something similar to this:
foldStream :: Iteratee [ByteString] IO a -> (b -> a -> b) -> b -> IO b
foldStream iter f acc0 = do
  val <- evalGroupChunked iter
  res <- getNextItem
  case res of
    OPERATION_SUCCEEDED -> foldStream iter f $! f acc0 val
    NO_MORE_DATA        -> return $ f acc0 val
    GET_DATA_FAILED     -> error "had a problem"
However, let's say you want to do some sort of stream processing of the entire dataset, not just individual groups. That is, you have a
bigProc :: Iteratee [ByteString] IO a
that you want to run over the entire dataset. This is where the return iteratee of an enumerator is useful. Some earlier code will be slightly different now:
enumGroupChunked' :: Iteratee [ByteString] IO a
                  -> IO (Iteratee ByteString IO (Iteratee [ByteString] IO a))
enumGroupChunked' = enumGroup . group groupSize

-- The result is produced in IO, so the return type is IO a rather than a.
procStream :: Iteratee [ByteString] IO a -> IO a
procStream iter = do
  i' <- enumGroupChunked' iter >>= run
  res <- getNextItem
  case res of
    OPERATION_SUCCEEDED -> procStream i'
    NO_MORE_DATA        -> run i'
    GET_DATA_FAILED     -> error "had a problem"
This usage of nested iteratees (i.e. Iteratee s1 m (Iteratee s2 m a)) is slightly uncommon, but it's particularly helpful when you want to sequentially process data from multiple Enumerators. The key is to recognize that running the outer iteratee will give you an iteratee which is ready to receive more data. It's a model that works well in this case, because you can enumerate each group independently but process them as a single stream.
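The idea of getting back a consumer that is still ready for more input can be illustrated with a tiny base-only model. This is a sketch of the concept only, not the Data.Iteratee types: `Consumer`, `enumGroupModel`, `sumLengths` and `procStreamModel` are hypothetical names, groups are plain lists, and `String` stands in for `ByteString`.

```haskell
-- A hypothetical minimal consumer: it can always accept another chunk,
-- and it can report its result if no more input arrives.
data Consumer s a = Consumer
  { feed   :: s -> Consumer s a  -- accept one more chunk
  , finish :: a                  -- result if the input ends here
  }

-- Enumerate one group into a consumer and hand the consumer back in
-- whatever state it reached, without finishing it.
enumGroupModel :: [s] -> Consumer s a -> Consumer s a
enumGroupModel chunks c = foldl feed c chunks

-- Example consumer: sums the lengths of all chunks it is fed.
sumLengths :: Consumer String Int
sumLengths = go 0
  where
    go n = Consumer { feed = \s -> go (n + length s), finish = n }

-- Each group is enumerated separately, but a single consumer is threaded
-- through all of them and only finished at the very end.
procStreamModel :: [[s]] -> Consumer s a -> a
procStreamModel groups c = finish (foldl (flip enumGroupModel) c groups)
```

With this model, `procStreamModel [["ab", "c"], ["def"]] sumLengths` evaluates to `6`: the consumer's state survives the boundary between the two groups, which is the behaviour procStream relies on.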
One caution: the inner iteratee will be in whatever state it was left in. Suppose that the last chunk of a group may be smaller than a full chunk, e.g.
Group A            Group B             Group C
1024, 1024, 512    1024, 1024, 1024    1024, 1024, 1024
What will happen in this case is that, because group is combining data into chunks of size 1024, it will combine the last chunk of Group A with the first 512 bytes of Group B. This isn't a problem with the foldStream example because that code terminates the inner iteratee (with joinI). That means the groups are truly independent, so you have to treat them as such. If you want to combine the groups as in procStream, you have to think of the entire stream. If this is your case, then you'll need to use something more sophisticated than just group.
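The boundary effect is easy to reproduce with a pure, scaled-down model. The sketch below is an illustration only: `chunksOf`, `perGroup` and `wholeStream` are hypothetical helpers over plain lists rather than Data.Iteratee.group, and a chunk size of 4 stands in for 1024.

```haskell
-- Split a list into pieces of at most n elements (a pure stand-in for
-- fixed-size grouping).
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n <= 0 || null xs = []
  | otherwise = let (h, t) = splitAt n xs in h : chunksOf n t

-- Regroup every group on its own, as foldStream effectively does:
-- group boundaries are preserved and a short last chunk stays short.
perGroup :: Int -> [[a]] -> [[[a]]]
perGroup n = map (chunksOf n)

-- Regroup the concatenated stream, as procStream effectively does:
-- a short last chunk is topped up from the start of the next group.
wholeStream :: Int -> [[a]] -> [[a]]
wholeStream n = chunksOf n . concat

-- Group A holds 10 elements (4, 4, 2); group B holds 12 (4, 4, 4).
groupA, groupB :: String
groupA = replicate 10 'A'
groupB = replicate 12 'B'
```

Here `perGroup 4 [groupA, groupB]` gives `[["AAAA","AAAA","AA"],["BBBB","BBBB","BBBB"]]`, while `wholeStream 4 [groupA, groupB]` gives `["AAAA","AAAA","AABB","BBBB","BBBB","BB"]`. The third chunk straddles the boundary, and every later chunk is shifted as a result.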
Data.Iteratee vs Data.Enumerator
Without getting into a debate of the merits of either package, not to mention IterIO (I'm admittedly biased), I would like to point out what I consider the most significant difference between the two: the abstraction of the stream.
In Data.Iteratee, a consumer Iteratee ByteString m a operates on a notional ByteString of some length, with access to a single chunk of ByteString at one time.
In Data.Enumerator, a consumer Iteratee ByteString m a operates on a notional [ByteString], with access to one or more elements (bytestrings) at one time.
This means that most Data.Iteratee operations are element-focused, that is, with an Iteratee ByteString they'll operate on a single Word8, whereas Data.Enumerator operations are chunk-focused, operating on a ByteString.
You can think of Data.Iteratee.Iteratee [s] m a === Data.Enumerator.Iteratee s m a.