Mutable, (possibly parallel) Haskell code and performance tuning

Question

I have now implemented another SHA3 candidate, namely Grøstl. This is still very much a work in progress, but at the moment a 224-bit version passes all KATs. So now I'm wondering about performance (again :->). The difference this time is that I chose to mirror the (optimized) C implementation more closely, i.e. I made a port from C to Haskell. The optimized C version uses table lookups to implement the algorithm. Furthermore, the code is heavily based on updating an array containing 64-bit words. Thus I chose to use mutable unboxed vectors in Haskell.

My Grøstl code can be found here: https://github.com/hakoja/SHA3/blob/master/Data/Digest/GroestlMutable.hs

Short description of the algorithm: It's a Merkle-Damgård construction, iterating a compression function (f512M in my code) as long as there are 512-bit blocks of message left. The compression function is very simple: it runs two different, independent 512-bit permutations P and Q (permP and permQ in my code) and combines their output. It is these permutations which are implemented by lookup tables.
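To make the structure concrete, here is a minimal sketch of the construction described above, written with plain lists instead of unboxed vectors for brevity. The names compress, hashBlocks, toyP and toyQ are made up for illustration, and toyP/toyQ are trivial stand-ins, not the real Grøstl permutations. The combining step (XOR of the chaining value with both permutation outputs) follows the f512 code shown further down in this question.

```haskell
import Data.Bits (rotateL, xor)
import Data.List (foldl')
import Data.Word (Word64)

-- A block or chaining value; in Groestl-224 this would be eight 64-bit words.
type Block = [Word64]

-- Stand-in permutations (assumption: NOT the real table-driven P and Q).
toyP, toyQ :: Block -> Block
toyP = map (`rotateL` 1)
toyQ = map (`rotateL` 3)

-- One compression step: f(h, m) = P(h xor m) xor Q(m) xor h
compress :: (Block -> Block) -> (Block -> Block) -> Block -> Block -> Block
compress p q h m = zipWith3 xor3 h (p (zipWith xor h m)) (q m)
  where
    xor3 a b c = a `xor` b `xor` c

-- Merkle-Damgard iteration: fold the compression function over all blocks,
-- starting from the initial chaining value.
hashBlocks :: (Block -> Block) -> (Block -> Block) -> Block -> [Block] -> Block
hashBlocks p q = foldl' (compress p q)
```

Because P and Q take independent inputs within one step, they are the natural candidates for running in parallel, which is what the second part of Q2 is about.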

Q1) The first thing that bothers me is that the use of mutable vectors makes my code look really fugly. This is my first time writing any major mutable code in Haskell, so I don't really know how to improve this. Any tips on how I might better structure the monadic code would be welcome.

Q2) The second is performance. Actually it's not too bad, because at the moment the Haskell code is only about 3 times slower. Using GHC-7.2.1 and compiling as such:

ghc -O2 -Odph -fllvm -optlo-O3 -optlo-loop-reduce -optlo-loop-deletion

the Haskell code takes 60s on an input of ~1GB, while the C version takes 21-22s. But there are some things I find odd:

(1) If I try to inline rnd512QM, the code takes 4 times longer, but if I inline rnd512PM nothing happens! Why is this happening? These two functions are virtually identical!

(2) This is maybe more difficult. I've been experimenting with executing the two permutations in parallel, but so far to no avail. This is one example of what I tried:

f512 h m = V.force outP `par` (V.force outQ `pseq` (V.zipWith3 xor3 h outP outQ))
   where xor3 x1 x2 x3 = x1 `xor` x2 `xor` x3
         inP = V.zipWith xor h m
         outP = permP inP
         outQ = permQ m

When checking the run-time statistics and using ThreadScope, I noticed that the correct number of sparks was created, but almost none were actually converted to useful parallel work. Thus I gained nothing in speedup. My questions then become:

Are the P and Q functions just too small for the runtime to bother to run in parallel?
If not, is my use of par and pseq (and possibly Vector.Unboxed.force) wrong?
Would I gain anything by switching to strategies? And how would I go about doing that?

Thank you so much for your time.

EDIT:

Sorry for not providing any real benchmark tests. The testing code in the repo was intended for myself only. For those wanting to test the code out, you will need to compile main.hs, and then run it as:

./main "algorithm" "testvariant" "byte aligned"

For instance:

./main groestl short224 False

or

./main groestl e False

(e stands for "Extreme". It's the very long message provided with the NIST KATs.)

Answer 1

I checked out the repo, but there's no simple benchmark to just run and play with, so my ideas are just from eyeballing the code. Numbering is unrelated to your questions.

1) I'm pretty sure force doesn't do what you want -- it actually forces a copy of the underlying vector.

2) I think the use of unsafeThaw and unsafeFreeze is sort of odd. I'd just put f512M in the ST monad and be done with it. Then run it something like so:

otherwise = \msg -> truncate G224 . outputTransformation . runST $ foldM f512M h0_224 (parseMessage dataBitLen 512 msg)

3) V.foldM' is sort of silly -- you can just use a normal (strict) foldM over a list -- folding over the vector in the second argument doesn't seem to buy anything.

4) I'm dubious about the bangs in columnM and for the unsafeReads.
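As a rough illustration of points 2 and 3, the shape could look something like the sketch below. It assumes the vector package; stepM is a made-up stand-in for f512M that merely XORs each block into the state in place, and hashST is likewise a hypothetical name. The initial value is copied once with thaw, so no unsafeThaw is needed, and the final unsafeFreeze is fine because the mutable vector is never touched again afterwards.

```haskell
import Control.Monad (foldM, forM_)
import Control.Monad.ST (ST, runST)
import Data.Bits (xor)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

-- Hypothetical stand-in for f512M: update the chaining value in place.
-- Assumes the block is no longer than the state vector.
stepM :: MV.MVector s Word64 -> V.Vector Word64 -> ST s (MV.MVector s Word64)
stepM h m = do
  forM_ [0 .. V.length m - 1] $ \i -> do
    x <- MV.read h i
    MV.write h i (x `xor` (m V.! i))
  return h

-- Pure wrapper: thaw the initial value once, fold over a plain list of
-- message blocks inside ST, and freeze the result at the end.
hashST :: V.Vector Word64 -> [V.Vector Word64] -> V.Vector Word64
hashST h0 blocks = runST $ do
  h  <- V.thaw h0
  h' <- foldM stepM h blocks
  V.unsafeFreeze h'
```

With this layout all mutation stays inside a single runST, and the rest of the program only ever sees immutable vectors.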

Also...

a) I suspect that xoring unboxed vectors can probably be implemented at a lower level than zipWith, making use of Data.Vector internals.

b) However, it may be better not to do this as it could interfere with vector fusion.

c) On inspection, extractByte looks slightly inefficient? Rather than using fromIntegral to truncate, maybe use mod or quot and then a single fromIntegral to take you directly to an Int.
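For point c, something along these lines is what I have in mind. This is only a sketch: extractByteSketch is a made-up name, I'm assuming byte 0 is the least significant byte (the repo may index from the other end), and I've used a shift and a mask, which for an unsigned word is the same as quot and mod by powers of 256.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word64)

-- Hypothetical replacement for extractByte: pick byte i of a 64-bit word
-- (assumption: i = 0 is the least significant byte) and convert to Int
-- with a single fromIntegral.
extractByteSketch :: Int -> Word64 -> Int
extractByteSketch i w = fromIntegral ((w `shiftR` (8 * i)) .&. 0xff)
```

Whether this actually beats the current version would need checking in the generated Core or with a benchmark.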

Answer 2

1) Be sure to compile with -threaded -rtsopts and execute with +RTS -N2. Without that, you won't have more than one OS thread to perform computations.

2) Try to spark computations that are referred to elsewhere, otherwise they might be collected:


f512 h m = outP `par` (outQ `pseq` (V.zipWith3 xor3 h outP outQ))
   where xor3 x1 x2 x3 = x1 `xor` x2 `xor` x3
         inP = V.zipWith xor h m
         outP = V.force $ permP inP
         outQ = V.force $ permQ m


3) If you switch things up so parseBlock accepts strict bytestrings (or chunks and packs lazy ones when needed) then you can use Data.Vector.Storable and potentially avoid some copying.
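To illustrate the advice about sparking computations that are referred to elsewhere, here is a small self-contained sketch of the same par/pseq pattern. It imports par and pseq from GHC.Conc in base so that it needs no extra package; they are also exported by Control.Parallel in the parallel package. parPair, workP and workQ are made-up names, and the two sums merely stand in for the permutation results.

```haskell
import Data.List (foldl')
import GHC.Conc (par, pseq)

-- Spark the first value, evaluate the second on the current thread, then
-- return both. The sparked thunk is the very same one that appears in the
-- result, so it stays referenced. Note that par and pseq only evaluate to
-- weak head normal form; that is enough for an Int, but not for a lazy
-- structure.
parPair :: a -> b -> (a, b)
parPair a b = a `par` (b `pseq` (a, b))

-- Hypothetical stand-ins for the outputs of the two permutations.
workP, workQ :: Int
workP = foldl' (+) 0 [1 .. 1000000]
workQ = foldl' (+) 0 [1 .. 2000000]

-- Combine the two results once both are available.
combined :: Int
combined = let (p, q) = parPair workP workQ in p + q
```

This only has a chance of running on two cores when built with -threaded and started with +RTS -N2, as in point 1. If each spark does too little work, the overhead can still outweigh any gain.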
