GHC forkIO bimodal performance

I was benchmarking forkIO with the following code:

import System.Time.Extra
import Control.Concurrent
import Control.Monad
import Data.IORef


n = 200000

main :: IO ()
main = do
    bar <- newEmptyMVar
    count <- newIORef (0 :: Int)
    (d, _) <- duration $ do
        replicateM_ n $ do
            forkIO $ do
                v <- atomicModifyIORef' count $ \old -> (old + 1, old + 1)
                when (v == n) $ putMVar bar ()
        takeMVar bar
    putStrLn $ showDuration d

This spawns 200K threads, counts with an IORef how many have run, and finishes once they have all started. When run with GHC 8.10.1 on Windows using the command ghc --make -O2 Main -threaded && main +RTS -N4 the performance varies remarkably. Sometimes it takes > 1s (e.g. 1.19s) and sometimes it takes < 0.1s (e.g. 0.08s). It lands in the faster bucket about 1/6th of the time. Why the difference in performance? What causes it to go faster?

When I scale n up to 1M the effect goes away and it's always in the 5+s range.

I can confirm the same behavior on Ubuntu as well, except that when I set n=1M the behavior does not go away and the runtime ranges from 2 to 7 seconds for me.

I believe non-determinism of the scheduler is the cause of such a significant variance in runtime. This is not a definitive answer, of course, since it is merely my guess.

atomicModifyIORef' is implemented with CAS (compare-and-swap), so depending on how the threads are scheduled the function old + 1 will be recomputed more or fewer times. In other words, if thread B updates the count ref after thread A has started its own update but before A manages to finish it, A has to start the update operation over from the beginning, reading the newly updated value from the ref and recomputing old + 1 once again.
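To make the retry behaviour concrete, below is a minimal, single-threaded model of such a CAS-based modify loop. It is only a sketch: casModel is a hypothetical, non-atomic stand-in for a real compare-and-swap primitive (which in the RTS is a single atomic instruction), and atomicModifyModel is not the actual implementation of atomicModifyIORef'. The point is the control flow: whenever the CAS fails because another thread got in first, the pure function (here old + 1) is applied again.

import Data.IORef

-- Hypothetical, NON-atomic stand-in for a real compare-and-swap primitive;
-- it exists only to illustrate the shape of the retry loop.
casModel :: Eq a => IORef a -> a -> a -> IO Bool
casModel ref expected new = do
    cur <- readIORef ref
    if cur == expected
        then writeIORef ref new >> pure True
        else pure False

-- Sketch of a CAS-based atomic modify: f is re-applied on every failed CAS.
atomicModifyModel :: Eq a => IORef a -> (a -> (a, b)) -> IO b
atomicModifyModel ref f = loop
  where
    loop = do
        old <- readIORef ref
        let (new, result) = f old          -- recomputed on every retry
        ok <- casModel ref old new
        if ok then pure result else loop   -- lost the race, start over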

If you run main +RTS -N1, you will see that not only does the program take a lot less time to run, but the runtime is also pretty consistent between executions. I suspect this is because only one thread can run at any time and there is no preemption until atomicModifyIORef' is done.

Hopefully someone else with a deep understanding of the Haskell RTS can provide more insight into this behavior, but that is my take on it.

Edit

@NeilMitchel commented:

I'm not convinced it's anything to do with the atomic modification at all

In order to prove that the IORef is indeed at fault here, here is an implementation that uses PVar, which relies on casIntArray# underneath. Not only is it 10 times faster, but there is no variance observed:

import System.Time.Extra
import Control.Concurrent
import Control.Monad
import Data.Primitive.PVar -- from `pvar` package


n = 1000000

main :: IO ()
main = do
    bar <- newEmptyMVar
    count <- newPVar (0 :: Int)
    (d, _) <- duration $ do
        replicateM_ n $ do
            forkIO $ do
                v <- atomicModifyIntPVar count $ \old -> (old + 1, old + 1)
                when (v == n) $ putMVar bar ()
        takeMVar bar
    putStrLn $ showDuration d
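For comparison, this variant can presumably be built and run the same way as the original benchmark, e.g. ghc --make -O2 Main -threaded && main +RTS -N4, with the pvar package installed first.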
