
Fast Monte-Carlo simulation with numpy?

I'm following the exercises from "Doing Bayesian Data Analysis" in both R and Python.

I would like to find a fast method of doing Monte-Carlo simulation that uses constant space.

The problem below is trivial, but serves as a good test for different methods:

ex 4.3

Determine the exact probability of drawing a 10 from a shuffled pinochle deck. (In a pinochle deck, there are 48 cards. There are six values: 9, 10, Jack, Queen, King, Ace. There are two copies of each value in each of the standard four suits: hearts, diamonds, clubs, spades.)

(A) What is the probability of getting a 10?

Of course, the answer is 1/6 (8 tens out of 48 cards).

The fastest solution I could find (comparable to the speed of R) is generating a large array of card draws using np.random.choice, then applying a Counter. I don't like the idea of creating arrays unnecessarily, so I tried using a dictionary and a for loop, drawing one card at a time and incrementing the count for that type of card. To my surprise, it was much slower!

The full code is below for the 3 methods I tested. Is there a way of doing this that will be as performant as method1(), but using constant space?

Python code: (Google Colab link)

import random
from collections import Counter, defaultdict

import numpy as np
import pandas as pd

deck = [c for c in ['9','10','Jack','Queen','King','Ace'] for _ in range(8)]
num_draws = 1000000

def method1():
  draws = np.random.choice(deck, size=num_draws, replace=True)
  df = pd.DataFrame([Counter(draws)])/num_draws
  print(df)
  
def method2():
  card_counts = defaultdict(int)
  for _ in range(num_draws):
    card_counts[np.random.choice(deck, replace=True)] += 1
  df = pd.DataFrame([card_counts])/num_draws
  print(df)
  
def method3():
  card_counts = defaultdict(int)
  for _ in range(num_draws):
    card_counts[deck[random.randint(0, len(deck)-1)]] += 1
  df = pd.DataFrame([card_counts])/num_draws
  print(df)

Python timeit() results:

method1: 1.2997

method2: 23.0626

method3: 5.5859

R code:

deck <- rep(c('9','10','Jack','Queen','King','Ace'), each=8)  # 48-card pinochle deck
numDraws <- 1000000
card = sample(deck, numDraws, replace=TRUE)
print(as.data.frame(table(card)/numDraws))

Here's one with np.unique + np.bincount:

def unique():
    # Convert card strings to integer codes once, sample the codes,
    # then count with np.bincount and normalize.
    unq,ids = np.unique(deck, return_inverse=True)
    all_ids = np.random.choice(ids, size=num_draws, replace=True)
    ar = np.bincount(all_ids)/num_draws
    return pd.DataFrame(ar[None], columns=unq)

How does NumPy help here?

There are a few major improvements helping us here:

  1. We convert the string data to numeric. NumPy works well with such data. To achieve this, we are using np.unique.

  2. We use np.bincount to replace the counting step. Again, it works well with numeric data, which we have from the numeric conversion done at the start of this method (see the short demo after this list).

  3. NumPy in general works well with large data, which is the case here.
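
A minimal sketch of those two steps on the same deck (the values shown in the comments assume NumPy's default lexicographic sort order):

import numpy as np

deck = [c for c in ['9','10','Jack','Queen','King','Ace'] for _ in range(8)]

# Step 1: convert strings to integer codes.
# unq holds the sorted unique labels; ids[i] is the index of deck[i] in unq.
unq, ids = np.unique(deck, return_inverse=True)
print(unq)       # ['10' '9' 'Ace' 'Jack' 'King' 'Queen']
print(ids[:10])  # [1 1 1 1 1 1 1 1 0 0] -- the eight '9's map to index 1, then '10's to 0

# Step 2: count the codes in one vectorized pass.
print(np.bincount(ids))  # [8 8 8 8 8 8] -- eight copies of each value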


Timings on the given sample dataset, compared against the fastest of the earlier approaches, method1:

In [177]: %timeit method1()
328 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [178]: %timeit unique()
12.4 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy achieves efficiency by running C code in its numerical engine. Python is convenient, but it is orders of magnitude slower than C.

In Numpy and other high-performance Python libraries, the Python code consists mostly of glue code, preparing the task to be dispatched. Since there is per-call overhead, it is much faster to draw a lot of samples at once.

Remember that providing a buffer of 1 million elements for Numpy to work on is still constant space. You can then sample 1 billion times by looping over such fixed-size batches.
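
As a sketch of that batched idea (the chunked_freqs helper below is illustrative, not from the original post): accumulate counts one fixed-size batch at a time, so memory stays bounded by batch_size no matter how many total draws you make.

import numpy as np

deck = [c for c in ['9','10','Jack','Queen','King','Ace'] for _ in range(8)]

def chunked_freqs(total_draws, batch_size=1_000_000):
    # O(batch_size) memory for any number of total draws.
    unq, ids = np.unique(deck, return_inverse=True)
    counts = np.zeros(len(unq), dtype=np.int64)
    remaining = total_draws
    while remaining > 0:
        n = min(batch_size, remaining)
        draws = np.random.choice(ids, size=n, replace=True)  # one batch-sized array at a time
        counts += np.bincount(draws, minlength=len(unq))
        remaining -= n
    return dict(zip(unq, counts / total_draws))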

This extra memory allocation is usually not a problem. If you must avoid using extra memory at all costs while still getting good performance, you can try using Numba or Cython to accelerate the plain loop:

import numpy as np
from numba import jit

@jit(nopython=True)
def method4():
    # Six counters, one per card value; the loop runs as compiled code.
    card_counts = np.zeros(6)
    for _ in range(num_draws):
        card_counts[np.random.randint(0, 6)] += 1
    return card_counts/num_draws
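
One caveat when timing this: the first call to method4() includes Numba's JIT compilation, so benchmark a later call. The result is six anonymous frequencies; a possible way to label them (any fixed labeling works here, since every bin is drawn with equal probability):

values = ['9', '10', 'Jack', 'Queen', 'King', 'Ace']
freqs = method4()  # first call compiles; call again before timing
print(dict(zip(values, freqs)))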
