
How To Increase Sklearn GMM predict() Performance Speed?

I am using Sklearn to estimate a Gaussian Mixture Model (GMM) on some data.

After the estimation, I have many query points. For each of them, I would like to obtain the probability of it belonging to each of the estimated Gaussians.

The code below works. However, the gmm_sk.predict_proba(query_points) part is very slow, as I need to run it multiple times on 100000 sets of samples, where each sample set contains 1000 points.

I guess this happens because it runs sequentially. Is there a way to make it parallel? Or any other way to make it faster? Maybe on the GPU using TensorFlow?

I saw that TensorFlow has its own GMM algorithm, but it was very hard to implement.

Here is the code I have written:

import numpy as np
from sklearn.mixture import GaussianMixture
import time


n_gaussians = 1000
covariance_type = 'diag'
points = np.array(np.random.rand(10000, 3), dtype=np.float32)
query_points = np.array(np.random.rand(1000, 3), dtype=np.float32)
start = time.time()

# GMM with sklearn
gmm_sk = GaussianMixture(n_components=n_gaussians, covariance_type=covariance_type)
gmm_sk.fit(points)
mid_t = time.time()
elapsed = time.time() - start
print("learning took "+ str(elapsed))

temp = []
for i in range(2000):
    temp.append(gmm_sk.predict_proba(query_points))

end_t = time.time() - mid_t
print("predictions took " + str(end_t))

I solved it! Using multiprocessing, I just replaced

temp = []
for i in range(2000):
    temp.append(gmm_sk.predict_proba(query_points))

with

import multiprocessing as mp

# par_gmm is a worker function (not shown in the original answer) that
# calls gmm_sk.predict_proba on a single query point
query_points = query_points.tolist()
parallel = mp.Pool()
fv = parallel.map(par_gmm, query_points)
parallel.close()
parallel.join()
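Note that predict_proba is already vectorized over rows, so another option, which avoids the overhead of spawning processes, is to stack all sample sets into one array, make a single call, and split the result back. The sizes below are illustrative assumptions, smaller than the 100000-set workload in the question:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
points = rng.random((2000, 3)).astype(np.float32)
gmm_sk = GaussianMixture(n_components=16, covariance_type='diag').fit(points)

# 50 sample sets of 100 query points each (stand-ins for the real data)
sample_sets = [rng.random((100, 3)).astype(np.float32) for _ in range(50)]

# Stack all sets into one array, call predict_proba once, split back.
stacked = np.vstack(sample_sets)                 # shape (50*100, 3)
all_probs = gmm_sk.predict_proba(stacked)        # shape (50*100, 16)
per_set = np.split(all_probs, len(sample_sets))  # list of (100, 16) arrays
```

One large call lets NumPy/BLAS do the batching internally, which is usually faster than many small calls in a Python loop.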

You could speed up the process if you fit with a 'diag' or 'spherical' covariance matrix instead of 'full'.

Use:

covariance_type='diag'

or

covariance_type='spherical'

inside GaussianMixture.
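The speedup comes from each covariance type storing (and evaluating) fewer parameters per component. A small sketch, with illustrative sizes, showing what each type stores:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.random((500, 3))

# Fewer covariance parameters per component means cheaper fit() and
# predict_proba() calls.
shapes = {}
for cov_type in ['full', 'diag', 'spherical']:
    gmm = GaussianMixture(n_components=4, covariance_type=cov_type).fit(X)
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])
# full      -> (4, 3, 3)  full matrix per component
# diag      -> (4, 3)     one variance per feature per component
# spherical -> (4,)       a single variance per component
```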

Also, try to decrease the number of Gaussian components.

However, keep in mind that this may affect the results; I cannot see another way to speed up the process.

I see that the number of Gaussian components in your GMM is 1000, which I think is a very large number given that your data dimensionality is relatively low (3). This is probably the reason it runs slowly, since it needs to evaluate 1000 separate Gaussians. If your sample count is low, this is also very prone to overfitting. You can try a lower number of components, which will naturally be faster and will most likely generalize better.
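One principled way to pick a smaller component count is to compare candidate models with an information criterion such as BIC, which GaussianMixture exposes directly. A minimal sketch, where the candidate grid and data sizes are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
points = rng.random((1000, 3)).astype(np.float32)

# Fit candidate models and keep the one with the lowest BIC
# (lower BIC = better fit after penalizing model complexity).
candidates = [4, 16, 64]
models = [GaussianMixture(n_components=k, covariance_type='diag',
                          random_state=0).fit(points) for k in candidates]
best = min(models, key=lambda m: m.bic(points))
print("selected n_components:", best.n_components)
```

The selected model is both cheaper to evaluate in predict_proba and less likely to overfit than a 1000-component mixture.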
