Gaussian Mixture Model (GMM) gives a bad fit

Question

I have been using Scikit-learn's GMM functionality. To start, I simply created a distribution along the line x = y.

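from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib

# Legacy API: mixture.GMM was removed in scikit-learn 0.20 (GaussianMixture replaces it).
line_model = mixture.GMM(n_components=99)

# Create evenly spaced points between 0 and 1.
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)

# Fit the model to points that lie exactly on the line y = x.
# np.column_stack builds the (100, 2) sample array; on Python 3 a bare zip()
# is a one-shot iterator and cannot be passed to fit().
line_model.fit(np.column_stack([xs, ys]))
plt.plot(xs, ys)
plt.show()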

This produces the expected distribution:

Next, I evaluate the fitted GMM on a grid and plot the result:

#Create the x,y mesh that will be used to make a 3D plot
x_y_grid = []
for x in xs:
    for y in ys:
        x_y_grid.append([x,y])

# Calculate a log-probability for each point in the x,y grid (score returns log values).
x_y_z_grid = []
for x,y in x_y_grid:
    z = line_model.score([[x,y]])
    x_y_z_grid.append([x,y,z])

x_y_z_grid = np.array(x_y_z_grid)

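# Plot probabilities on the Z axis (exponentiate the log-probabilities).
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # fig.gca(projection=...) no longer works on recent matplotlib
ax.plot(x_y_z_grid[:,0], x_y_z_grid[:,1], np.exp(x_y_z_grid[:,2]))
plt.show()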

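The resulting probability distribution has some strange tails at x = 0 and x = 1, and extra probability in the corners (x = 1, y = 1 and x = 0, y = 0).

Figure: probability distribution, n = 99

Using n_components = 5 also shows this behavior:

Figure: probability distribution, n = 5

Is this inherent to GMMs, is there a problem with the implementation, or am I doing something wrong?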

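Edit: Getting scores from the model seems to get rid of this behavior. Should it?

I am training two models on the same dataset (x = y from x = 0 to x = 1). Simply checking the probabilities through the gmm's score method seems to eliminate this boundary effect. Why is that? I have attached the plots and code below.

Figure: evaluating the scores over different domains changes the plotted distribution.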

# Creates a line of 'observations' from (x_small_start, y_small_start)
# to (x_small_end, y_small_end). This is the data both gmms are trained on.
x_small_start = 0
x_small_end = 1
y_small_start = 0
y_small_end = 1

# These are the range of values that will be plotted
x_big_start = -1
x_big_end = 2
y_big_start = -1
y_big_end = 2


shorter_eval_range_gmm = mixture.GMM(n_components = 5)
longer_eval_range_gmm = mixture.GMM(n_components = 5)

x_small = np.linspace(x_small_start, x_small_end, 100)
y_small = np.linspace(y_small_start, y_small_end, 100)
x_big = np.linspace(x_big_start, x_big_end, 100)
y_big = np.linspace(y_big_start, y_big_end, 100)

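# Train both gmms on a distribution that's centered along y=x.
# np.column_stack builds the (100, 2) sample array (a bare zip() fails on Python 3).
shorter_eval_range_gmm.fit(np.column_stack([x_small, y_small]))
longer_eval_range_gmm.fit(np.column_stack([x_small, y_small]))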


#Create the x,y meshes that will be used to make a 3D plot
x_y_evals_grid_big = []
for x in x_big:
    for y in y_big:
        x_y_evals_grid_big.append([x,y])
x_y_evals_grid_small = []

for x in x_small:
    for y in y_small:
        x_y_evals_grid_small.append([x,y])

# Calculate a log-probability for each point in the x,y grid (exponentiated when plotting).
x_y_z_plot_grid_big = []
for x,y in x_y_evals_grid_big:
    z = longer_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_big.append([x, y, z])
x_y_z_plot_grid_big = np.array(x_y_z_plot_grid_big)

x_y_z_plot_grid_small = []
for x,y in x_y_evals_grid_small:
    z = shorter_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_small.append([x, y, z])
x_y_z_plot_grid_small = np.array(x_y_z_plot_grid_small)


#Plot probabilities on the Z axis.
fig = plt.figure()
fig.suptitle("Probability of different x,y pairs")

ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot(x_y_z_plot_grid_big[:,0], x_y_z_plot_grid_big[:,1], np.exp(x_y_z_plot_grid_big[:,2]))
ax1.set_xlabel('X Label')
ax1.set_ylabel('Y Label')
ax1.set_zlabel('Probability')
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.plot(x_y_z_plot_grid_small[:,0], x_y_z_plot_grid_small[:,1], np.exp(x_y_z_plot_grid_small[:,2]))
ax2.set_xlabel('X Label')
ax2.set_ylabel('Y Label')
ax2.set_zlabel('Probability')

plt.show()

Answer 1

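There is nothing wrong with the fit; the problem is the visualization you are using. A hint is the straight line connecting (0,1,5) to (0,1,0), which is really just the rendering of a connection between two points (it is caused by the order in which the points are read). Although the two points at its ends are in your data, no other points actually lie on that line.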
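
The ordering effect can be made concrete with a small sketch. It rebuilds the question's nested-loop grid (with a hypothetical grid size n = 5 to keep the output short) and finds the places where consecutive points jump from the top of one column (y = 1) to the bottom of the next (y = 0). A line plot joins exactly those pairs, which is where the spurious straight segments come from.

```python
import numpy as np

n = 5  # hypothetical small grid size, chosen only to keep the output readable
xs = np.linspace(0, 1, n)
ys = np.linspace(0, 1, n)

# Same point order as the question's nested loops: x is the outer loop.
grid = np.array([[x, y] for x in xs for y in ys])

# A line plot connects point i to point i + 1. Within a column, y rises by one
# grid step; at a column boundary it falls from 1 back to 0.
dy = np.diff(grid[:, 1])
jump_idx = np.where(dy < 0)[0]

for i in jump_idx:
    print(f"segment drawn from {grid[i]} to {grid[i + 1]}")

n_jumps = len(jump_idx)
```

With a 100 x 100 grid the same thing happens 99 times, and each of those segments sweeps across the full y range even though no evaluated point lies along it.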

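Personally, for the reason above, I think using 3D line plots (wires) to represent a surface is a rather bad idea, and I would recommend a surface plot or a contour plot instead.

Try this: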

from sklearn import mixture
import numpy as np 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.atleast_2d(np.linspace(0, 1, 100)).T
ys = np.atleast_2d(np.linspace(0, 1, 100)).T

#Create a distribution that's centred along y=x
line_model.fit(np.concatenate([xs, ys], axis=1))
plt.scatter(xs, ys)
plt.show()

#Create the x,y mesh that will be used to make a 3D plot
X, Y = np.meshgrid(xs, ys)
x_y_grid = np.c_[X.ravel(), Y.ravel()]

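# Calculate a log-probability for every grid point in a single call
# (the legacy GMM.score returns one value per sample).
z = line_model.score(x_y_grid)
z = z.reshape(X.shape)

# Plot probabilities on the Z axis; exponentiate because score returns log-probabilities.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, np.exp(z))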
plt.show()
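
The snippet above is written against the legacy mixture.GMM class, which was removed in scikit-learn 0.20. A rough equivalent for current releases, assuming GaussianMixture and its score_samples method (which returns one log-density per sample), might look like the sketch below. The component count, the grid range and resolution, and random_state=0 are illustrative choices, not values from this answer.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Training data: 100 points lying exactly on the line y = x.
t = np.linspace(0, 1, 100)
data = np.column_stack([t, t])

# Assumption: modern API; n_components=5 mirrors the question's second experiment.
model = GaussianMixture(n_components=5, random_state=0).fit(data)

# Evaluate the density on a regular mesh in one vectorized call.
gx = np.linspace(-0.5, 1.5, 81)
gy = np.linspace(-0.5, 1.5, 81)
X, Y = np.meshgrid(gx, gy)
points = np.c_[X.ravel(), Y.ravel()]
log_density = model.score_samples(points).reshape(X.shape)
density = np.exp(log_density)  # score_samples returns log-densities

# density has the same shape as X and Y, so it can be passed directly to
# ax.plot_surface(X, Y, density) or plt.contourf(X, Y, density).
```

Because the values are kept on the mesh rather than flattened into one long polyline, neither a surface plot nor a contour plot introduces connecting segments between unrelated grid points.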

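From an academic standpoint, I am very uncomfortable with the goal of fitting a 1D line in 2D space with a 2D mixture model. Manifold learning with a GMM requires at least that the normal direction have zero variance, which reduces the Gaussian to a Dirac distribution. Numerically and analytically this is unstable and should be avoided (the gmm fit seems to apply some stabilization trick, since the variance of the model is rather large in the direction normal to the line).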
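
To make the degeneracy concrete, the sketch below computes the sample covariance of points lying exactly on y = x and inspects its eigenvalues. The variance along the line's normal is zero up to rounding, so the corresponding Gaussian would collapse; adding a small constant to the diagonal keeps the matrix invertible. The constant floor = 1e-3 is an illustrative value. scikit-learn exposes parameters of this kind (min_covar on the legacy GMM class, reg_covar on GaussianMixture), but this is only an illustration of the idea, not the library's fitting code.

```python
import numpy as np

# Points lying exactly on the line y = x.
t = np.linspace(0, 1, 100)
data = np.column_stack([t, t])

# Sample covariance; eigh returns eigenvalues in ascending order
# together with the matching eigenvectors (as columns).
cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
print("raw eigenvalues:", eigvals)
print("direction of smallest variance:", eigvecs[:, 0])  # the line's normal

floor = 1e-3  # illustrative value, not taken from the original post
cov_reg = cov + floor * np.eye(2)
eigvals_reg = np.linalg.eigvalsh(cov_reg)
print("regularized eigenvalues:", eigvals_reg)
```

After regularization the smallest eigenvalue equals the added constant, which is one way a fitted model can end up with a visibly nonzero spread across the line even though the data have none.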

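I would also recommend using plt.scatter rather than plt.plot when plotting the data, since there is no reason to connect the points when fitting their joint distribution.

I hope this helps shed some light on your problem.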

Answer 2

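Edit: This is incorrect. After talking with Ronald P., you cannot get a Gibbs effect, because the Gaussians cannot compensate for one another by "going negative": probabilities are strictly greater than 0. This appears to be a simple plotting issue; see his answer! Either way, I would recommend testing the GMM with 2D data rather than a 1D line.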

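~~The GMM is fitting the data you have given it, specifically:~~

xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)

~~Since the data ends at 0 and 1, the GMM tries to model that fact: -.01 and 1.01 are technically outside the range of the training data, so they should be scored with extremely low probability.~~ ~~In doing so it ends up creating a Gaussian with a narrow spread (small covariance / high precision) to cover each end of the data and model the fact that the data stops.~~

~~I would expect that adding enough Gaussians produces something like a pseudo-Gibbs phenomenon, and you can see this happening in the change from 5 to 99. To model the edge exactly you would need an infinite mixture model.~~ ~~This is analogous to infinite frequency components: in a GMM you are also representing a "signal" with a set of basis functions (Gaussians in this case)!~~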
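
The correction at the top of this answer can be checked numerically. A mixture density is a weighted sum of Gaussian densities with nonnegative weights, so it cannot dip below zero the way a truncated Fourier series does near a discontinuity. The sketch below evaluates a 1D mixture whose means, width, and equal weights are all made up for illustration, and confirms that its minimum is nonnegative.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2 * np.pi))

# Hypothetical mixture: 20 equally weighted components spread over [0, 1].
means = np.linspace(0, 1, 20)
std = 0.03
weights = np.full(means.size, 1.0 / means.size)

x = np.linspace(-0.5, 1.5, 2001)
# Sum of weight_k * N(x | mean_k, std) over all components.
density = sum(w * gaussian_pdf(x, m, std) for w, m in zip(weights, means))

print("minimum density:", density.min())
# Riemann-sum check that the mixture still integrates to roughly 1.
area = density.sum() * (x[1] - x[0])
print("approximate area:", area)
```

Every term in the sum is nonnegative, so adding more components can sharpen the edges at 0 and 1 but cannot create the over- and undershoot that characterizes Gibbs ringing.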

Answer 1: 4 votes, accepted, 2014-06-24 07:19:33

Answer 2: 1 vote, 2014-06-20 07:41:10
