
Check if one sample exists in the sample space (from PCA or other cluster analysis)

I have a 200 by 50 matrix, where 200 means 200 compounds (rows) and 50 means 50 independent variables (columns). I then use this 200 * 50 matrix to do cluster analysis (e.g. k-means), and I can get a plot showing the distribution of these 200 compounds.

My question is: when I have a new compound, which has the same 50 independent variables as the 200 * 50 matrix, how can I test whether the new compound is located in the cluster space?

Thanks.

Edit: Please note that I do not need to find the element in the data.frame. I think the first step is to cluster the data (for example, using PCA and plot(pca1, pca2)), then test whether the new record is located inside or outside the plot. Like this picture, where (2) belongs to the cluster and (1) does not belong to the cluster space.
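For illustration, a minimal sketch of that idea in R, assuming the 200 * 50 matrix is stored in a variable compounds and the new compound is a length-50 vector new_cmpd (both names are hypothetical):

pca    <- prcomp(compounds, scale. = TRUE)                 # PCA on the 200 x 50 matrix
new.pc <- predict(pca, matrix(new_cmpd, nrow = 1,          # project the new compound
                              dimnames = list(NULL, colnames(compounds))))
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")   # the 200 existing compounds
points(new.pc[1, 1], new.pc[1, 2], col = "red", pch = 19)  # where the new compound falls

The remaining question is how to decide, from those coordinates, whether the new point is inside or outside the clusters.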

So here is a different (but conceptually similar) approach, along with a cautionary tale. Since you did not provide any data, I'll use the built-in mtcars dataset for the example.

First, we set up the data, run principal components analysis, and run a K-means cluster analysis.

set.seed(5)                                   # for reproducible example
df    <- mtcars[,c(1,3,4,5,6,7)]              # subset of mtcars dataset
trn   <- sample(1:nrow(df),nrow(df)-3)          
train <- mtcars[trn,]                         # training set: 29 obs.
test  <- mtcars[-trn,]                        # test set: 3 obs.

pca <- prcomp(train, scale.=T, retx=T)        # pca on training set
summary(pca)$importance[3,1:4]                # 84% of variation in first 2 PC
# PC1     PC2     PC3     PC4 
# 0.60268 0.83581 0.89643 0.92139             
scores <- data.frame(pca$x)[1:2]              # so use first two PC
km     <- kmeans(scores,centers=3,nstart=25)  # kmeans cluster analysis

pc.test  <- predict(pca,test)[,1:2]           # transform the test set
pc.test  <- rbind(pc.test,c(-1.25,-1))        # add "special point"
rownames(pc.test) <- c(LETTERS[1:3],"X")      # letters to make things simpler

Now, we plot the PC, the centroids, and the test set.

library(ggplot2)
# plot first two PC with cluster id
gg.train  <- data.frame(cluster=factor(km$cluster), scores)
centroids <- aggregate(cbind(PC1,PC2)~cluster,data=gg.train,mean)
gg.train  <- merge(gg.train,centroids,by="cluster",suffixes=c("",".centroid"))
gg.test   <- data.frame(pc.test[,1:2])
# generate cluster plot...
cluster.plot <- ggplot(gg.train, aes(x=PC1, y=PC2, color=cluster)) +
  geom_point(size=3) +
  geom_point(data=centroids, size=4) +
  geom_segment(aes(x=PC1.centroid, y=PC2.centroid, xend=PC1, yend=PC2))+
  geom_point(data=gg.test,color="purple",size=8,shape=1)+
  geom_text(data=gg.test,label=rownames(gg.test),color="purple")+
  coord_fixed()
plot(cluster.plot)

Based on a visual examination, we'd likely place B and C in cluster 3 (the blue cluster) and A in cluster 1 (red). X is questionable (intentionally; that's what makes it "special"). But notice that if we assign to clusters based on proximity to the centroids, we would put A in cluster 3!

# "hard" prediction: assign to whichever cluster has closest centroid
predict.cluster <- function(z) {
  closest <-function(z)which.min(apply(km$centers,1,function(x,z)sum((x-z)^2),z))
  data.frame(pred.clust=apply(z,1,closest))
}
predict.cluster(pc.test)
#   pred.clust
# A          3
# B          3
# C          3
# X          2

So a different approach calculates the probability of cluster membership based on both distance from the centroid and a measure of scatter (how tightly grouped the points in the cluster are). This approach requires that we assume a distribution, which is risky, especially with a small number of points.

The simplest approach is to assume that the points in a given cluster follow a multivariate normal distribution. Under this assumption,

(x − μ)' Σ⁻¹ (x − μ) ~ χ²(k)

That is, a random variable formed as above is distributed as chi-squared with k degrees of freedom (where k is the number of dimensions, here 2). Here, x is a point under consideration for membership in a cluster, μ is the cluster centroid, Σ is the covariance matrix for the points in the cluster, and χ² is the chi-squared statistic with k degrees of freedom at probability 1 − α.
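As a quick sanity check of this formula, base R's mahalanobis() computes exactly this quadratic form, so the membership probability of test point A in cluster 1 can be sketched as:

d.sq <- mahalanobis(pc.test["A", , drop = FALSE],            # test point A
                    center = km$centers[1, ],                # centroid of cluster 1
                    cov    = cov(scores[km$cluster == 1, ])) # scatter of cluster 1
pchisq(d.sq, df = 2, lower.tail = FALSE)                     # ~0.092, matching the table below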

We can use this to calculate the probability of membership by applying this equation, for a given x (in the test set), to calculate α. We can also use this to calculate the "cluster boundaries" by calculating the set of points x which meet this condition for a given α. This latter exercise results in a confidence region of probability 1 − α. Fortunately, this is already implemented in R (for 2 dimensions) using ellipse(...) in the ellipse package.

library(ellipse)
conf.rgn  <- do.call(rbind,lapply(1:3,function(i)
  cbind(cluster=i,ellipse(cov(scores[km$cluster==i,]),centre=km$centers[i,]))))
conf.rgn  <- data.frame(conf.rgn)
conf.rgn$cluster <- factor(conf.rgn$cluster)
plot(cluster.plot + geom_path(data=conf.rgn, aes(x=PC1,y=PC2)))

Based on this we would assign A to cluster 1 (red), even though it is closer to cluster 3 (blue). This is because cluster 3 is much more tightly grouped, so the hurdle for membership is higher. Note that X is "outside" of all the clusters.

The code below calculates the probability of membership in each cluster for a given set of test points.

# "soft" prediction: probability that point belongs in each cluster
pclust <- function(point,km,df){
  get.p <- function(clust,x){
    d         <- as.numeric(x-km$centers[clust,])
    sigma.inv <- solve(cov(df[km$cluster==clust,]))
    X.sq      <- d %*% sigma.inv %*% d
    p         <- pchisq(X.sq,length(d),lower.tail=FALSE)
  }
  sapply(1:max(km$cluster),get.p,x=point)
}
p <- apply(pc.test,1,pclust, km=km, df=scores)
print(p)
#                 A            B            C            X
# [1,] 9.178631e-02 6.490108e-04 9.969140e-07 8.754585e-04
# [2,] 1.720396e-28 4.391488e-26 2.821694e-43 3.630565e-05
# [3,] 2.664676e-05 8.928103e-01 8.660860e-02 2.188450e-05

Here the value in the i-th row is the probability of membership in cluster i. So we can see that there is a 9.2% probability that A belongs in cluster 1, while the probability of membership in the other clusters is less than 0.003%. Similarly, B and C clearly belong in cluster 3 (p = 89.2% and 8.6% respectively). Finally, we can identify the most likely clusters as follows:

data.frame(t(sapply(data.frame(p),function(x)
     list(cluster=which.max(x),p.value=x[which.max(x)]))))
#   cluster      p.value
# A       1   0.09178631
# B       3    0.8928103
# C       3    0.0866086
# X       1 0.0008754585

By assigning a cutoff value of p (say, 0.05), we can assert that a point does not belong in the "cluster space" (using your terminology) if the most likely cluster has p.value < cutoff.
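For example, a minimal sketch of applying such a cutoff to the p matrix computed above:

cutoff <- 0.05
best.p <- apply(p, 2, max)     # p-value of the most likely cluster for each test point
best.p < cutoff                # TRUE means the point is outside the "cluster space"
#     A     B     C     X
# FALSE FALSE FALSE  TRUE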

The cautionary tale is that X gets excluded based on this analysis, even though it is quite near the grand mean of PC1 and PC2. This is because, while X sits in the middle of the dataset, it is in an "empty" region where there are no clusters. Does this mean it should be excluded?

Here is a simple solution:

Step 1: Set up the data

set.seed(1)
refData <- data.frame(matrix(runif(200*50),nrow=200))

newRec01 <- refData[11,]    # A record that exists in data
newRec02 <- runif(50)       # A record that does not exist in data

Step 2: Testing:

TRUE %in% sapply(1:nrow(refData),function(i) all(newRec01 == refData[i,]))
TRUE %in% sapply(1:nrow(refData),function(i) all(newRec02 == refData[i,]))

If needed you can package it in a function:

checkNewRec <- function(refData, newRec) {
  TRUE %in% sapply(1:nrow(refData),function(i) all(newRec == refData[i,]))
}

checkNewRec(refData, newRec01)
checkNewRec(refData, newRec02)

EDIT: Based on your new input below, try the following:

Prep: Your code from the comments:

  ALL <- rbind(refData, newRec02) 

  pca <- prcomp(ALL) 
  pca1 <- pca$x[, 1] 
  pca2 <- pca$x[, 2] 
  pca1.in <- pca1[-length(pca1)]
  pca2.in <- pca2[-length(pca2)]

Now we need to define the cluster in some way. For simplicity, let's assume a single cluster.

Step 1: Find the centroid of refData:

  cent <- c(mean(pca1.in),mean(pca2.in))

Step 2: Find the squared distance of all the data points from the center of refData:

  ssq <- (pca1 - mean(pca1.in))^2 + (pca2 - mean(pca2.in))^2

Step 3: Now we need to choose a cutoff distance from the center, beyond which a new incoming record will be considered "outside" the cluster. For simplicity, I am taking the cutoff to be at the 95th percentile:

  dec <- (quantile(head(ssq,-1), 0.95) > tail(ssq,1))  # TRUE if the new record's squared distance is below the 95th percentile of the reference data

Step 4: Now that a decision has been made on the classification of newRec, we can plot it:

  plot(pca1, pca2) 
  points(pca1[length(pca1)], pca2[length(pca2)], 
         col = ifelse(dec, "red", "green"),pch="X")

Additionally, to verify our decision, let's plot the errors and see where newRec falls:

  hist(ssq, main="Error Histogram", xlab="Square Error")
  points(tail(ssq,1), 0,
         col = ifelse(dec, "red", "green"), pch="X")
  text(tail(ssq,1), 0, labels="New Rec", col="red", pos=3)

Hope this helps!!

This post follows jihoward's answer, extending it to Python and showing some interesting cluster assignment conditions under multivariate Gaussians. The generated data is sampled from three 2D Gaussian distributions, with means and covariances provided in the source code. Points W, X, Y, Z are used to describe soft assignment to clusters. (Figure: clustering on the manually created points and on synthetic data generated from the three multivariate Gaussians.) We assume that each cluster has a chi-squared distribution with 2 degrees of freedom.

In the plot, the shaded areas represent 2 standard deviations from the mean. Note that, as expected, X does not belong to any cluster. Although Y is closer to the green centroid, Y is not assigned to the green cluster given its distribution. The consequences of using hard thresholding for cluster assignment are shown in the blue cluster: note how points outside the 0.05 cutoff value would be classified as not belonging to the blue cluster.

Probabilities assuming a chi-squared distribution:

         Blue              Red             Green
W [  1.50465863e-01   0.00000000e+00   0.00000000e+00]
X [  2.44710474e-10   1.20447952e-05   0.00000000e+00]
Y [  0.00000000e+00   0.00000000e+00   0.00000000e+00]
Z [  0.00000000e+00   9.91055078e-01   0.00000000e+00]

Knowing that the data is multivariate Gaussian, one could use scipy's implementation of Alan Genz's multivariate normal CDF functions. I was not able to get convincing results from it using this example. For more details on scipy's implementation, check this link.

import numpy as np
import matplotlib.pylab as plt
from matplotlib.patches import Ellipse
from sklearn.cluster import KMeans
from scipy import stats
chi2_cdf = stats.chi2.cdf
plt.ion()

def eigenDecomposition(cov_mat):
    vals, vecs = np.linalg.eigh(cov_mat)
    order = vals.argsort()[::-1]
    return vals[order], vecs[:,order]


def plotEllipse(center, cov_mat, n_std, color):
    vals, vecs = eigenDecomposition(cov_mat)
    angle = np.degrees(np.arctan2(*vecs[:,0][::-1]))
    width, height = 2 * n_std * np.sqrt(vals)
    return Ellipse(xy=center, width=width, height=height, angle=angle, color=color, alpha=0.2)


def computeMembership(point, center, data):
    # (x - mu).T cov.inv (x - mu)
    cov_mat = np.cov(data.T)
    dist = np.array([point - center]).T
    X_sq = np.dot(dist.T, np.dot(np.linalg.inv(cov_mat), dist))
    return 1 - chi2_cdf(X_sq, len(center))[0][0]    

n_obs = 128
a = np.random.multivariate_normal((0, 0), [[1, 0], [0, 1]], n_obs)
b = np.random.multivariate_normal((10, 0), [[1, -0.9], [-0.9, 1]], n_obs)
c = np.random.multivariate_normal((10, 10), [[1, 0.05], [0.05, 1]], n_obs)
d = np.array([[0,2], [5, 5], [10, 9.5], [10, 0]])

markers = [r"$ {} $".format(lbl) for lbl in ('W', 'X', 'Y', 'Z')]
clustering = KMeans(n_clusters=3).fit(np.vstack((a, b, c)))
_, idx = np.unique(clustering.labels_, return_index=True)
ids = clustering.labels_[np.sort(idx)]
colors = 'rgb'

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(a[:,0], a[:,1], color=colors[ids[0]])
ax.scatter(b[:,0], b[:,1], color=colors[ids[1]])
ax.scatter(c[:,0], c[:,1], color=colors[ids[2]])
for i in range(len(d)):
    ax.scatter(d[i,0], d[i,1], color='k', s=128, marker=markers[i])
ax.scatter(clustering.cluster_centers_[:,0], clustering.cluster_centers_[:,1], color='k', marker='D')

# plot ellipses with 2 std
n_std = 2
probs = []
for i, data in enumerate((a, b, c)):
    ax.add_artist(plotEllipse(clustering.cluster_centers_[ids[i]], 
                              np.cov(data.T), 
                              n_std, 
                              color=colors[ids[i]]))
    probs.append([computeMembership(x, clustering.cluster_centers_[ids[i]], data) for x in d])
print(np.array(probs).T)
