Check if one sample exists in the sample space (From PCA or other cluster analysis)

I have a 200 by 50 matrix, where the 200 rows are compounds and the 50 columns are independent variables. I use this 200 * 50 matrix for cluster analysis (e.g. k-means), and I can get a plot showing the distribution of these 200 compounds.

My question is: when I have a new compound that has the same 50 independent variables as the 200 * 50 matrix, how can I test whether the new compound is located within the cluster space?

Thanks.

Edit: Please note that I do not need to find the element in the data.frame. I think the first step is to cluster the data (for example, using PCA and plot(pca1, pca2)), and then test whether the new record falls inside or outside the plot. Like in this picture, where (2) belongs to the cluster space and (1) does not.

So here is a different (but conceptually similar) approach, along with a cautionary tale. Since you did not provide any data, I'll use the built-in mtcars dataset for the example.

First, we set up the data, run principal components analysis, and run a K-means cluster analysis.

set.seed(5)                                   # for reproducible example
df    <- mtcars[,c(1,3,4,5,6,7)]              # subset of mtcars dataset
trn   <- sample(1:nrow(df),nrow(df)-3)          
train <- mtcars[trn,]                         # training set: 29 obs.
test  <- mtcars[-trn,]                        # test set: 3 obs.

pca <- prcomp(train, scale.=T, retx=T)        # pca on training set
summary(pca)$importance[3,1:4]                # 84% of variation in first 2 PC
# PC1     PC2     PC3     PC4 
# 0.60268 0.83581 0.89643 0.92139             
scores <- data.frame(pca$x)[1:2]              # so use first two PC
km     <- kmeans(scores,centers=3,nstart=25)  # kmeans cluster analysis

pc.test  <- predict(pca,test)[,1:2]           # transform the test set
pc.test  <- rbind(pc.test,c(-1.25,-1))        # add "special point"
rownames(pc.test) <- c(LETTERS[1:3],"X")      # letters to make things simpler

Now, we plot the PC, the centroids, and the test set.

library(ggplot2)
# plot first two PC with cluster id
gg.train  <- data.frame(cluster=factor(km$cluster), scores)
centroids <- aggregate(cbind(PC1,PC2)~cluster,data=gg.train,mean)
gg.train  <- merge(gg.train,centroids,by="cluster",suffixes=c("",".centroid"))
gg.test   <- data.frame(pc.test[,1:2])
# generate cluster plot...
cluster.plot <- ggplot(gg.train, aes(x=PC1, y=PC2, color=cluster)) +
  geom_point(size=3) +
  geom_point(data=centroids, size=4) +
  geom_segment(aes(x=PC1.centroid, y=PC2.centroid, xend=PC1, yend=PC2))+
  geom_point(data=gg.test,color="purple",size=8,shape=1)+
  geom_text(data=gg.test,label=rownames(gg.test),color="purple")+
  coord_fixed()
plot(cluster.plot)

Based on a visual examination, we'd likely place B and C in cluster 3 (the blue cluster) and A in cluster 1 (red). X is questionable (intentionally; that's what makes it "special"). But notice that if we assign to clusters based on proximity to the centroids, we would put A in cluster 3!

# "hard" prediction: assign to whichever cluster has closest centroid
predict.cluster <- function(z) {
  closest <-function(z)which.min(apply(km$centers,1,function(x,z)sum((x-z)^2),z))
  data.frame(pred.clust=apply(z,1,closest))
}
predict.cluster(pc.test)
#   pred.clust
# A          3
# B          3
# C          3
# X          2

So a different approach calculates the probability of cluster membership based on both the distance from the centroid and a measure of scatter (how tightly grouped the points in the cluster are). This approach requires that we assume a distribution, which is risky, especially with a small number of points.

The simplest approach is to assume that the points in a given cluster follow a multivariate normal distribution. Under this assumption,

(x - μ)ᵀ Σ⁻¹ (x - μ) ≤ χ²_k(1 - α)

That is, the random variable formed on the left-hand side is distributed as chi-squared with k degrees of freedom (where k is the number of dimensions, here 2). Here, x is a point under consideration for membership in a cluster, μ is the cluster centroid, Σ is the covariance matrix for the points in the cluster, and χ²_k(1 - α) is the chi-squared quantile with k degrees of freedom at probability 1 - α.
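To make this concrete, here is a minimal sketch (my own illustration, not part of the original code) that computes the statistic for one test point against one cluster using base R's mahalanobis(), which evaluates (x - μ)ᵀ Σ⁻¹ (x - μ) directly; it should agree with the corresponding entry produced by pclust() further down.

x.A   <- pc.test["A",]                  # test point A (first two PC scores)
mu1   <- km$centers[1,]                 # centroid of cluster 1
Sig1  <- cov(scores[km$cluster==1,])    # scatter of the points in cluster 1
d2    <- mahalanobis(x.A, mu1, Sig1)    # (x - mu)' Sigma^-1 (x - mu)
pchisq(d2, df=2, lower.tail=FALSE)      # alpha: probability of membership in cluster 1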

We can use this to calculate the probability of membership by applying this equation, for a given x (in the test set), to calculate α. We can also use it to calculate the "cluster boundaries" by finding the set of points x which meet this condition for a given α. This latter exercise yields a confidence region of probability 1 - α. Fortunately, this is already implemented in R (for 2 dimensions) as ellipse(...) in the ellipse package.

library(ellipse)
conf.rgn  <- do.call(rbind,lapply(1:3,function(i)
  cbind(cluster=i,ellipse(cov(scores[km$cluster==i,]),centre=km$centers[i,]))))
conf.rgn  <- data.frame(conf.rgn)
conf.rgn$cluster <- factor(conf.rgn$cluster)
plot(cluster.plot + geom_path(data=conf.rgn, aes(x=PC1,y=PC2)))

Based on this we would assign A to cluster 1 (red), even though it is closer to cluster 3 (blue). This is because cluster 3 is much more tightly grouped, so the hurdle for membership is higher. Note that X is "outside" of all the clusters.

The code below calculates the probability of membership in each cluster for a given set of test points.

# "soft" prediction: probability that point belongs in each cluster
pclust <- function(point,km,df){
  get.p <- function(clust,x){
    d         <- as.numeric(x-km$centers[clust,])
    sigma.inv <- solve(cov(df[km$cluster==clust,]))
    X.sq      <- d %*% sigma.inv %*% d
    p         <- pchisq(X.sq,length(d),lower.tail=FALSE)
  }
  sapply(1:max(km$cluster),get.p,x=point)
}
p <- apply(pc.test,1,pclust, km=km, df=scores)
print(p)
#                 A            B            C            X
# [1,] 9.178631e-02 6.490108e-04 9.969140e-07 8.754585e-04
# [2,] 1.720396e-28 4.391488e-26 2.821694e-43 3.630565e-05
# [3,] 2.664676e-05 8.928103e-01 8.660860e-02 2.188450e-05

Here the value in the i-th row is the probability of membership in cluster i. So we can see that there is a 9.2% probability that A belongs in cluster 1, while the probability of membership in the other clusters is less than 0.003%. Similarly, B and C clearly belong in cluster 3 (p = 89.2% and 8.7%, respectively). Finally, we can identify the most likely cluster for each point as follows:

data.frame(t(sapply(data.frame(p),function(x)
     list(cluster=which.max(x),p.value=x[which.max(x)]))))
#   cluster      p.value
# A       1   0.09178631
# B       3    0.8928103
# C       3    0.0866086
# X       1 0.0008754585

By choosing a cutoff value for p (say, 0.05), we can assert that a point does not belong in the "cluster space" (to use your terminology) if the p.value of its most likely cluster is below that cutoff.
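
As a minimal sketch of that rule (the helper name is mine, not part of the original answer), applied to the probability matrix p computed above:

in.cluster.space <- function(p, cutoff=0.05) {
  # a point is "in the cluster space" if its most likely cluster clears the cutoff
  apply(p, 2, max) >= cutoff
}
in.cluster.space(p)
#    A    B    C     X
# TRUE TRUE TRUE FALSE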

The cautionary tale is that X gets excluded by this analysis, even though it is quite near the grand mean of PC1 and PC2. This is because, while X sits in the middle of the dataset, it lies in an "empty" region where there are no clusters. Does this mean it should be excluded?

Here is a simple solution:

Step1: Set up the data

set.seed(1)
refData <- data.frame(matrix(runif(200*50),nrow=200))

newRec01 <- refData[11,]    # A record that exists in data
newRec02 <- runif(50)       # A record that does not exist in data

Step2: Testing:

TRUE %in% sapply(1:nrow(refData),function(i) all(newRec01 == refData[i,]))
TRUE %in% sapply(1:nrow(refData),function(i) all(newRec02 == refData[i,]))

If needed you can package it in a function:

checkNewRec <- function(refData, newRec) {
  TRUE %in% sapply(1:nrow(refData),function(i) all(newRec == refData[i,]))
}

checkNewRec(refData, newRec01)
checkNewRec(refData, newRec02)

EDIT: Based on your new input below, try the following:

Prep: Your code from the comments:

  ALL <- rbind(refData, newRec02) 

  pca <- prcomp(ALL) 
  pca1 <- pca$x[, 1] 
  pca2 <- pca$x[, 2] 
  pca1.in <- pca1[-length(pca1)]
  pca2.in <- pca2[-length(pca2)]

Now we need to define the cluster in some way. For simplicity, let's assume a single cluster.

Step1: Find out the centroid of the refData:

  cent <- c(mean(pca1.in),mean(pca2.in))

Step2: Find out the distance of all the data points from the center of refData:

  ssq <- (pca1 - mean(pca1.in))^2 + (pca2 - mean(pca2.in))^2

Step3: Now we need to choose a cutoff distance from the center, beyond which a new incoming record will be considered "outside" the cluster. For simplicity, I am taking the cutoff to be the 95th percentile of the squared distances:

  dec <- (quantile(head(ssq,-1), 0.95) > tail(ssq,1))   # TRUE if the new record falls within the 95th percentile

Step4: Now that a decision has been made on the classification of newRec, we can plot it:

  plot(pca1, pca2) 
  points(pca1[length(pca1)], pca2[length(pca2)], 
         col = ifelse(dec, "red", "green"),pch="X")

Additionally, to verify our decision, let's plot the histogram of squared errors and see where newRec falls:

  hist(ssq, main="Error Histogram", xlab="Square Error")
  points(tail(ssq,1), 0,
         col = ifelse(dec, "red", "green"), pch="X")
  text(tail(ssq,1), 0, labels="New Rec", col="red", pos=3)

Hope this helps!!

This post follows jihoward's answer, extending it to Python and showing some interesting cluster-assignment situations under multivariate Gaussians. The generated data is sampled from three 2D Gaussian distributions, with means and covariances given in the source code. Points W, X, Y, Z are manually created and used to illustrate soft assignment to clusters; the figure shows the clustering over these points together with the synthetic data generated from the three multivariate Gaussians. We assume that, within each cluster, the squared Mahalanobis distance from the centroid follows a chi-squared distribution with 2 degrees of freedom.

In the plot, the shaded areas represent 2 standard deviations from the mean. Note that, as expected, X does not belong to any cluster. Although Y is closer to the green centroid, Y is not assigned to the green cluster, given that cluster's distribution. The consequences of using hard thresholding for cluster assignment are shown in the blue cluster: points outside the 0.05 cutoff would be classified as not belonging to the blue cluster.

Probabilities assuming a chi-squared distribution.

         Blue              Red             Green
W [  1.50465863e-01   0.00000000e+00   0.00000000e+00]
X [  2.44710474e-10   1.20447952e-05   0.00000000e+00]
Y [  0.00000000e+00   0.00000000e+00   0.00000000e+00]
Z [  0.00000000e+00   9.91055078e-01   0.00000000e+00]

Knowing that the data is multivariate gaussian, one could use scipy's implementation of Alan Genz's multivariate normal CDF functions. I was not able to get convincing results from it using this example. For more details on scipy's implementation check this link.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.cluster import KMeans
from scipy import stats
chi2_cdf = stats.chi2.cdf
plt.ion()

def eigenDecomposition(cov_mat):
    vals, vecs = np.linalg.eigh(cov_mat)
    order = vals.argsort()[::-1]
    return vals[order], vecs[:,order]


def plotEllipse(center, cov_mat, n_std, color):
    vals, vecs = eigenDecomposition(cov_mat)
    angle = np.degrees(np.arctan2(*vecs[:,0][::-1]))
    width, height = 2 * n_std * np.sqrt(vals)
    return Ellipse(xy=center, width=width, height=height, angle=angle, color=color, alpha=0.2)


def computeMembership(point, center, data):
    # (x - mu).T cov.inv (x - mu)
    cov_mat = np.cov(data.T)
    dist = np.array([point - center]).T
    X_sq = np.dot(dist.T, np.dot(np.linalg.inv(cov_mat), dist))
    return 1 - chi2_cdf(X_sq, len(center))[0][0]    

n_obs = 128
a = np.random.multivariate_normal((0, 0), [[1, 0], [0, 1]], n_obs)
b = np.random.multivariate_normal((10, 0), [[1, -0.9], [-0.9, 1]], n_obs)
c = np.random.multivariate_normal((10, 10), [[1, 0.05], [0.05, 1]], n_obs)
d = np.array([[0,2], [5, 5], [10, 9.5], [10, 0]])

markers = [r"$ {} $".format(lbl) for lbl in ('W', 'X', 'Y', 'Z')]
clustering = KMeans(n_clusters=3).fit(np.vstack((a, b, c)))
_, idx = np.unique(clustering.labels_, return_index=True)
ids = clustering.labels_[np.sort(idx)]
colors = 'rgb'

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(a[:,0], a[:,1], color=colors[ids[0]])
ax.scatter(b[:,0], b[:,1], color=colors[ids[1]])
ax.scatter(c[:,0], c[:,1], color=colors[ids[2]])
for i in range(len(d)):
    ax.scatter(d[i,0], d[i,1], color='k', s=128, marker=markers[i])
ax.scatter(clustering.cluster_centers_[:,0], clustering.cluster_centers_[:,1], color='k', marker='D')

# plot ellipses with 2 std
n_std = 2
probs = []
for i, data in enumerate((a, b, c)):
    ax.add_artist(plotEllipse(clustering.cluster_centers_[ids[i]], 
                              np.cov(data.T), 
                              n_std, 
                              color=colors[ids[i]]))
    probs.append([computeMembership(x, clustering.cluster_centers_[ids[i]], data) for x in d])
print(np.array(probs).T)
