Bootstrapping with a self-defined distance function in pvclust does not work

Question

I am using sequence analysis methods to measure the similarity between different "sequences of spatial use", represented as strings of characters. Here is a theoretical example with three classes (A: City, B: Agriculture, C: Mountain) for two sequences:

  Time steps:    t1, t2, ..., tx
  Individual 1:  A A A B B B C C
  Individual 2:  A B B B A A C C
  Mismatch:      0 1 1 0 1 1 0 0  = 4

The distance measure we use is the Hamming distance, i.e. the number of positions at which a character must be substituted to make two sequences identical; in the example above, 4 characters need to be substituted. From the Hamming distances we obtain a distance matrix (giving the distance, or dissimilarity, for every possible pair of sequences), and from this matrix a dendrogram was created using Ward's clustering method (ward.D2).
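As a minimal base R sketch of the distance described above (not the TraMineR-based code used later in this post), the Hamming distance of the two example sequences can be computed by hand and fed into hclust. The function name hamming_distance and the third sequence ind3 are hypothetical and only serve the illustration.

```r
# Hamming distance: number of positions at which two equal-length strings differ
hamming_distance <- function(s1, s2) {
  stopifnot(nchar(s1) == nchar(s2))
  sum(strsplit(s1, "")[[1]] != strsplit(s2, "")[[1]])
}

hamming_distance("AAABBBCC", "ABBBAACC")  # 4, as in the example above

# Pairwise distance matrix; ind3 is a made-up extra sequence
seqs <- c(ind1 = "AAABBBCC", ind2 = "ABBBAACC", ind3 = "CCCBBBAA")
n <- length(seqs)
d <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- hamming_distance(seqs[i], seqs[j])
  }
}

# Ward clustering on the dissimilarity matrix
hc <- hclust(as.dist(d), method = "ward.D2")
plot(hc)
```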

Now I would also like to include a good measure of cluster robustness in order to identify relevant clusters. For this I tried to use pvclust, which offers several methods for calculating bootstrap values but supports only a limited set of distance measures. With the unreleased version of pvclust I tried to implement the right distance measure (i.e. the Hamming distance) and to create a bootstrapped tree. The script runs, but the outcome is not correct. Applied to my dataset with nboot set to 1000, the "bp" values are close to 0 and all the other values ("au", "se.au", "se.bp", "v", "c", "pchi") are 0, suggesting that the clusters are artefacts.

Here I provide an example script.

The data are simulated sequences that are very homogeneous (e.g. near-continuous use of one specific state), so each cluster should certainly be significant. I limited the number of bootstrap replicates to only 10 to limit calculation time.

####################################################################
####Create the sequences#### 
dfr = data.frame()
a = list(dfr)
b = list(dfr)
c = list(dfr)
d = list(dfr)
data = list(dfr)

for (i in c(1:10)){
  set.seed(i)
  a[[i]] <- sample(c(rep('A',10),rep('B', 90)))
  b[[i]] <- sample(c(rep('B',10),rep('A', 90)))
  c[[i]] <- sample(c(rep('C',10),rep('D', 90)))
  d[[i]] <- sample(c(rep('D',10),rep('C', 90)))
}
a = as.data.frame(a, header = FALSE)
b = as.data.frame(b, header = FALSE)
c = as.data.frame(c, header = FALSE)
d = as.data.frame(d, header = FALSE)

colnames(a) <- paste(rep('seq_urban'),rep(1:10), sep ='')
colnames(b) <- paste(rep('seq_agric'),rep(1:10), sep ='')
colnames(c) <- paste(rep('seq_mount'),rep(1:10), sep ='')
colnames(d) <- paste(rep('seq_sea'),rep(1:10), sep ='')

data = rbind(t(a),t(b),t(c),t(d))
#####################################################################

####Analysis####
## install packages if necessary
#install.packages(c("TraMineR", "devtools")) 
library(TraMineR)
library(devtools)

source_url("https://www.dropbox.com/s/9znkgks1nuttlxy/pvclust.R?dl=0") # url    to my dropbox for unreleased pvclust package
source_url("https://www.dropbox.com/s/8p6n5dlzjxmd6jj/pvclust-internal.R?dl=0") # url to my dropbox for unreleased pvclust package

dev.new()
par( mfrow = c(1,2))
## Color definitions and alphabet/labels/scodes for sequence definition
palet <- c(rgb(230, 26, 26, max = 255), rgb(230, 178, 77, max = 255),     "blue", "deepskyblue2") # color palet used for the states
s.alphabet <- c("A", "B", "C", "D") # the alphabet of the sequence object
s.labels <- c("country-side", "urban", "sea", "mountains") # the labels of    the sequence object
s.scodes <- c( "A", "U", "S", "M") # the states of the sequence object

## Sequence definition
seq_ <- seqdef(data, # data  
                  1:100, # columns corresponding to the sequence data  
                  id = rownames(data), # id of the sequences
                  alphabet = s.alphabet, states = s.scodes, labels = s.labels, 
                  xtstep = 6, 
                  cpal = palet) # color palet 

##Substitution matrix used to calculate the hamming distance
Autocor <- seqsubm(seq_, method = "TRATE", with.missing = FALSE) 

# Function with the hamming distance (i.e. counts how often a character
# needs to be substituted to equate two sequences to each other).
# Result is a distance matrix giving the distances for each pair of sequences.
hamming <- function(x,...) {
  res <- seqdist(x, method = "HAM",sm = Autocor)
  res <- as.dist(res)
  attr(res, "method") <- "hamming"
  return(res)
}

## Perform the bootstrapping using the distance method "hamming"
result <- pvclust(seq_, method.dist = hamming, nboot = 10, method.hclust =  "ward")
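# NOTE: the original post used rownames(test[,1]), but `test` is never defined;
# `data` (the sequence matrix built above) is assumed to be what was meant
result$hclust$labels <- rownames(data)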
plot(result)

For this analysis I am using the unreleased version of the R package pvclust, which allows you to supply your own distance method (in this case: hamming). Does anybody have an idea how to solve this problem?

Answer 1

The aim of pvclust is to cluster variables (or attributes), not cases. This is why your results do not make sense. You can try:

data(iris)
res <- pvclust(iris[, 1:4])
plot(res)
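To make the variables-versus-cases distinction concrete, here is a small base R sketch that does not call pvclust itself. It assumes pvclust's documented defaults (correlation distance between columns, average linkage) and contrasts that with an ordinary clustering of the rows. The object names hc_vars and hc_cases are illustrative.

```r
x <- iris[, 1:4]

# Column-wise clustering, in the spirit of pvclust's defaults:
# dissimilarity between variables = 1 - correlation, average linkage
hc_vars <- hclust(as.dist(1 - cor(x)), method = "average")

# Row-wise clustering: dissimilarity between cases (flowers)
hc_cases <- hclust(dist(x), method = "average")

length(hc_vars$order)   # 4 leaves: one per variable
length(hc_cases$order)  # 150 leaves: one per case
```

In the question, the sequences are the rows of the data, so a tool that clusters columns is answering a different question from the one being asked.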

To test the stability of a clustering of cases, you can use clusterboot from the fpc package. See my answer here: Measuring reliability of tree/dendrogram (Traminer)

In your example, you could use:

library(fpc)
ham <- seqdist(seq_, method="HAM",sm = Autocor)
cf2 <- clusterboot(as.dist(ham), clustermethod=disthclustCBI, k=4, cut="number", method="ward.D")

Using, for instance, k=10 will give bad results, because your data really have 4 clusters (by construction).
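The idea behind a case-wise bootstrap can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not fpc's implementation: cases are resampled with replacement, duplicates are dropped, the subset is re-clustered, and each original cluster is scored by its best Jaccard similarity to a bootstrap cluster. The helper names (make_group, hamming_matrix, cluster_cases, jaccard) and the choice of 20 replicates are hypothetical.

```r
set.seed(1)

# Toy data mimicking the question: 4 groups of 10 sequences of length 100
make_group <- function(minor, major, n = 10) {
  t(replicate(n, sample(c(rep(minor, 10), rep(major, 90)))))
}
seqs <- rbind(make_group("A", "B"), make_group("B", "A"),
              make_group("C", "D"), make_group("D", "C"))

# Pairwise Hamming distances between rows (number of mismatching positions)
hamming_matrix <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) d[i, j] <- sum(m[i, ] != m[j, ])
  d
}
d <- hamming_matrix(seqs)

# Ward clustering of cases, cut into k groups
cluster_cases <- function(dmat, k) {
  cutree(hclust(as.dist(dmat), method = "ward.D2"), k = k)
}
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

k <- 4
orig <- cluster_cases(d, k)
n_boot <- 20
stab <- matrix(NA_real_, n_boot, k)
for (b in seq_len(n_boot)) {
  idx <- unique(sample(nrow(seqs), replace = TRUE))  # resampled cases
  boot <- cluster_cases(d[idx, idx], k)
  for (cl in seq_len(k)) {
    members <- intersect(which(orig == cl), idx)
    stab[b, cl] <- max(sapply(seq_len(k), function(g) {
      jaccard(members, idx[boot == g])
    }))
  }
}

# Mean Jaccard similarity per original cluster; values near 1 indicate stability
colMeans(stab)
```

With clusterboot itself, the comparable per-cluster summary is stored in the bootmean component of the returned object (here cf2$bootmean).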

Answer 1
1 2015-03-26 22:22:11
