
Calculating p-value from pseudo-F in R

I'm working with a very large dataset with 132,019 observations of 18 variables. I've used the clusterSim package to calculate the pseudo-F statistic on clusters created using Kohonen SOMs. I'm trying to assess the various cluster sizes (e.g., 4, 6, 9 clusters) with p-values, but I'm getting weird results and I'm not statistically savvy enough to know what's going on.

I use the following code to get the pseudo-F.

library(clusterSim)
psF6 <- index.G1(yelpInfScale, cl = som.6$unit.classif)
psF6
[1] 48783.4

Then I use the following code to get the p-value. When I use lower.tail = T I get a 1, and when I use lower.tail = F I get a 0.

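k6 = 6
n = nrow(yelpInfScale)  # assumed: n is the number of observations (not defined in the original snippet)
pf(q = psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = FALSE)
[1] 0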

I guess I was expecting not a round number, so I'm confused about how to interpret the results. I get the exact same results regardless of which cluster size I evaluate. I read something somewhere about reversing df1 and df2 in the calculation, but that seems weird. Also, the reference text I'm using (Larose's "Data Mining and Predictive Analytics") uses this to evaluate k-means clusters, so I'm wondering if the problem is that I'm using Kohonen clusters.

I'd check your data, but it's not impossible to get a p-value that prints as 0 or 1. In your case, assuming your data are right, it indicates that the clusters you have created are very well separated relative to the spread within them: a pseudo-F of 48783.4 lies so far out in the upper tail of the F distribution that the tail probability is too small for R to represent, so it is reported as 0. So with lower.tail = FALSE, the p-value of 0 means the separation is overwhelmingly significant under the F reference distribution; it does not mean that your sample is classified with 100% accuracy or that there is no chance of error. lower.tail = TRUE returns the complementary probability, P(F <= q), which is why it gives 1; the two calls describe the same result from opposite tails rather than two different tests. This is presumably also why you see the same output for every cluster size: each pseudo-F is far too large for the p-value to tell the solutions apart.

If I were you, I'd try the 'K-means with splitting' variant with different values of the distance parameter 'w' to see how the data fit. If for some 'w' it fits with very low p-values for the clusters, I don't think a model as complex as SOM is really necessary.
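One way to check whether a printed 0 is an exact zero or just a number too small to represent is to ask pf for the tail probability on the log scale through its log.p argument. The sketch below is only an illustration: it plugs in the pseudo-F and cluster count from the question and assumes n is the 132,019 observations mentioned there (the question never shows how n was defined).

```r
# Values taken from the question; n is an assumption (number of observations).
psF <- 48783.4
k   <- 6
n   <- 132019

# Upper-tail probability on the ordinary scale, as in the question.
p_upper <- pf(psF, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# The same tail probability on the natural-log scale.
log_p <- pf(psF, df1 = k - 1, df2 = n - k, lower.tail = FALSE, log.p = TRUE)

# Convert to base 10 so it can be read as "p is roughly 10^log10_p".
log10_p <- log_p / log(10)

print(p_upper)
print(log10_p)
```

If the log-scale value comes back finite and hugely negative, the 0 is an underflow rather than a special result. In that situation comparing the pseudo-F values themselves across 4, 6, and 9 clusters is likely to be more informative than comparing the p-values.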
