
Calculating p-value from pseudo-F in R

I'm working with a very large dataset with 132,019 observations of 18 variables. I've used the clusterSim package to calculate the pseudo-F statistic on clusters created using Kohonen SOMs. I'm trying to assess the various cluster sizes (e.g., 4, 6, 9 clusters) with p-values, but I'm getting weird results and I'm not statistically savvy enough to know what's going on.

I use the following code to get the pseudo-F.

library(clusterSim)
# Pseudo-F (Calinski-Harabasz) index for the 6-cluster SOM solution
psF6 <- index.G1(yelpInfScale, cl = som.6$unit.classif)
psF6
# [1] 48783.4

Then I use the following code to get the p-value. When I set lower.tail = TRUE I get 1, and when I set lower.tail = FALSE I get 0.

k6 <- 6
n <- nrow(yelpInfScale)  # number of observations, 132,019 here (assumed definition)
pf(q = psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = FALSE)
# [1] 0

I guess I was expecting not a round number, so I'm confused about how to interpret the results. I get the exact same results regardless of which cluster size I evaluate. I read something somewhere about reversing df1 and df2 in the calculation, but that seems weird. Also, the reference text I'm using (Larose's "Data Mining and Predictive Analytics") uses this to evaluate k-means clusters, so I'm wondering if the problem is that I'm using Kohonen clusters.
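For reference, the pseudo-F reported by index.G1 is the Calinski-Harabasz ratio (between-cluster sum of squares / (k - 1)) / (within-cluster sum of squares / (n - k)), which is where df1 = k - 1 and df2 = n - k come from. The sketch below reproduces that ratio by hand with base R's kmeans() on made-up toy data; the data, the seed, and the variable names are illustrative assumptions, not the Yelp data or the SOM from the question.

```r
# Toy data (invented for illustration): three shifted Gaussian blobs in 2-D
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))
km <- kmeans(x, centers = 3, nstart = 10)

n <- nrow(x)          # number of observations
k <- length(km$size)  # number of clusters

# Pseudo-F = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
pseudo_F <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
pseudo_F
```

The numerator is divided by k - 1 and the denominator by n - k, so the same order (df1 = k - 1, df2 = n - k) is the natural one to pass to pf().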

I'd check your data, but it's not impossible to get a p-value of either 0 or 1. In your case, assuming your data are right, it suggests that the data are heavily skewed and that the clusters you have created fit them very well. With lower.tail = FALSE, pf() returns the upper-tail probability, so a value of 0 means the pseudo-F is so large that the p-value is too small to represent in double precision; it does not mean literally zero chance of error. With lower.tail = TRUE you get the complementary lower-tail probability, which is why it comes out as 1; the two calls describe the same result, not a one-tailed versus a two-tailed test. With n this large, that is likely to happen for any of the cluster counts you tried, which would explain why you see the same values every time. If I were you I'd try a 'K-means with splitting' variant with different values of the distance parameter 'w' to see how the data fit. If for some 'w' it fits with very low p-values for the clusters, I don't think a model as complex as a SOM is really necessary.
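One way to see that the 0 is an underflow rather than an exact zero is to ask pf() for the p-value on the log scale with its log.p argument. This is a minimal sketch that plugs in the numbers quoted in the question; n = 132019 is assumed from the stated number of observations, and the variable names are only for illustration.

```r
# Values taken from the question; n is assumed to be the 132,019 observations
psF6 <- 48783.4
n <- 132019
k6 <- 6

# Upper and lower tails of the same F(k - 1, n - k) distribution
p_upper <- pf(psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = FALSE)
p_lower <- pf(psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = TRUE)
p_upper            # reported as 0 in the question
p_upper + p_lower  # the two tails are complements

# Natural log of the upper-tail p-value, computed without leaving the log scale
log_p_upper <- pf(psF6, df1 = k6 - 1, df2 = n - k6,
                  lower.tail = FALSE, log.p = TRUE)
log_p_upper
log_p_upper / log(10)  # the same quantity on a log10 scale
```

Comparing log p-values across cluster counts at least gives distinct numbers, although with this many observations every candidate solution is likely to look overwhelmingly "significant", so the pseudo-F values themselves may be the more useful thing to compare.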


 