Why is distance matrix (dist()) giving empty values for data sets having more than ~50 observations?

I have a data set for which I'm calculating a distance matrix. Below is the data, which has 251 observations.

> str(mydata)
'data.frame':   251 obs. of  7 variables:
 $ BodyFat: num  12.3 6.1 25.3 10.4 28.7 20.9 19.2 12.4 4.1 11.7 ...
 $ Weight : num  154 173 154 185 184 ...
 $ Chest  : num  93.1 93.6 95.8 101.8 97.3 ...
 $ Abdomen: num  85.2 83 87.9 86.4 100 94.4 90.7 88.5 82.5 88.6 ...
 $ Hip    : num  94.5 98.7 99.2 101.2 101.9 ...
 $ Thigh  : num  59 58.7 59.6 60.1 63.2 66 58.4 60 62.9 63.1 ...
 $ Biceps : num  32 30.5 28.8 32.4 32.2 35.7 31.9 30.5 35.9 35.6 ...

I normalize the data.

means = apply(mydata,2,mean) 
sds = apply(mydata,2,sd)    
nor = scale(mydata,center=means,scale=sds)

When I calculate the distance matrix, I see a lot of empty values; moreover, distances appear to be computed for only 4 observations.

distance =dist(nor) 

> str(distance)
 'dist' num [1:31375] 1.33 2.09 1.9 3.08 3.99 ...
 - attr(*, "Size")= int 251
 - attr(*, "Labels")= chr [1:251] "1" "2" "3" "4" ...
 - attr(*, "Diag")= logi FALSE
 - attr(*, "Upper")= logi FALSE
 - attr(*, "method")= chr "euclidean"
 - attr(*, "call")= language dist(x = nor)


> distance  # output truncated in this post, as it has 251 observations.

             1          2          3          4          5          6          7
2    1.3346445                                                                  
3    2.0854437  2.5474796                                                       
4    1.8993458  1.4908813  2.5840752                                            
5    3.0790252  3.4485667  2.2165366  2.7021809                                 
             8          9         10         11         12         13         14
2                                                                               
3                                                                               
4                                                                               
5                                                                               
            15         16         17         18         19         20         21

The output continues like this, empty, for the remaining 247 comparisons.

Now, I reduce the data set to 20 observations.

Here I get a proper distance matrix.

distancetiny=dist(nor)

> str(distancetiny)
 'dist' num [1:1176] 1.14 1.8 1.61 2.62 3.39 ...
 - attr(*, "Size")= int 49
 - attr(*, "Labels")= chr [1:49] "1" "2" "3" "4" ...
 - attr(*, "Diag")= logi FALSE
 - attr(*, "Upper")= logi FALSE
 - attr(*, "method")= chr "euclidean"
 - attr(*, "call")= language dist(x = nor)

> distancetiny
            1          2          3          4          5          6          7
2   1.1380433                                                                  
3   1.7990293  2.2088928                                                       
4   1.6064118  1.2871522  2.2483586                                            
5   2.6235853  2.9669283  1.9132224  2.3256624                                 
6   3.3898119  3.3730508  3.3718447  2.2615557  2.0094434                      
7   1.8947704  2.0065514  1.7685604  1.1065940  1.7387938  2.2321156           
8   1.1732465  1.0663217  1.6733689  0.8873140  2.1959298  2.7939555  1.1448269
9   2.2721969  2.0545882  3.4263262  1.4058375  3.1811955  2.4011074  2.3078714
10  2.3753110  2.2424464  3.0289947  1.2808398  2.3230202  1.4242653  1.8571654
11  1.5620472  1.1878554  2.5750350  0.5718248  2.7714795  2.6314286  1.5132365
12  3.5088571  3.2484020  4.1164488  2.2723772  3.1377318  1.4795230  2.8274818
13  2.1448841  2.2679705  1.8726670  1.3494988  1.2176727  1.5544030  1.0725518
14  3.6679035  3.7459402  3.6869023  2.6677308  2.1318420  0.7347359  2.5729973
15  2.9908457  3.3312661  3.1289870  2.4340473  1.8027070  1.3626019  2.3795360
16  1.6117570  2.0283356  1.2011116  1.5961064  1.3196981  2.4456436  1.2569683
17  3.2991393  3.5991747  3.0438049  2.6066933  1.4742664  1.0945621  2.2214101
18  3.9409008  4.0726826  4.0113908  2.9250144  2.5228901  0.9087254  2.8158563
19  2.7468511  2.9495031  3.2439229  1.8312508  2.4122436  1.3932604  1.9640170
20  3.7515064  3.7021743  3.9404231  2.5813440  2.5390519  0.8352961  2.6530503
21  2.3102053  2.3878491  2.0836800  1.4328028  1.2991221  1.5287862  1.1769205

There are no empty values in the output when the number of observations is 21.

Why is this so? Does dist() stop working when the observation count goes beyond a threshold?

I'm unable to figure it out. Please help.

This seems to be a size issue that affects only the display. When the data set contains more than roughly 60-80 observations, the printed distance matrix is not displayed properly (even for the initial rows). The values themselves are all present; we just cannot see them in the printed output. Further operations on the distance matrix (such as hierarchical agglomerative clustering) confirmed that the odd display is nothing to worry about.
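A likely explanation for the blank cells: printing a `dist` object shows only the lower triangle, and the console wraps wide output into blocks of columns. In the block for columns 8-14, rows 2-5 lie above the diagonal, so they print as blanks; the real values for those columns start at row 9. The sketch below is one way to check that nothing is missing. It uses synthetic random data as a stand-in for `mydata` (an assumption, since the original data is not included in the post), and the choice of `k = 3` clusters is purely illustrative.

```r
# Synthetic stand-in for the poster's data (assumption): 251 rows, 7 numeric columns.
set.seed(1)
mydata <- as.data.frame(matrix(rnorm(251 * 7), nrow = 251))

# scale() defaults center by column means and scale by column sds,
# which matches center = means, scale = sds in the question.
nor <- scale(mydata)
distance <- dist(nor)

# A dist object stores one value per pair of rows: n * (n - 1) / 2.
n <- attr(distance, "Size")
length(distance) == n * (n - 1) / 2   # TRUE (251 * 250 / 2 = 31375)
anyNA(distance)                       # FALSE: no value is actually missing

# Convert to a full symmetric matrix to inspect any corner directly,
# including the cells that print as blanks in the dist display.
m <- as.matrix(distance)
round(m[1:5, 8:14], 3)

# Downstream use is unaffected, e.g. hierarchical agglomerative clustering.
hc <- hclust(distance)
groups <- cutree(hc, k = 3)           # k = 3 is an arbitrary illustrative choice
table(groups)
```

Converting with `as.matrix()` costs memory for the full n-by-n matrix, which is fine at this size; for a quick look at just the first few rows and columns, `as.matrix(distance)[1:10, 1:10]` is usually enough.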
