
Euclidean distance between vectors grouped by another variable in SPSS, R or Excel

I have a dataset containing something like this:

case,group,val1,val2,val3,val4
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3

I'm trying to programmatically compute the Euclidean distance between the vectors of values within groups.

This means that I have x cases in n groups. The Euclidean distance is computed between pairs of rows and then averaged for the group. So, in the example above, I first compute the mean and standard deviation of group 1 (cases 1, 2 and 5), then standardise the values, i.e. (original value - mean) / standard deviation. Next I compute the Euclidean distance (ED) between cases 1 and 2, cases 2 and 5, and cases 1 and 5, and finally average the ED for the group.

Can anyone suggest a neat and reasonably efficient way of achieving this?

Yes, it is probably easier in R...

Your data:

dat <- data.frame(case  = 1:5, 
                  group = c(1, 1, 2, 2, 1),
                  val1  = c(3, 2, 1, 5, 8),
                  val2  = c(5, 7, 3, 4, 6),
                  val3  = c(6, 5, 6, 3, 5),
                  val4  = c(8, 4, 8, 7, 3))

A short solution:

library(plyr)
vals <- c("val1", "val2", "val3", "val4")
# standardise only the value columns; the constant group column would scale to NaN
ddply(dat, "group",
      function(x) c(mean.ED = mean(dist(scale(as.matrix(x[vals]))))))
#   group  mean.ED
# 1     1 2.791628
# 2     2 2.828427
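To see what the one-liner is doing, the same calculation can be unpacked step by step for group 1 in base R. This is only an illustrative sketch: the names vals, g1, z, d and mean_ed are introduced here for the example, and the data frame is re-created so the block runs on its own. It standardises the four val columns only. A constant column such as group would standardise to NaN, and dist() then rescales the remaining columns to compensate for the missing one, which would inflate every distance by a factor of sqrt(5/4).

```r
# Re-create the example data so the sketch is self-contained
dat <- data.frame(case  = 1:5,
                  group = c(1, 1, 2, 2, 1),
                  val1  = c(3, 2, 1, 5, 8),
                  val2  = c(5, 7, 3, 4, 6),
                  val3  = c(6, 5, 6, 3, 5),
                  val4  = c(8, 4, 8, 7, 3))

vals <- c("val1", "val2", "val3", "val4")

# Step 1: take the rows of group 1 (cases 1, 2 and 5), value columns only
g1 <- dat[dat$group == 1, vals]

# Step 2: standardise each column within the group: (value - mean) / sd
z <- scale(g1)

# Step 3: Euclidean distance for each pair of rows (1-2, 1-5, 2-5)
d <- dist(z)

# Step 4: average the pairwise distances for the group (about 2.79)
mean_ed <- mean(d)

print(round(as.matrix(d), 3))
print(mean_ed)
```

Repeating the same four steps for group 2 gives sqrt(8), about 2.83, because with only two cases every standardised value is plus or minus 1/sqrt(2).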

As an example of how I would approach this in SPSS, first let's read the example data into SPSS.

data list list (",") / case group val1 val2 val3 val4 (6F1.0).
begin data
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3
end data.
dataset name orig.

Then we can use SPLIT FILE and PROXIMITIES to get our distance matrix by group. Note, as you mentioned in the comments to flodel's answer, that this produces a separate dataset we need to work with. (Also note that letter case practically never matters in SPSS syntax, e.g. split file and SPLIT FILE are equivalent.)

sort cases by group.
split file by group.
dataset declare dist.
PROXIMITIES val1, val2, val3, val4
/STANDARDIZE = Z
/MEASURE = EUCLID
/PRINT = NONE
/MATRIX = OUT('dist').

Unlike in R, basically everything in SPSS is a data matrix that behaves like an R data.frame, so SPLIT FILE comes close to functionally replacing all the different *ply functions in R. Very convenient, but less flexible in general. So now we need to aggregate the distances in the dist dataset I saved the results to. We first sum across rows, and then sum by group via an AGGREGATE command.

dataset activate dist.
compute dist_sum = SUM(VAR1 to VAR3).
* SPSS appears to keep empty cases; we do not want them in the aggregation.
select if MISSING(dist_sum) = 0.
DATASET DECLARE dist_agg.
AGGREGATE
  /OUTFILE='dist_agg'
  /BREAK=group
  /dist_sum = SUM(dist_sum)
  /N_Cases=N.
dataset activate dist_agg.
compute mean_dist = dist_sum /(N_Cases*(N_Cases - 1)).

Here I save the aggregated results into another dataset named dist_agg. Because SPSS (annoyingly) saves the full distance matrix, the divisor for the mean is not n*(n-1)/2 (as in the equivalent R syntax) but n*(n-1), assuming you do not want to count the diagonal elements towards the mean. Then we can just merge these back into the orig data file via a MATCH FILES command.

*merge back into the original dataset.
dataset activate orig.
match files file = *
/table = 'dist_agg'
/by group.
exe.

*clean out old datasets if you like.
dataset close dist.
dataset close dist_agg.

The flexibility of R to go back and forth between matrix and data.frame objects makes SPSS a bit more clunky for this job. I could write a much more concise program to do this in SPSS's MATRIX language, but doing it across groups in MATRIX is a pain in the butt (compared to R's *ply syntax).
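The n*(n-1) divisor used in the SPSS aggregation can be checked with a small sketch, written in R rather than SPSS purely for illustration (g1, full, mean_full and mean_pairs are names made up for this example). Summing every cell of the full symmetric matrix counts each pair twice and adds zeros for the diagonal, so dividing by n*(n-1) should give the same value as averaging the n*(n-1)/2 unique pairs.

```r
# Value columns for group 1 of the example data (cases 1, 2 and 5)
g1 <- data.frame(val1 = c(3, 2, 8),
                 val2 = c(5, 7, 6),
                 val3 = c(6, 5, 5),
                 val4 = c(8, 4, 3))

# dist() stores only the unique pairs (lower triangle)
d <- dist(scale(g1))

# as.matrix() expands it to the full symmetric matrix with a zero diagonal,
# which is the shape of a full distance matrix
full <- as.matrix(d)
n <- nrow(full)

# Full matrix: every pair appears twice, so divide the total by n * (n - 1)
mean_full <- sum(full) / (n * (n - 1))

# Unique pairs only: a plain mean over n * (n - 1) / 2 distances
mean_pairs <- mean(d)

print(c(full_matrix = mean_full, unique_pairs = mean_pairs))
```

Both numbers agree, which is why the two approaches report the same group mean despite using different divisors.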

Here is a much simpler solution using base R.

# columns 3:6 are val1..val4; scale() standardises them within each group
d <- by(dat[, 3:6], dat$group, function(x) dist(scale(x)))

sapply(d, mean)
