[英]r distance between rows
I apologize this is my attempt at redeeming myself after a disastrous earlier attempt . 我很抱歉这是我在一次灾难性的早期尝试后赎回自己的尝试。 Now I have a bit more clarity.
现在我有了更多的清晰度。 So here I go again.
所以我再来一次
My goal is to find rows that are similar. 我的目标是找到相似的行。 So first I am interested in calculating the distance between rows.
所以首先我有兴趣计算行之间的距离。 This is a test dataset below.
这是下面的测试数据集。
Row Blood x1 x2 x3 x4
1 A 0.01 0.16 0.31 0.46
2 A 0.02 0.17 0.32 0.47
3 A 0.03 0.18 0.33 0.48
4 B 0.05 0.20 0.35 0.49
5 B 0.06 0.21 0.36 0.50
6 B 0.07 0.22 0.37 0.51
7 AB 0.09 0.24 0.39 0.52
8 AB 0.1 0.25 0.4 0.53
9 AB 0.11 0.26 0.41 0.54
10 O 0.13 0.28 0.43 0.55
11 O 0.14 0.29 0.44 0.56
12 O 0.15 0.3 0.45 0.57
There are two things here 1) Distance 2) Rows 这里有两件事1)距离2)行
Consider this row combination. 考虑这个行组合。
For Row(1-4-7-10) , distance D = (d1,4 + d1,7 + d1,10 + d4,7 + d4,10 + d7,10)/6 对于行(1-4-7-10),距离D =(d1,4 + d1,7 + d1,10 + d4,7 + d4,10 + d7,10)/ 6
{ Row1-Blood A, Row1-Blood B, Row1- Blood AB, Row1- Blood O }
Distance between Row{1,4,7,10} is calculated based on this concept 基于该概念计算行{1,4,7,10}之间的距离
d1,4 = Distance between : Row1-Blood A, Row1-Blood B
d1,7 = Distance between : Row1-Blood A, Row1-Blood AB
d1,10 = Distance between : Row1-Blood A, Row1-Blood O
d4,7 = Distance between : Row1-Blood B, Row1-Blood AB
d4,10 = Distance between : Row1-Blood B, Row1-Blood O
d7,10 = Distance between : Row1-Blood AB, Row1-Blood O
d-1-4 = (0.01-0.05)^2 + (0.16-0.20)^2 + (0.31-0.35)^2 + (0.46-0.49)^2
d-1-7 = (0.01-0.09)^2 + (0.16-0.24)^2 + (0.31-0.39)^2 + (0.46-0.52)^2
d-1-10 = (0.01-0.13)^2 + (0.16-0.28)^2 + (0.31-0.43)^2 + (0.46-0.55)^2
d-4-7 = (0.05-0.09)^2 + (0.20-0.24)^2 + (0.35-0.39)^2 + (0.49-0.52)^2
d-4-10 = (0.05-0.13)^2 + (0.20-0.28)^2 + (0.35-0.43)^2 + (0.49-0.55)^2
d-7-10 = (0.09-0.13)^2 + (0.24-0.30)^2 + (0.39-0.43)^2 + (0.52-0.55)^2
Similarly I am interested in calculating the distances between 81 different row combinations (3*3*3*3). 同样,我有兴趣计算81种不同行组合之间的距离(3 * 3 * 3 * 3)。
The final expected dataset should look like this below. 最终预期数据集应如下所示。
Row Distance
1-4-7-10
1-4-7-11
1-4-7-12
1-4-8-10
1-4-8-11
1-4-8-12
1-4-9-10
1-4-9-11
1-4-9-12
1-5-7-10
1-5-7-11
1-5-7-12
1-5-8-10
1-5-8-11
1-5-8-12
1-5-9-10
1-5-9-11
1-5-9-12
1-6-7-10
.
.
.
3-6-9-12
I know I can do this with 4 nested loops and lists. 我知道我可以用4个嵌套循环和列表来做到这一点。 I am wondering if there is a more efficient way to accomplish this.
我想知道是否有更有效的方法来实现这一目标。
Similar to the other solution, but I think you could do some matrix indexing inside the function applied to each combination to select the correct cells to add up: 与其他解决方案类似,但我认为您可以在应用于每个组合的函数内部进行一些矩阵索引,以选择要添加的正确单元格:
Be aware that the default ?dist
calculation is: 请注意默认的
?dist
计算是:
sqrt(sum((x_i - y_i)^2))
...while you are using: ......当你使用时:
sum((x_i - y_i)^2)
...so I square the result below: ...所以我将结果放在下面:
dd <- as.matrix(dist(dat[-(1:2)]))^2
apply(
expand.grid(split(dat$Row, dat$Blood)),
1,
function(x) sum(dd[t(combn(x,2))])
)
# [1] 0.1140 0.0972 0.0828 0.1212 0.1036 0.0884 0.1308 ...
Checks out versus manual calculation for the first desired result: 检查第一个期望结果的手动计算:
L <- c(
d1_4 = (0.01-0.05)^2 + (0.16-0.20)^2 + (0.31-0.35)^2 + (0.46-0.49)^2,
d1_7 = (0.01-0.09)^2 + (0.16-0.24)^2 + (0.31-0.39)^2 + (0.46-0.52)^2,
d1_10 = (0.01-0.13)^2 + (0.16-0.28)^2 + (0.31-0.43)^2 + (0.46-0.55)^2,
d4_7 = (0.05-0.09)^2 + (0.20-0.24)^2 + (0.35-0.39)^2 + (0.49-0.52)^2,
d4_10 = (0.05-0.13)^2 + (0.20-0.28)^2 + (0.35-0.43)^2 + (0.49-0.55)^2,
d7_10 = (0.09-0.13)^2 + (0.24-0.28)^2 + (0.39-0.43)^2 + (0.52-0.55)^2
)
sum(L)
# 0.114
"My goal is to find rows that are similar." “我的目标是找到相似的行。”
Two possible approaches: 两种可能的方法:
1) k-means, to separate all data into k
different clusters, identified to find the smallest distance to the centroid of each cluster. 1)k-means,用于将所有数据分成
k
不同的簇,被识别以找到到每个簇的质心的最小距离。
blood_fake$cluster_assignment <- kmeans(blood_fake[, -c(1:2)], centers = 10)$cluster
library(ggplot2)
ggplot(blood_fake, aes(x1, x2, color = as.factor(cluster_assignment))) +
geom_point(size = 0.3) +
theme_minimal() +
theme(legend.position = "bottom")
2) fuzzyjoin::distance_left_join
can be used to find matches that are within a distance threshold. 2)
fuzzyjoin::distance_left_join
可用于查找距离阈值内的匹配。 It worked ok for me with 10,000 rows on an old computer with 4 GB RAM if I ran separately on subsets, but froze up when I tried with all at once. 如果我在子集上单独运行,那么在具有4 GB RAM的旧计算机上有10,000行可以正常工作,但是当我一次尝试所有内容时冻结了。
library(tidyverse); library(fuzzyjoin)
blood_fake %>%
filter(type == "A") %>%
distance_left_join(blood_fake, by = c("x1", "x2", "x3", "x4"), distance_col = "dist", max_dist = 0.05) %>%
filter(dist > 0) %>%
arrange(dist)
# row.x type.x x1.x x2.x x3.x x4.x row.y type.y x1.y x2.y x3.y x4.y dist
#1 8362 A 0.055618062 0.008783874 0.001162073 0.145280936 4786 B 0.05807814 0.009353543 0.002046247 0.146829206 0.003091180
#2 4284 A 0.163417186 0.032845642 0.114224202 0.339505310 2060 AB 0.16676132 0.031621044 0.115635984 0.339447690 0.003831363
#3 8338 A 0.194389332 0.070951537 0.132582667 0.004634504 4839 AB 0.19793256 0.067944898 0.130012004 0.005525959 0.005384918
#4 6849 A 0.277700944 0.027618307 0.034390833 0.158798952 7698 A 0.27344845 0.025502562 0.033016888 0.160972663 0.005401185
#5 7698 A 0.273448453 0.025502562 0.033016888 0.160972663 6849 A 0.27770094 0.027618307 0.034390833 0.158798952 0.005401185
#6 4281 A 0.281189896 0.323468620 0.107589336 0.096526579 6251 A 0.27891482 0.321343667 0.109619143 0.100667052 0.005563725
test data 测试数据
n <- 10000
set.seed(42)
blood_fake <- data.frame(row = 1:n,
type = sample(c("A","B","AB","O"), n, replace = T),
x1 = runif(n, min = 0, max = 0.5),
x2 = runif(n, min = 0, max = 0.5),
x3 = runif(n, min = 0, max = 0.5),
x4 = runif(n, min = 0, max = 0.5)
)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.