
Unique pairwise distances between all points in a data frame

I have a list of ten points with X and Y coordinates. I would like to calculate the distance between every pair of points, keeping each unordered pair only once: of the distances 1-2 and 2-1, only one should be present. I have managed to remove the distance of each point to itself, but I couldn't remove the duplicated pairs.

library(dplyr)

# Data Generation
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = 1:10)

# Temporary key creation (used to cross join df with itself)
df <- df %>% mutate(key = 1)

# Calculating pairwise distances
df %>% full_join(df, by = "key") %>%
  mutate(dist = sqrt((X.x - X.y)^2 + (Y.x - Y.y)^2)) %>%
  select(ID.x, ID.y, dist) %>%
  filter(dist != 0) %>%   # drop the distance of each point to itself
  head(11)

# Output 
#    ID.x ID.y       dist
# 1     1    2 0.90858911
# 2     1    3 0.71154587
# 3     1    4 0.05687495
# 4     1    5 1.03885510
# 5     1    6 0.93747717
# 6     1    7 0.62070415
# 7     1    8 0.88351690
# 8     1    9 0.89651911
# 9     1   10 0.05079906
# 10    2    1 0.90858911
# 11    2    3 0.27530175

How can I achieve the expected output shown below?

# Expected Output 
#    ID.x ID.y       dist
# 1     1    2 0.90858911
# 2     1    3 0.71154587
# 3     1    4 0.05687495
# 4     1    5 1.03885510
# 5     1    6 0.93747717
# 6     1    7 0.62070415
# 7     1    8 0.88351690
# 8     1    9 0.89651911
# 9     1   10 0.05079906
# 10    2    3 0.27530175
# 11    2    4 0.5415415

Also, this approach is computationally slower than dist(). I would be glad to hear about faster approaches.

I would use dist() on the data and then process the output into the required format. You can replace dist() with any other distance function. Here I've used letters rather than numbers as IDs to better show what is happening.

library(dplyr)
library(tibble)
library(tidyr)

set.seed(42)
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = letters[1:10])

df %>% 
  column_to_rownames("ID") %>% # make the ID the row names; dist() will use these as labels (NB: will not work on a tibble)
  dist() %>% 
  as.matrix() %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "ID.x") %>% # capture the row IDs
  gather(key = ID.y, value = dist, -ID.x) %>% # reshape to long format, one row per ordered pair
  filter(ID.x < ID.y) %>% # keep each unordered pair once and drop self-pairs
  as_tibble()

   # A tibble: 45 x 3
    ID.x  ID.y      dist
   <chr> <chr>     <dbl>
 1     a     b 0.2623175
 2     a     c 0.7891034
 3     b     c 0.6856994
 4     a     d 0.2191960
 5     b     d 0.4757855
 6     c     d 0.8704269
 7     a     e 0.2730984
 8     b     e 0.3913770
 9     c     e 0.5912681
10     d     e 0.2800021
# ... with 35 more rows

dist() is much faster than computing the distances in a loop. The code could probably be made more efficient by working directly on the dist object rather than converting it into a matrix.
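As a rough sketch of that idea (not part of the original answer): a dist object stores the lower triangle column by column, which is the same order in which combn(n, 2) enumerates index pairs, so the two can be lined up without building the full matrix. The variable names idx and pairs are only illustrative, and the data are regenerated here so the block runs on its own.

```r
set.seed(42)
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1),
                 ID = letters[1:10], stringsAsFactors = FALSE)

# Distances on the coordinate columns only; length is n * (n - 1) / 2
d <- dist(df[, c("X", "Y")])

# Index pairs (i, j) with i < j, in the same order as the dist vector:
# (1,2), (1,3), ..., (1,n), (2,3), ...
idx <- combn(nrow(df), 2)

pairs <- data.frame(ID.x = df$ID[idx[1, ]],
                    ID.y = df$ID[idx[2, ]],
                    dist = as.vector(d),
                    stringsAsFactors = FALSE)

head(pairs)
```

This gives one row per unordered pair directly, so no filtering step is needed afterwards. The rows come out grouped by ID.x rather than by ID.y as in the gather() version.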

Perhaps this is a slightly simpler approach:

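df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = 1:10)

# Use only the coordinate columns: dist(df) would treat ID as a third coordinate
df2 <- data.frame(ID1 = rep(1:10, each = 10),
                  ID2 = 1:10,
                  distance = as.vector(as.matrix(dist(df[, c("X", "Y")]))))

Then get rid of the diagonal:

df2 <- df2[df2$ID1 != df2$ID2,]

Get rid of the upper triangle (this filter on its own would also drop the diagonal):

df2 <- df2[df2$ID1 < df2$ID2,]
df2

The output below appears to have been produced by calling dist(df) on all three columns, so the ID column dominates and each distance is close to |ID1 - ID2|. With the coordinate-only call above, the row layout is the same but the distances lie between 0 and sqrt(2).

   ID1 ID2 distance
2    1   2 1.000615
3    1   3 2.057813
4    1   4 3.010261
5    1   5 4.039502
6    1   6 5.029982
7    1   7 6.035427
8    1   8 7.012540
9    1   9 8.006249
10   1  10 9.015352
13   2   3 1.099245
14   2   4 2.011664
...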
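The two filtering steps can also be folded into one by picking out the lower triangle of the distance matrix directly. The following is only a sketch under the assumption that distances should be computed from X and Y alone; keep is an illustrative name, and the seed is arbitrary.

```r
set.seed(1)
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = 1:10)

# Full symmetric distance matrix on the coordinate columns only
m <- as.matrix(dist(df[, c("X", "Y")]))

# Row/column positions of the strict lower triangle (row > col),
# which excludes the diagonal and one copy of every pair
keep <- which(lower.tri(m), arr.ind = TRUE)

df2 <- data.frame(ID1 = df$ID[keep[, "col"]],
                  ID2 = df$ID[keep[, "row"]],
                  distance = m[keep])

head(df2)
```

Because lower.tri() is strict by default, the diagonal never enters the result, so no separate ID1 != ID2 step is required.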
