R: Find rows in a data frame that are within range of each other across multiple columns

Question

I have a data frame that looks like this:

ID   lat   long   score
1    41.5  -62.3  22.4
2    41.0  -70.2  21.9
3    42.2  -63.0  22.7
4    36.7  -72.9  20.0
5    36.2  -62.4  24.1
6    35.8  -61.7  24.7
7    40.8  -61.9  22.1

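I want to identify the rows of this data frame whose lat values are within 1 unit of each other, whose long values are within 1 unit of each other, and whose score values are within 0.7 units of each other. To indicate which rows meet these conditions, I would like to add a new column (ID.matches) that lists the ID values of the matching rows. The final data frame might look like this: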

ID   lat   long   score   ID.matches
1    41.5  -62.3  22.4    3, 7
2    41.0  -70.2  21.9    0
3    42.2  -63.0  22.7    1
4    36.7  -72.9  20.0    0
5    36.2  -62.4  24.1    6
6    35.8  -61.7  24.7    5
7    40.8  -61.9  22.1    1

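I'm not sure where to start... maybe some kind of conditional function using dplyr or sapply? I'm also not sure whether I should use a different data structure for ID.matches, since some rows will have multiple matches.

Thanks for your help!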

Answer 1

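You can use outer to check each condition and form a logical matrix (remembering to exclude the self-matching diagonal), then apply over its rows to subset the ID column and paste the result into a string: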

df$ID.matches <- apply(outer(df$lat,   df$lat,   function(x, y) abs(x - y) <   1) &
                       outer(df$long,  df$long,  function(x, y) abs(x - y) <   1) &
                       outer(df$score, df$score, function(x, y) abs(x - y) < 0.7) &
                       diag(nrow(df)) == 0, 
                       MARGIN = 1,
                       function(x) paste(df$ID[x], collapse = ", "))
df
#>   ID  lat  long score ID.matches
#> 1  1 41.5 -62.3  22.4       3, 7
#> 2  2 41.0 -70.2  21.9           
#> 3  3 42.2 -63.0  22.7          1
#> 4  4 36.7 -72.9  20.0           
#> 5  5 36.2 -62.4  24.1          6
#> 6  6 35.8 -61.7  24.7          5
#> 7  7 40.8 -61.9  22.1          1

Created on 2020-07-07 by the reprex package (v0.3.0)
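To make it easier to see what the outer calls are building, here is a minimal step-by-step sketch of the same idea. The data frame is rebuilt from the question, and within_tol is a hypothetical helper name introduced only for this illustration, not part of the original answer.

```r
# Same data as in the question
df <- data.frame(
  ID    = 1:7,
  lat   = c(41.5, 41.0, 42.2, 36.7, 36.2, 35.8, 40.8),
  long  = c(-62.3, -70.2, -63.0, -72.9, -62.4, -61.7, -61.9),
  score = c(22.4, 21.9, 22.7, 20.0, 24.1, 24.7, 22.1)
)

# Hypothetical helper: n x n logical matrix, TRUE where two values
# of v differ by less than tol
within_tol <- function(v, tol) outer(v, v, function(x, y) abs(x - y) < tol)

# One matrix per condition, combined element-wise with &
m <- within_tol(df$lat, 1) &
  within_tol(df$long, 1) &
  within_tol(df$score, 0.7)

# Every row is trivially within range of itself, so clear the diagonal
diag(m) <- FALSE

# Row i of m flags the rows that match row i; for ID 1:
df$ID[m[1, ]]
```

Row i of the matrix is exactly the logical index that the apply call in the answer uses to subset df$ID before pasting.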

Answer 2

Another approach is to use some tidyverse functions: filter down to the rows that meet the conditions, then pull the IDs of the matching rows.

# Create example data
library(tidyverse)

df <- tribble(
~ID,   ~lat,   ~long,   ~score,
1,    41.5,  -62.3,  22.4,
2,    41.0,  -70.2,  21.9,
3,    42.2,  -63.0,  22.7,
4,    36.7,  -72.9,  20.0,
5,    36.2,  -62.4,  24.1,
6,    35.8,  -61.7,  24.7,
7,    40.8,  -61.9,  22.1
)

df$ID.match <- sapply(df$ID, function(x){
  
  df %>%
    filter(abs(lat- lat[ID == x]) < 1,
           abs(long - long[ID == x]) < 1,
           abs(score - score[ID == x]) < 0.7,
           ID != x) %>%
    pull(ID) %>%
    paste0(collapse = ',')
  
})


df
#> # A tibble: 7 x 5
#>      ID   lat  long score ID.match
#>   <dbl> <dbl> <dbl> <dbl> <chr>   
#> 1     1  41.5 -62.3  22.4 "3,7"   
#> 2     2  41   -70.2  21.9 ""      
#> 3     3  42.2 -63    22.7 "1"     
#> 4     4  36.7 -72.9  20   ""      
#> 5     5  36.2 -62.4  24.1 "6"     
#> 6     6  35.8 -61.7  24.7 "5"     
#> 7     7  40.8 -61.9  22.1 "1"

Edit: here is a way to do it without sapply and $ (i.e. entirely within the tidyverse framework):

df %>%
  mutate(ID.match = map_chr(ID, function(x){
    
    df %>%
      filter(abs(lat- lat[ID == x]) < 1,
             abs(long - long[ID == x]) < 1,
             abs(score - score[ID == x]) < 0.7,
             ID != x) %>%
      pull(ID) %>%
      paste0(collapse = ',')
    
  }))
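On the question of whether ID.matches needs a different data structure: one option, sketched here in base R to keep it dependency-free, is a list column, where each cell holds a vector of matching IDs instead of a pasted string. This is an illustrative variant under that assumption, not something either answer above proposes.

```r
# Same data as in the question
df <- data.frame(
  ID    = 1:7,
  lat   = c(41.5, 41.0, 42.2, 36.7, 36.2, 35.8, 40.8),
  long  = c(-62.3, -70.2, -63.0, -72.9, -62.4, -61.7, -61.9),
  score = c(22.4, 21.9, 22.7, 20.0, 24.1, 24.7, 22.1)
)

# For each row, collect the IDs of the other rows within all three tolerances
matches <- lapply(seq_len(nrow(df)), function(i) {
  keep <- abs(df$lat - df$lat[i]) < 1 &
    abs(df$long - df$long[i]) < 1 &
    abs(df$score - df$score[i]) < 0.7
  keep[i] <- FALSE  # a row should not match itself
  df$ID[keep]
})

# Store the result as a list column: one integer vector per row
df$ID.matches <- matches

# Number of matches per row, and the matches for ID 1
lengths(df$ID.matches)
df$ID.matches[[1]]
```

Rows without a match hold a zero-length vector, so lengths(df$ID.matches) == 0 picks them out without any string parsing.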

Answer 1: 4, accepted, 2020-07-07 21:15:33

Answer 2: 2, 2020-07-07 21:38:06
