Match by minimum Euclidean distance in R

I have the following dataset:

X <- data.frame(PERMNO = c(10001,10002,10003,10001,10002,10003),
                Date = c("Nov 2021","Nov 2021","Nov 2021","Dec 2021","Dec 2021","Dec 2021"),
                ME = c(100,95,110,110,115,108),
                IVOL = c(1,1.1,0.8,0.7,1,2.1),
                C = c(NA, 2, 3, NA, 4, 1.5))

For firm 10001, C is missing. Each month, I want to fill a missing C with the C of another firm: the firm with non-missing C that minimizes the Euclidean distance between its ranked ME and ranked IVOL and those of the firm with missing C.
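Written out, and assuming the ranks are taken across firms within each month (the month subscript t is notation introduced here for illustration), the distance between firm i with missing C and a candidate firm j would be:

```latex
d_{ij,t} = \sqrt{\left(\operatorname{rank}_t(ME_{i,t}) - \operatorname{rank}_t(ME_{j,t})\right)^2
         + \left(\operatorname{rank}_t(IVOL_{i,t}) - \operatorname{rank}_t(IVOL_{j,t})\right)^2}
```

Firm i then receives C from the firm j that minimizes this distance among the firms with non-missing C in month t.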

X in my application has more PERMNOs and a longer time frame, and multiple firms may have C missing. My question is how to code this efficiently in R.

Taking the ranks is straightforward with rank(), and the Euclidean distance can be calculated using outer(), if I am correct. However, I struggle with forming the pairs of firms i and j, selecting the pair with the minimum distance, and then assigning C from firm j to the missing C of firm i.

Maybe this helps:

library(tidyverse)

X <- data.frame(
  PERMNO = c(10001, 10002, 10003, 10001, 10002, 10003),
  Date = c("Nov 2021", "Nov 2021", "Nov 2021", "Dec 2021", "Dec 2021", "Dec 2021"),
  ME = c(100, 95, 110, 110, 115, 108),
  IVOL = c(1, 1.1, 0.8, 0.7, 1, 2.1),
  C = c(NA, 2, 3, NA, 4, 1.5)
) %>% as_tibble()
X
#> # A tibble: 6 x 5
#>   PERMNO Date        ME  IVOL     C
#>    <dbl> <chr>    <dbl> <dbl> <dbl>
#> 1  10001 Nov 2021   100   1    NA  
#> 2  10002 Nov 2021    95   1.1   2  
#> 3  10003 Nov 2021   110   0.8   3  
#> 4  10001 Dec 2021   110   0.7  NA  
#> 5  10002 Dec 2021   115   1     4  
#> 6  10003 Dec 2021   108   2.1   1.5

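imputations <-
  X %>%
  # suffix the columns so the table can be cross-joined with itself
  rename_all(~ paste0(.x, ".1")) %>%
  expand_grid(X %>% rename_all(~ paste0(., ".2"))) %>%
  mutate(
    # ranks are taken over the whole cross-joined table, not within each Date
    dist = sqrt((rank(ME.1) - rank(ME.2))^2 + (rank(IVOL.1) - rank(IVOL.2))^2)
  ) %>%
  group_by(PERMNO.1) %>%
  # drop rows pairing a firm with itself, then keep the closest row per firm
  filter(PERMNO.1 != PERMNO.2) %>%
  arrange(dist) %>%
  slice(1) %>%
  ungroup() %>%
  transmute(
    PERMNO = PERMNO.1,
    # use the row's own C if present, otherwise the matched firm's C
    imputed.C = case_when(
      !is.na(C.1) ~ C.1,
      !is.na(C.2) ~ C.2
    )
  )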
imputations
#> # A tibble: 3 x 2
#>   PERMNO imputed.C
#>    <dbl>     <dbl>
#> 1  10001         3
#> 2  10002         2
#> 3  10003         3

X %>%
  left_join(imputations) %>%
  mutate(C = ifelse(is.na(C), imputed.C, C)) %>%
  select(-imputed.C)
#> Joining, by = "PERMNO"
#> # A tibble: 6 x 5
#>   PERMNO Date        ME  IVOL     C
#>    <dbl> <chr>    <dbl> <dbl> <dbl>
#> 1  10001 Nov 2021   100   1     3  
#> 2  10002 Nov 2021    95   1.1   2  
#> 3  10003 Nov 2021   110   0.8   3  
#> 4  10001 Dec 2021   110   0.7   3  
#> 5  10002 Dec 2021   115   1     4  
#> 6  10003 Dec 2021   108   2.1   1.5

Created on 2022-02-18 by the reprex package (v2.0.0)
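Note that the pipeline above joins the imputations by PERMNO only, so both months of firm 10001 receive the same value, and the ranks are computed over the cross-joined table rather than within a month. A minimal sketch of a per-month variant follows. It assumes ME and IVOL have no missing values; `fill_by_rank` is a hypothetical helper name introduced here, and ties are broken by row order, which is an arbitrary choice.

```r
library(dplyr)

X <- data.frame(
  PERMNO = c(10001, 10002, 10003, 10001, 10002, 10003),
  Date = c("Nov 2021", "Nov 2021", "Nov 2021", "Dec 2021", "Dec 2021", "Dec 2021"),
  ME = c(100, 95, 110, 110, 115, 108),
  IVOL = c(1, 1.1, 0.8, 0.7, 1, 2.1),
  C = c(NA, 2, 3, NA, 4, 1.5)
)

# Hypothetical helper: fill missing C within ONE month from the nearest donor
# in (rank(ME), rank(IVOL)) space. Assumes ME and IVOL contain no NA.
fill_by_rank <- function(C, ME, IVOL) {
  miss  <- which(is.na(C))
  donor <- which(!is.na(C))
  # nothing to fill, or nobody to borrow from: return C unchanged
  if (length(miss) == 0 || length(donor) == 0) return(C)
  r_me <- rank(ME)
  r_iv <- rank(IVOL)
  # rows = donors, columns = firms with missing C; squared distances
  # give the same minimizer as the square-rooted ones
  d2 <- outer(r_me[donor], r_me[miss], "-")^2 +
    outer(r_iv[donor], r_iv[miss], "-")^2
  # which.min returns the first minimum, so ties go to the first donor
  C[miss] <- C[donor[apply(d2, 2, which.min)]]
  C
}

X_filled <- X %>%
  group_by(Date) %>%
  mutate(C = fill_by_rank(C, ME, IVOL)) %>%
  ungroup()
```

On the example data, November is a tie: firms 10002 and 10003 are both at rank distance sqrt(2) from firm 10001, so the filled value there depends on the tie-breaking rule. In December the closest donor is firm 10002.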

A data.table solution using colMins from the Rfast package.

library(data.table)

X <- data.frame(PERMNO = c(10001,10002,10003,10001,10002,10003),
                Date = c("Nov 2021","Nov 2021","Nov 2021","Dec 2021","Dec 2021","Dec 2021"),
                ME = c(100,95,110,110,115,108),
                IVOL = c(1,1.1,0.8,0.7,1,2.1),
                C = c(NA, 2, 3, NA, 4, 1.5))

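fFillNA <- function(C, ME, IVOL) {
  idxNA <- which(is.na(C))
  # squared distances: rows = firms with C, columns = firms with missing C
  d2 <- outer(ME[-idxNA], ME[idxNA], "-")^2 + outer(IVOL[-idxNA], IVOL[idxNA], "-")^2
  # colMins returns, for each column, the row index of the minimum
  C[idxNA] <- C[-idxNA][Rfast::colMins(d2)]
  C
}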

setDT(X)[, C := if(anyNA(C)) fFillNA(C, ME, IVOL), by = "Date"]
X
#>    PERMNO     Date  ME IVOL   C
#> 1:  10001 Nov 2021 100  1.0 2.0
#> 2:  10002 Nov 2021  95  1.1 2.0
#> 3:  10003 Nov 2021 110  0.8 3.0
#> 4:  10001 Dec 2021 110  0.7 1.5
#> 5:  10002 Dec 2021 115  1.0 4.0
#> 6:  10003 Dec 2021 108  2.1 1.5

No need to take the square root to get the index of the minimum distance. Also, notice that because of the difference in scale, ME affects the distance calculation much more than IVOL, at least for the example dataset. Consider normalizing ME and IVOL before computing the distance.
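One way the normalization could look is sketched below: ME and IVOL are z-scored within each Date before the squared distances are computed. `fFillNAScaled` is a hypothetical name introduced here, base `which.min` stands in for `Rfast::colMins`, and the sketch assumes each month has at least two firms and non-constant ME and IVOL (otherwise `sd()` is NA or zero).

```r
library(data.table)

X <- data.table(PERMNO = c(10001,10002,10003,10001,10002,10003),
                Date = c("Nov 2021","Nov 2021","Nov 2021","Dec 2021","Dec 2021","Dec 2021"),
                ME = c(100,95,110,110,115,108),
                IVOL = c(1,1.1,0.8,0.7,1,2.1),
                C = c(NA, 2, 3, NA, 4, 1.5))

# Hypothetical variant of fFillNA: standardize both variables within the
# group so that neither dominates the distance because of its scale.
fFillNAScaled <- function(C, ME, IVOL) {
  idxNA <- which(is.na(C))
  # nothing to fill, or no donors available: return C unchanged
  if (length(idxNA) == 0L || length(idxNA) == length(C)) return(C)
  z <- function(x) (x - mean(x)) / sd(x)
  zME <- z(ME)
  zIVOL <- z(IVOL)
  # rows = firms with C, columns = firms with missing C
  d2 <- outer(zME[-idxNA], zME[idxNA], "-")^2 +
    outer(zIVOL[-idxNA], zIVOL[idxNA], "-")^2
  C[idxNA] <- C[-idxNA][apply(d2, 2, which.min)]
  C
}

X[, C := fFillNAScaled(C, ME, IVOL), by = Date]
```

On the example data this picks firm 10002 as the donor in both months, so the December value for firm 10001 becomes 4 instead of the 1.5 obtained from the unscaled distances. Replacing `z()` with `rank()` would give the ranked distance asked for in the question.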
