Match by minimum Euclidean distance in R
I have the following dataset:
X <- data.frame(PERMNO = c(10001,10002,10003,10001,10002,10003),
Date = c("Nov 2021","Nov 2021","Nov 2021","Dec 2021","Dec 2021","Dec 2021"),
ME = c(100,95,110,110,115,108),
IVOL = c(1,1.1,0.8,0.7,1,2.1),
C = c(NA, 2, 3,NA, 4, 1.5))
For firm 10001, C is missing. I want to fill C each month by matching: the missing value should be taken from the firm with non-missing C that minimizes the Euclidean distance, in terms of ranked ME and ranked IVOL, to the firm with missing C.
In my application, X has more PERMNOs and a longer time frame, and multiple firms may have C missing. My question is how to code this efficiently in R.
Taking the ranks is straightforward with rank(), and, if I am correct, the Euclidean distance can be calculated using outer(). However, I struggle with forming the pairs of firms i and j, selecting the minimum distance, and then matching C from firm j to the missing C of firm i.
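The rank() and outer() idea described above can be sketched for a single month as follows. This is only an illustration using the November 2021 rows of the example data; the object names (nov, r_me, r_ivol, d, d_sub) are made up for this sketch.

```r
# November 2021 rows of the example data.
nov <- data.frame(
  PERMNO = c(10001, 10002, 10003),
  ME     = c(100, 95, 110),
  IVOL   = c(1, 1.1, 0.8),
  C      = c(NA, 2, 3)
)

# Within-month ranks of the two matching variables.
r_me   <- rank(nov$ME)
r_ivol <- rank(nov$IVOL)

# Pairwise Euclidean distances in rank space; entry [i, j] compares firm i with firm j.
d <- sqrt(outer(r_me, r_me, "-")^2 + outer(r_ivol, r_ivol, "-")^2)
dimnames(d) <- list(as.character(nov$PERMNO), as.character(nov$PERMNO))

# Keep candidate firms (non-missing C) as rows and firms with missing C as columns.
miss  <- is.na(nov$C)
d_sub <- d[!miss, miss, drop = FALSE]
d_sub
```

Each column of d_sub then holds the distances from one firm with missing C to every candidate firm, so the matching step reduces to finding the row of the minimum in each column.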
Maybe this helps:
library(tidyverse)
X <- data.frame(
PERMNO = c(10001, 10002, 10003, 10001, 10002, 10003),
Date = c("Nov 2021", "Nov 2021", "Nov 2021", "Dec 2021", "Dec 2021", "Dec 2021"),
ME = c(100, 95, 110, 110, 115, 108),
IVOL = c(1, 1.1, 0.8, 0.7, 1, 2.1),
C = c(NA, 2, 3, NA, 4, 1.5)
) %>% as_tibble()
X
#> # A tibble: 6 x 5
#> PERMNO Date ME IVOL C
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 10001 Nov 2021 100 1 NA
#> 2 10002 Nov 2021 95 1.1 2
#> 3 10003 Nov 2021 110 0.8 3
#> 4 10001 Dec 2021 110 0.7 NA
#> 5 10002 Dec 2021 115 1 4
#> 6 10003 Dec 2021 108 2.1 1.5
imputations <-
X %>%
rename_all(~ paste0(.x, ".1")) %>%
expand_grid(X %>% rename_all(~ paste0(., ".2"))) %>%
mutate(
dist = sqrt((rank(ME.1) - rank(ME.2))**2 + (rank(IVOL.1) - rank(IVOL.2))**2)
) %>%
group_by(PERMNO.1) %>%
filter(PERMNO.1 != PERMNO.2) %>%
arrange(dist) %>%
slice(1) %>%
ungroup() %>%
transmute(
PERMNO = PERMNO.1,
imputed.C = case_when(
!is.na(C.1) ~ C.1,
!is.na(C.2) ~ C.2
)
)
imputations
#> # A tibble: 3 x 2
#> PERMNO imputed.C
#> <dbl> <dbl>
#> 1 10001 3
#> 2 10002 2
#> 3 10003 3
X %>%
left_join(imputations) %>%
mutate(C = ifelse(is.na(C), imputed.C, C)) %>%
select(-imputed.C)
#> Joining, by = "PERMNO"
#> # A tibble: 6 x 5
#> PERMNO Date ME IVOL C
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 10001 Nov 2021 100 1 3
#> 2 10002 Nov 2021 95 1.1 2
#> 3 10003 Nov 2021 110 0.8 3
#> 4 10001 Dec 2021 110 0.7 3
#> 5 10002 Dec 2021 115 1 4
#> 6 10003 Dec 2021 108 2.1 1.5
Created on 2022-02-18 by the reprex package (v2.0.0)
A data.table solution using colMins from the Rfast package.
library(data.table)
X <- data.frame(PERMNO = c(10001,10002,10003,10001,10002,10003),
Date = c("Nov 2021","Nov 2021","Nov 2021","Dec 2021","Dec 2021","Dec 2021"),
ME = c(100,95,110,110,115,108),
IVOL = c(1,1.1,0.8,0.7,1,2.1),
C = c(NA, 2, 3, NA, 4, 1.5))
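# Fill missing C within one Date group from the nearest firm with non-missing C.
fFillNA <- function(C, ME, IVOL) {
  idxNA <- which(is.na(C))
  # Squared distances: rows are firms with C, columns are firms with missing C.
  d2 <- outer(ME[-idxNA], ME[idxNA], "-")^2 + outer(IVOL[-idxNA], IVOL[idxNA], "-")^2
  # colMins() returns, for each column, the row index of the minimum.
  C[idxNA] <- C[-idxNA][Rfast::colMins(d2)]
  C
}
# `else C` leaves groups without missing values unchanged.
setDT(X)[, C := if (anyNA(C)) fFillNA(C, ME, IVOL) else C, by = "Date"]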
X
#> PERMNO Date ME IVOL C
#> 1: 10001 Nov 2021 100 1.0 2.0
#> 2: 10002 Nov 2021 95 1.1 2.0
#> 3: 10003 Nov 2021 110 0.8 3.0
#> 4: 10001 Dec 2021 110 0.7 1.5
#> 5: 10002 Dec 2021 115 1.0 4.0
#> 6: 10003 Dec 2021 108 2.1 1.5
No need to take the square root to get the index of the minimum distance. Also, notice that because of relative size, ME impacts the distance calculation much more than IVOL, at least for the example dataset. Maybe consider normalizing ME and IVOL in the distance calculation.
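One way to put ME and IVOL on a comparable scale, and the one the question asks for, is to rank both variables within each month before computing distances. Below is a minimal base R sketch under that assumption; fill_by_rank is a hypothetical helper name, and apply() with which.min() stands in for Rfast::colMins() so that no extra package is needed.

```r
X <- data.frame(
  PERMNO = c(10001, 10002, 10003, 10001, 10002, 10003),
  Date   = c("Nov 2021", "Nov 2021", "Nov 2021", "Dec 2021", "Dec 2021", "Dec 2021"),
  ME     = c(100, 95, 110, 110, 115, 108),
  IVOL   = c(1, 1.1, 0.8, 0.7, 1, 2.1),
  C      = c(NA, 2, 3, NA, 4, 1.5)
)

# Hypothetical helper: fill missing C within one month using within-month ranks.
fill_by_rank <- function(C, ME, IVOL) {
  miss <- which(is.na(C))
  have <- which(!is.na(C))
  # Nothing to fill, or no candidate firm available: return C unchanged.
  if (length(miss) == 0L || length(have) == 0L) return(C)
  rME <- rank(ME)
  rIV <- rank(IVOL)
  # Squared rank distances: rows are candidate firms, columns are firms with missing C.
  d2 <- outer(rME[have], rME[miss], "-")^2 + outer(rIV[have], rIV[miss], "-")^2
  # which.min() returns the first minimum, so ties go to the earlier candidate row.
  C[miss] <- C[have][apply(d2, 2, which.min)]
  C
}

# Apply the helper month by month and write the results back in place.
for (idx in split(seq_len(nrow(X)), X$Date)) {
  X$C[idx] <- fill_by_rank(X$C[idx], X$ME[idx], X$IVOL[idx])
}
X
```

On the example data the result differs from the unranked version. In November both candidate firms are equally far from firm 10001 in rank space, so the first one (10002, C = 2) is used. In December firm 10002 (C = 4) is the nearest. With real data a deliberate tie-breaking rule may be preferable to relying on row order.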