How to speed up a nested for loop in R

Question

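I have two datasets, one of which is very large. I am trying to run the following loop to create a treatment column in dataset a. However, it is far too slow. I have looked for ways to speed up the for loop, such as vectorizing it or defining the conditions outside the loop, but I am having a hard time applying these methods because I have two datasets to condition on.

Here is my code: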

library(dplyr)  # provides case_when()

reform_loop <- function(a, b){
  for(i in 1:nrow(a)) {
    for(j in 1:nrow(b)){
      # stop scanning b once this row of a has been assigned
      if(!is.na(a[i,"treatment"])){break}

      a[i,"treatment"] <- case_when(a[i,"country_code"] == b[j, "country_code"] &
                            a[i,"birth_year"] >= b[j,"cohort"] &
                            a[i,"birth_year"] <= b[j,"upper_cutoff"] ~ 1,

                          a[i,"country_code"] == b[j, "country_code"] &
                            a[i,"birth_year"] < b[j,"cohort"] &
                            a[i,"birth_year"] >= b[j,"lower_cutoff"] ~ 0)
    }
  }
  return(a)
}

a$treatment <- NA_real_  # the loop reads this column, so it must exist first
a <- reform_loop(a, b)

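You can find example datasets below. Dataset a is individual-level data with birth-year information, and dataset b is country-level data with information on some country reforms. treatment is 1 if birth_year is between cohort and upper_cutoff, and 0 if it is between lower_cutoff and cohort, within the specific country, which means the country_code variable must match as well. Anything else should be NA.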

# individual-level data, birth years
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
# country-level reform info with affected cohorts
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942))

Below is the result I would like to get:

treatment <- c(NA, 0, 1, NA, 0, 1, NA, NA)

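Unfortunately, I cannot simply merge the two datasets, because most countries in my dataset have more than one reform.

Any ideas on how to speed up this code? Thank you very much!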

Answer 1

This is a range-based non-equi join. As such, it can be done with data.table, fuzzyjoin, or sqldf.

data.table

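library(data.table)
setDT(a)  # convert to data.table by reference
setDT(b)
b[, treatment := 1L]  # flag that is copied to every matching row of a
# non-equi update join: same country, birth_year within [lower_cutoff, upper_cutoff]
a[b, treatment := i.treatment, on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff)]
a[is.na(treatment), treatment := 0L]  # rows without a match become 0
a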
#    country_code birth_year treatment
#           <num>      <num>     <int>
# 1:            2       1920         0
# 2:            2       1930         1
# 3:            2       1940         1
# 4:           10       1970         0
# 5:           10       1980         1
# 6:           10       1990         1
# 7:           10       2000         0
# 8:            8       1910         0
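
The join above codes every birth year inside [lower_cutoff, upper_cutoff] as 1 and everything else as 0, which is not quite the 0/1/NA coding the question asks for. A minimal sketch of one way to get that coding, assuming the window is split at cohort exactly as in the question's loop, is to run two update joins, one per half of the window. This starts from fresh copies of the example data and is untested on large inputs:

```r
library(data.table)

# Fresh copies of the example data from the question
a <- data.table(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b <- data.table(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942))

# lower_cutoff <= birth_year < cohort: treatment is 0
a[b, treatment := 0L,
  on = .(country_code, birth_year >= lower_cutoff, birth_year < cohort)]

# cohort <= birth_year <= upper_cutoff: treatment is 1
a[b, treatment := 1L,
  on = .(country_code, birth_year >= cohort, birth_year <= upper_cutoff)]

# Rows of a that match no reform window keep NA
a$treatment
```

If reform windows overlap within a country, a row of a can match more than one row of b, and the result for such rows may differ from the loop, which keeps the first match it finds.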

sqldf

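# b must already carry the treatment column added in the data.table step
out <- sqldf::sqldf("select a.*, b.treatment from a left join b on a.country_code=b.country_code and a.birth_year between b.lower_cutoff and b.upper_cutoff")
out$treatment[is.na(out$treatment)] <- 0L  # rows without a match become 0
out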
#   country_code birth_year treatment
# 1            2       1920         0
# 2            2       1930         1
# 3            2       1940         1
# 4           10       1970         0
# 5           10       1980         1
# 6           10       1990         1
# 7           10       2000         0
# 8            8       1910         0

fuzzyjoin

# as with sqldf, b must already carry the treatment column
fuzzyjoin::fuzzy_left_join(a, b, by = c("country_code" = "country_code", "birth_year" = "lower_cutoff", "birth_year" = "upper_cutoff"), match_fun = list(`==`, `>=`, `<=`))
#   country_code.x birth_year country_code.y lower_cutoff upper_cutoff cohort treatment
# 1              2       1920             NA           NA           NA     NA        NA
# 2              2       1930              2         1928         1948   1938         1
# 3              2       1940              2         1928         1948   1938         1
# 4             10       1970             NA           NA           NA     NA        NA
# 5             10       1980             10         1975         1995   1985         1
# 6             10       1990             10         1975         1995   1985         1
# 7             10       2000             NA           NA           NA     NA        NA
# 8              8       1910             NA           NA           NA     NA        NA

You then need to clean up the extra columns (and fill the NAs with 0).
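
That cleanup step could look roughly like the sketch below. It assumes the joined columns come back named as in the output shown above (country_code.x, birth_year, treatment) and that b carries the treatment flag; the name out is only illustrative:

```r
library(fuzzyjoin)

# Example data from the question, with the treatment flag added to b
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942),
                treatment = 1L)

out <- fuzzy_left_join(
  a, b,
  by = c("country_code" = "country_code",
         "birth_year" = "lower_cutoff",
         "birth_year" = "upper_cutoff"),
  match_fun = list(`==`, `>=`, `<=`)
)

# Fill unmatched rows with 0, then keep only the columns of interest
out$treatment[is.na(out$treatment)] <- 0L
out <- out[, c("country_code.x", "birth_year", "treatment")]
names(out)[1] <- "country_code"
out
```

Note that a row of a matching several reform windows would appear once per match in out, so duplicates may need handling on real data.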
