How to speed up nested for loops in R

Question

I have two datasets, and one of them is very big. I'm trying to run the following loop to create a treatment column, treatment, in dataset a. However, it is far too slow. I looked for ways to speed up for loops, such as vectorization or defining conditions outside the loops, but I'm having a hard time applying those methods because I'm conditioning on two datasets.

Here is my code:

library(dplyr)  # provides case_when()

reform_loop <- function(a, b){
  for(i in 1:nrow(a)) {
    
    for(j in 1:nrow(b)){
      if(!is.na(a[i,"treatment"])){break}
      

      a[i,"treatment"] <- case_when(a[i,"country_code"] == b[j, "country_code"] &
                            a[i,"birth_year"] >= b[j,"cohort"] &
                            a[i,"birth_year"]<= b[j,"upper_cutoff"] ~ 1,
                          
                          a[i,"country_code"] == b[j, "country_code"] &
                            a[i,"birth_year"] < b[j,"cohort"]&
                            a[i,"birth_year"]>= b[j,"lower_cutoff"] ~ 0)
      
    }
  }
  return(a)
}

a$treatment <- NA  # the loop reads this column, so it must exist beforehand
a <- reform_loop(a, b)

You can find a sample dataset below. Dataset a is individual-level data with birth-year information, and dataset b is country-level data with reform information. Within a given country (so the country_code variables must also match), treatment is 1 if birth_year is between cohort and upper_cutoff, and 0 if it is between lower_cutoff and cohort. Anything else should be NA.

# individual-level data, birth years
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
# country-level reform info with affected cohorts
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942))

The following is the result I want to get:

treatment <- c(NA, 0, 1, NA, 0, 1, NA, NA)

Unfortunately, I cannot simply merge these two datasets, since most of the countries in my dataset have more than one reform.

Any ideas on how I can speed up this code? Thank you so much in advance!

Answer 1

This is a range-based non-equi join. As such, it can be done with data.table, fuzzyjoin, or sqldf.

data.table

library(data.table)
setDT(a)
setDT(b)
b[, treatment := 1L]
a[b, treatment := i.treatment, on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff)]
a[is.na(treatment), treatment := 0L]
a
#    country_code birth_year treatment
#           <num>      <num>     <int>
# 1:            2       1920         0
# 2:            2       1930         1
# 3:            2       1940         1
# 4:           10       1970         0
# 5:           10       1980         1
# 6:           10       1990         1
# 7:           10       2000         0
# 8:            8       1910         0
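
Note that the join above codes every birth year inside [lower_cutoff, upper_cutoff] as 1 and everything else as 0, whereas the question asks for 1 from cohort upward, 0 below cohort, and NA outside the window. The following is a minimal sketch of that coding, not a definitive implementation. It assumes the sample data from the question and that the reform windows within a country do not overlap; if they do, the match that is kept may differ from the loop, which keeps the first one.

```r
library(data.table)

# sample data from the question, built directly as data.tables
a <- data.table(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year   = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b <- data.table(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort       = c(1938, 1985, 1917, 1942))

# start from NA so that rows outside every reform window stay NA
a[, treatment := NA_integer_]

# non-equi update join: only rows of a that fall inside a window of b are
# touched; x.birth_year is a's own value, i.cohort is the matched row of b
a[b, treatment := as.integer(x.birth_year >= i.cohort),
  on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff)]

a$treatment
```

The x. and i. prefixes matter here: in a non-equi join, an unprefixed birth_year inside j would refer to the join bound rather than the original value in a.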

sqldf

# assumes b already has the treatment column (b$treatment is 1L) added in the data.table step
out <- sqldf::sqldf("select a.*, b.treatment from a left join b on a.country_code=b.country_code and a.birth_year between b.lower_cutoff and b.upper_cutoff")
out$treatment[is.na(out$treatment)] <- 0L
out
#   country_code birth_year treatment
# 1            2       1920         0
# 2            2       1930         1
# 3            2       1940         1
# 4           10       1970         0
# 5           10       1980         1
# 6           10       1990         1
# 7           10       2000         0
# 8            8       1910         0

fuzzyjoin

fuzzyjoin::fuzzy_left_join(a, b, by = c("country_code" = "country_code", "birth_year" = "lower_cutoff", "birth_year" = "upper_cutoff"), match_fun = list(`==`, `>=`, `<=`))
#   country_code.x birth_year country_code.y lower_cutoff upper_cutoff cohort treatment
# 1              2       1920             NA           NA           NA     NA        NA
# 2              2       1930              2         1928         1948   1938         1
# 3              2       1940              2         1928         1948   1938         1
# 4             10       1970             NA           NA           NA     NA        NA
# 5             10       1980             10         1975         1995   1985         1
# 6             10       1990             10         1975         1995   1985         1
# 7             10       2000             NA           NA           NA     NA        NA
# 8              8       1910             NA           NA           NA     NA        NA

You then need to clean up the extra columns (and replace NA with 0).
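
One possible way to do that cleanup is sketched below in base R. It assumes the sample data from the question, a treatment column of 1L added to b as in the data.table step, and the column names shown in the output above; the object name out is only illustrative.

```r
library(fuzzyjoin)

# sample data from the question
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year   = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort       = c(1938, 1985, 1917, 1942))
b$treatment <- 1L  # marker column carried through the join

joined <- fuzzy_left_join(
  a, b,
  by = c("country_code" = "country_code",
         "birth_year" = "lower_cutoff",
         "birth_year" = "upper_cutoff"),
  match_fun = list(`==`, `>=`, `<=`)
)

# keep only the columns of a, and turn unmatched rows (NA) into 0
out <- data.frame(
  country_code = joined$country_code.x,
  birth_year   = joined$birth_year,
  treatment    = ifelse(is.na(joined$treatment), 0L, joined$treatment)
)
out
```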
