
How to speed up a nested for-loop in R

I have two datasets, and one of them is very big. I'm trying to run the following loop to create a treatment column, treatment, in dataset a. However, it is far too slow. I looked for ways to speed up for-loops, such as vectorization or defining conditions outside the loops, but I'm having a hard time applying those methods because I'm conditioning on two datasets.

Here is my code:

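library(dplyr)  # provides case_when()

reform_loop <- function(a, b) {
  # Create the treatment column if a does not have one yet;
  # without it the first a[i, "treatment"] lookup fails
  if (is.null(a$treatment)) a$treatment <- NA_real_

  for (i in seq_len(nrow(a))) {
    for (j in seq_len(nrow(b))) {
      # Stop scanning reforms once this individual has been assigned
      if (!is.na(a[i, "treatment"])) break

      a[i, "treatment"] <- case_when(
        # Same country, born in [cohort, upper_cutoff]: treated
        a[i, "country_code"] == b[j, "country_code"] &
          a[i, "birth_year"] >= b[j, "cohort"] &
          a[i, "birth_year"] <= b[j, "upper_cutoff"] ~ 1,
        # Same country, born in [lower_cutoff, cohort): control
        a[i, "country_code"] == b[j, "country_code"] &
          a[i, "birth_year"] < b[j, "cohort"] &
          a[i, "birth_year"] >= b[j, "lower_cutoff"] ~ 0
      )
    }
  }
  return(a)
}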

a <- reform_loop(a, b)

You can find a sample dataset below. Dataset a is individual-level data with birth year information, and dataset b is country-level data with some country reform information. Within a given country (so the country_code variables must also match), treatment is 1 if birth_year is between cohort and upper_cutoff, and 0 if it is between lower_cutoff and cohort. Anything else should be NA.

# individual-level data: birth years
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
# country-level reform info with affected cohorts
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942))

The following is the result I want to get:

treatment <- c(NA, 0, 1, NA, 0, 1, NA, NA)

Unfortunately, I cannot simply merge these two datasets, since most of the countries in my dataset have more than one reform.

Any ideas on how I can speed up this code? Thank you so much in advance!

This is a range-based non-equi join: rows are matched on equal country_code and on birth_year falling within a range. It can be done with data.table, fuzzyjoin, or sqldf.
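To make the idea concrete: a range-based join can be thought of as an ordinary merge on country_code (one row per individual-reform pair, so several reforms per country are fine) followed by a filter on the range. The following is only a minimal base R sketch on the question's sample data; row_id and m are helper names introduced here for illustration, and treatment is derived from the rule stated in the question.

```r
# Sample data from the question
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942))

# Helper id (hypothetical name) so results can be mapped back to rows of a
a$row_id <- seq_len(nrow(a))

# Equi-join on country: every individual paired with every reform of that country
m <- merge(a, b, by = "country_code")

# Range filter: keep pairs where the birth year falls inside the reform window
m <- m[m$birth_year >= m$lower_cutoff & m$birth_year <= m$upper_cutoff, ]

# 1 at or after the cohort year, 0 before it
m$treatment <- as.integer(m$birth_year >= m$cohort)

# Map back; individuals with no matching reform get NA, first match wins
a$treatment <- m$treatment[match(a$row_id, m$row_id)]
a$row_id <- NULL

a$treatment
# NA  0  1 NA  0  1 NA NA
```

The intermediate merge grows with the number of reforms per country, so on a very large dataset the dedicated non-equi join tools are likely the better choice.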

data.table

library(data.table)
setDT(a)
setDT(b)
b[, treatment := 1L]
a[b, treatment := i.treatment, on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff)]
a[is.na(treatment), treatment := 0L]
a
#    country_code birth_year treatment
#           <num>      <num>     <int>
# 1:            2       1920         0
# 2:            2       1930         1
# 3:            2       1940         1
# 4:           10       1970         0
# 5:           10       1980         1
# 6:           10       1990         1
# 7:           10       2000         0
# 8:            8       1910         0
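
Note that the output above marks every in-range row as 1 and every other row as 0, whereas the question asks for 1 at or after cohort, 0 before it, and NA outside any reform window. One possible way to get that, sketched here under the assumption that the rule is as the question states it, is two update joins, one per half of the window:

```r
library(data.table)

# Sample data from the question
a <- data.table(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b <- data.table(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942))

# Control group: lower_cutoff <= birth_year < cohort.
# Rows of a with no match keep NA in the newly created column.
a[b, treatment := 0L,
  on = .(country_code, birth_year >= lower_cutoff, birth_year < cohort)]

# Treated group: cohort <= birth_year <= upper_cutoff
a[b, treatment := 1L,
  on = .(country_code, birth_year >= cohort, birth_year <= upper_cutoff)]

a$treatment
# NA  0  1 NA  0  1 NA NA
```

If reform windows within a country overlap, the second join overwrites the first, so a treated match takes precedence; the original loop instead keeps the first matching reform.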

sqldf

out <- sqldf::sqldf("select a.*, b.treatment from a left join b on a.country_code=b.country_code and a.birth_year between b.lower_cutoff and b.upper_cutoff")
out$treatment[is.na(out$treatment)] <- 0L
out
#   country_code birth_year treatment
# 1            2       1920         0
# 2            2       1930         1
# 3            2       1940         1
# 4           10       1970         0
# 5           10       1980         1
# 6           10       1990         1
# 7           10       2000         0
# 8            8       1910         0

fuzzyjoin

fuzzyjoin::fuzzy_left_join(a, b, by = c("country_code" = "country_code", "birth_year" = "lower_cutoff", "birth_year" = "upper_cutoff"), match_fun = list(`==`, `>=`, `<=`))
#   country_code.x birth_year country_code.y lower_cutoff upper_cutoff cohort treatment
# 1              2       1920             NA           NA           NA     NA        NA
# 2              2       1930              2         1928         1948   1938         1
# 3              2       1940              2         1928         1948   1938         1
# 4             10       1970             NA           NA           NA     NA        NA
# 5             10       1980             10         1975         1995   1985         1
# 6             10       1990             10         1975         1995   1985         1
# 7             10       2000             NA           NA           NA     NA        NA
# 8              8       1910             NA           NA           NA     NA        NA

After that, you need to clean up the extra columns (and fill in 0 for NA).
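
The cleanup step might look roughly like the following. This is a sketch on the question's sample data, not the answerer's code: the names joined and out are introduced here, and treatment is derived from cohort so that unmatched rows stay NA, as the question requests, rather than being filled with 0.

```r
library(fuzzyjoin)

# Sample data from the question
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942))

# Same join as above: equal country, birth year inside the reform window
joined <- fuzzy_left_join(
  a, b,
  by = c("country_code" = "country_code",
         "birth_year" = "lower_cutoff",
         "birth_year" = "upper_cutoff"),
  match_fun = list(`==`, `>=`, `<=`)
)

# cohort is NA for unmatched rows, so the comparison yields NA there
joined$treatment <- as.integer(joined$birth_year >= joined$cohort)

# Keep only the individual-level columns plus the new treatment flag
out <- data.frame(country_code = joined$country_code.x,
                  birth_year = joined$birth_year,
                  treatment = joined$treatment)
out$treatment
```

On the sample data each individual matches at most one reform; with overlapping reforms the join would return one row per match, and duplicates would need to be resolved before this step.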
