How to speed up a nested for-loop in R
I have two datasets, and one of them is very big. I'm trying to run the following loop to create a treatment column, treatment, in the dataset a. However, it is way too slow. I looked for ways to speed up for-loops, such as vectorization or defining conditions outside the loops, but I'm having a hard time applying those methods since I'm conditioning on two datasets.
Here is my code:
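library(dplyr)  # for case_when()

reform_loop <- function(a, b){
  # the loop reads a[i, "treatment"], so the column must exist before it starts
  if(!"treatment" %in% names(a)){a$treatment <- NA_real_}
  for(i in 1:nrow(a)) {
    for(j in 1:nrow(b)){
      # stop scanning reforms once this row has been assigned
      if(!is.na(a[i,"treatment"])){break}
      a[i,"treatment"] <- case_when(a[i,"country_code"] == b[j, "country_code"] &
                                      a[i,"birth_year"] >= b[j,"cohort"] &
                                      a[i,"birth_year"] <= b[j,"upper_cutoff"] ~ 1,
                                    a[i,"country_code"] == b[j, "country_code"] &
                                      a[i,"birth_year"] < b[j,"cohort"] &
                                      a[i,"birth_year"] >= b[j,"lower_cutoff"] ~ 0)
    }
  }
  return(a)
}
a <- reform_loop(a, b)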
You can find a sample dataset below. Dataset a is individual-level data with birth year information, and dataset b is country-level data with some country reform information. Within a given country (so the country_code variables must also match), treatment is 1 if birth_year is between cohort and upper_cutoff (inclusive), and 0 if it is at least lower_cutoff but below cohort. Anything else should be NA.
#individual level data, birth years
a <- data.frame (country_code = c(2,2,2,10,10,10,10,8),
birth_year = c(1920,1930,1940,1970,1980,1990, 2000, 1910))
#country level reform info with affected cohorts
b <- data.frame(country_code = c(2,10,10,11),
lower_cutoff = c(1928, 1975, 1907, 1934),
upper_cutoff = c(1948, 1995, 1927, 1948),
cohort = c(1938, 1985, 1917, 1942))
The following is the result I want to get:
treatment <- c(NA, 0, 1, NA, 0, 1, NA, NA)
Unfortunately, I cannot merge these two datasets since most of the countries in my dataset have more than one reform.
Any ideas on how I can speed up this code? Thank you so much in advance!
This is a range-based non-equi join. As such, it can be done with data.table, fuzzyjoin, or sqldf.
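To make the matching rule concrete, here is a minimal base R sketch that loops only over the rows of b (the small reform table) and handles all rows of a at once with vectorized comparisons. The helper name assign_treatment is hypothetical and not part of any package. It assumes b is much smaller than a, and it keeps the first matching reform for each row, as the break in the question's loop does.

```r
# Sample data from the question
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year   = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort       = c(1938, 1985, 1917, 1942))

# Hypothetical helper: one pass per reform, vectorized over all individuals
assign_treatment <- function(a, b) {
  treatment <- rep(NA_integer_, nrow(a))
  for (j in seq_len(nrow(b))) {
    same_country <- a$country_code == b$country_code[j]
    treated <- same_country &
      a$birth_year >= b$cohort[j] & a$birth_year <= b$upper_cutoff[j]
    control <- same_country &
      a$birth_year < b$cohort[j] & a$birth_year >= b$lower_cutoff[j]
    # only fill rows that no earlier reform has assigned
    unassigned <- is.na(treatment)
    treatment[which(unassigned & treated)] <- 1L
    treatment[which(unassigned & control)] <- 0L
  }
  a$treatment <- treatment
  a
}

res <- assign_treatment(a, b)
res$treatment
```

The cost is one vectorized pass over a per reform, rather than one scalar comparison per pair of rows. The result is returned as a new data frame, so a itself is left unchanged.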
library(data.table)
setDT(a)
setDT(b)
# flag every reform window with 1
b[, treatment := 1L]
# update join: rows of a that fall inside a window [lower_cutoff, upper_cutoff] receive 1
a[b, treatment := i.treatment, on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff)]
# rows of a with no matching window get 0
a[is.na(treatment), treatment := 0L]
a
#    country_code birth_year treatment
#           <num>      <num>     <int>
# 1:            2       1920         0
# 2:            2       1930         1
# 3:            2       1940         1
# 4:           10       1970         0
# 5:           10       1980         1
# 6:           10       1990         1
# 7:           10       2000         0
# 8:            8       1910         0
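The join above codes every birth year inside a reform window as 1 and everything else as 0. The question asks for a different coding: 1 from cohort up to upper_cutoff, 0 from lower_cutoff up to just below cohort, and NA outside any window. One possible adaptation is sketched below. The names a_dt and b_dt are illustrative fresh copies of the sample data, and the sketch assumes that reform windows within a country do not overlap (with overlapping windows, the last matching row of b would win).

```r
library(data.table)

# Illustrative copies of the question's sample data
a_dt <- data.table(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                   birth_year   = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b_dt <- data.table(country_code = c(2, 10, 10, 11),
                   lower_cutoff = c(1928, 1975, 1907, 1934),
                   upper_cutoff = c(1948, 1995, 1927, 1948),
                   cohort       = c(1938, 1985, 1917, 1942))

# Same non-equi join, but the assigned value depends on which side of
# `cohort` the birth year falls. The x. prefix asks for the original
# birth_year from a_dt, and i.cohort is the cohort of the matched reform.
a_dt[b_dt,
     treatment := as.integer(x.birth_year >= i.cohort),
     on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff)]

# Rows that match no window are never updated and stay NA
a_dt$treatment
```

Because unmatched rows are left as NA, no fill step is needed for this coding.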
# sqldf: left join on country and window (b needs the treatment column added above)
out <- sqldf::sqldf("
  select a.*, b.treatment
  from a
  left join b
    on a.country_code = b.country_code
   and a.birth_year between b.lower_cutoff and b.upper_cutoff")
# rows with no matching window get 0
out$treatment[is.na(out$treatment)] <- 0L
out
#   country_code birth_year treatment
# 1            2       1920         0
# 2            2       1930         1
# 3            2       1940         1
# 4           10       1970         0
# 5           10       1980         1
# 6           10       1990         1
# 7           10       2000         0
# 8            8       1910         0
fuzzyjoin::fuzzy_left_join(
  a, b,
  by = c("country_code" = "country_code", "birth_year" = "lower_cutoff", "birth_year" = "upper_cutoff"),
  match_fun = list(`==`, `>=`, `<=`)
)
#   country_code.x birth_year country_code.y lower_cutoff upper_cutoff cohort treatment
# 1              2       1920             NA           NA           NA     NA        NA
# 2              2       1930              2         1928         1948   1938         1
# 3              2       1940              2         1928         1948   1938         1
# 4             10       1970             NA           NA           NA     NA        NA
# 5             10       1980             10         1975         1995   1985         1
# 6             10       1990             10         1975         1995   1985         1
# 7             10       2000             NA           NA           NA     NA        NA
# 8              8       1910             NA           NA           NA     NA        NA
With fuzzyjoin you then need to drop the extra columns (and replace NA with 0 in treatment).