More efficient version of this R loop
I'm used to Python and JS and am pretty new to R, but I'm enjoying it for data analysis. I wanted to create a new field in my data frame based on some if/else logic, and tried to do it in a standard, procedural way:
for (i in 1:nrow(df)) {
  if (is.na(df$First_Payment_date[i]) == TRUE) {
    df$User_status[i] = "User never paid"
  } else if (df$Payment_Date[i] >= df$First_Payment_date[i]) {
    df$User_status[i] = "Paying user"
  } else if (df$Payment_Date[i] < df$First_Payment_date[i]) {
    df$User_status[i] = "Attempt before first payment"
  } else {
    df$User_status[i] = "Error"
  }
}
But it was CRAZY slow. I tried running this on a data frame of ~3 million rows, and it took way, way too long. Any tips on the "R" way of doing this?
Note that the df$Payment_Date and df$First_Payment_date fields are formatted as dates.
If you initialize to "Error" and then overwrite for the enumerated conditions using logical indexing, this should be much faster. Those if(){}else{} statements for every row are killing you.
df$User_status <- "Error"
df$User_status[ is.na(df$First_Payment_date) ] <- "User never paid"
df$User_status[ df$Payment_Date >= df$First_Payment_date ] <- "Paying user"
df$User_status[ df$Payment_Date < df$First_Payment_date ] <- "Attempt before first payment"
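To see what the logical-indexing version does with missing dates, here is a small sketch on a made-up three-row data frame (the name `toy` and its dates are assumptions for illustration, not the asker's data). A comparison against an NA date yields NA, and R treats NA in a logical index as "not selected" when a single value is assigned, so the "User never paid" label set earlier is left alone.

```r
# Hypothetical toy data: row 1 has no first payment date.
toy <- data.frame(
  First_Payment_date = as.Date(c(NA, "2014-01-10", "2014-03-01")),
  Payment_Date       = as.Date(c("2014-02-01", "2014-02-01", "2014-02-01"))
)

# Same four steps as above, applied to the toy data.
toy$User_status <- "Error"
toy$User_status[is.na(toy$First_Payment_date)] <- "User never paid"
toy$User_status[toy$Payment_Date >= toy$First_Payment_date] <- "Paying user"
toy$User_status[toy$Payment_Date < toy$First_Payment_date] <- "Attempt before first payment"

print(toy$User_status)
```

Each assignment touches the whole column at once, which is why this avoids the per-row overhead of the loop.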
I am benchmarking data.frame and data.table on a relatively large dataset.
First we generate some data.
set.seed(1234)
library(data.table)
df = data.frame(First_Payment_date = sample(c(NA, 1:100), 1000000, replace = TRUE),
                Payment_Date = sample(1:100, 1000000, replace = TRUE))
dt = data.table(df)
Then set up the benchmark. I am comparing @BondedDust's answer with its data.table equivalent. I have slightly modified (debugged) his code.
library(microbenchmark)
test_df = function(){
  df$User_status <- "Error"
  df$User_status[ is.na(df$First_Payment_date) ] <- "User never paid"
  df$User_status[ df$Payment_Date >= df$First_Payment_date ] <- "Paying user"
  df$User_status[ df$Payment_Date < df$First_Payment_date ] <- "Attempt before first payment"
}
test_dt = function(){
  dt[, User_status := "Error"]
  dt[is.na(First_Payment_date), User_status := "User never paid"]
  dt[Payment_Date >= First_Payment_date, User_status := "Paying user"]
  dt[Payment_Date < First_Payment_date, User_status := "Attempt before first payment"]
}
microbenchmark(test_df(), test_dt(), times=10)
The result: data.table is about 4x faster than data.frame on the generated 1 million rows of data.
> microbenchmark(test_df(), test_dt(), times=10)
Unit: milliseconds
expr min lq median uq max neval
test_df() 247.29182 256.69067 287.89768 319.34873 330.33915 10
test_dt() 66.74265 69.42574 70.27826 72.93969 80.89847 10
Note: data.frame is faster than data.table for small datasets (say, 10,000 rows).
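To check the small-data note on your own machine, the benchmark can be re-run at 10,000 rows. The sketch below assumes the data.table and microbenchmark packages are installed; the names `n`, `small_df`, `small_dt`, `label_df` and `label_dt` are made up for this example. It first confirms that both variants produce the same labels, then times them. Timings depend on hardware and package versions, so no expected numbers are given.

```r
library(data.table)
library(microbenchmark)

set.seed(1234)
n <- 10000  # assumed "small" size, taken from the note above
small_df <- data.frame(
  First_Payment_date = sample(c(NA, 1:100), n, replace = TRUE),
  Payment_Date       = sample(1:100, n, replace = TRUE)
)
small_dt <- as.data.table(small_df)

# data.frame variant: works on a local copy and returns it.
label_df <- function(d) {
  d$User_status <- "Error"
  d$User_status[is.na(d$First_Payment_date)] <- "User never paid"
  d$User_status[d$Payment_Date >= d$First_Payment_date] <- "Paying user"
  d$User_status[d$Payment_Date < d$First_Payment_date] <- "Attempt before first payment"
  d
}

# data.table variant: := updates the table by reference.
label_dt <- function(d) {
  d[, User_status := "Error"]
  d[is.na(First_Payment_date), User_status := "User never paid"]
  d[Payment_Date >= First_Payment_date, User_status := "Paying user"]
  d[Payment_Date < First_Payment_date, User_status := "Attempt before first payment"]
  d
}

# Both variants should agree before their timings are compared.
stopifnot(identical(label_df(small_df)$User_status,
                    label_dt(small_dt)$User_status))

microbenchmark(label_df(small_df), label_dt(small_dt), times = 10)
```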
I'm not certain that this will speed it up a lot, but you should see some improvement over the for loop you had before. The else's aren't really necessary under these conditions.
Also, R has functions that act as for loops, and other types of loops. See ?apply.
Give this a shot, see how it works. I can't test it since we don't have your data.
## allocate a vector, fill it with "Error"
df$User_status <- rep("Error", nrow(df))
## sapply() returns one status per row, so assign its result back
df$User_status <- sapply(seq_len(nrow(df)), function(i) {
  ## check for NA first: if() fails on a comparison that yields NA
  ## (this assumes Payment_Date itself is never NA)
  if (is.na(df$First_Payment_date[i])) return("User never paid")
  if (df$Payment_Date[i] >= df$First_Payment_date[i]) return("Paying user")
  if (df$Payment_Date[i] < df$First_Payment_date[i]) return("Attempt before first payment")
  df$User_status[i]  # nothing matched: keep "Error"
})
The usual way to handle this sort of thing is via ifelse.
df$User_status <- with(df,
  ifelse(is.na(First_Payment_date), "User never paid",
    ifelse(Payment_Date >= First_Payment_date, "Paying user",
      ifelse(Payment_Date < First_Payment_date, "Attempt before first payment",
        "Error"))))
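One detail of the nested ifelse approach is worth checking on a small example: ifelse returns NA wherever its test is NA, so a row with a missing Payment_Date comes out as NA instead of "Error". The sketch below uses a made-up four-row data frame (`toy` and its dates are assumptions for illustration) and shows one possible way to map those leftover NAs to "Error" afterwards.

```r
# Hypothetical toy data: row 1 lacks a first payment date, row 4 lacks a payment date.
toy <- data.frame(
  First_Payment_date = as.Date(c(NA, "2014-01-10", "2014-03-01", "2014-01-10")),
  Payment_Date       = as.Date(c("2014-02-01", "2014-02-01", "2014-02-01", NA))
)

status <- with(toy,
  ifelse(is.na(First_Payment_date), "User never paid",
    ifelse(Payment_Date >= First_Payment_date, "Paying user",
      ifelse(Payment_Date < First_Payment_date, "Attempt before first payment",
        "Error"))))

print(status)  # row 4 is NA because both comparisons are NA

# One option: treat anything still unlabelled as an error.
status[is.na(status)] <- "Error"
toy$User_status <- status
print(toy$User_status)
```

Whether such rows should be labelled "Error" or handled some other way depends on what a missing Payment_Date means in the real data.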