
More efficient version of this R loop

I'm used to Python and JS, and pretty new to R, but I'm enjoying it for data analysis. I wanted to create a new field in my data frame based on some if/else logic, and tried to do it in a standard, procedural way:

for (i in 1:nrow(df)) {
  if (is.na(df$First_Payment_date[i]) == TRUE) {
    df$User_status[i] = "User never paid"
  } else if (df$Payment_Date[i] >= df$First_Payment_date[i]) {
    df$User_status[i] = "Paying user"
  } else if (df$Payment_Date[i] < df$First_Payment_date[i]) {
    df$User_status[i] = "Attempt before first payment"
  } else {
    df$User_status[i] = "Error"
  }
}

But it was CRAZY slow. I tried running this on a data frame of ~3 million rows, and it took way, way too long. Any tips on the "R" way of doing this?

Note that the df$Payment_Date and df$First_Payment_date fields are formatted as dates.

If you initialize to "Error" and then overwrite for each of the enumerated conditions using logical indexing, this should be much faster. Those if(){}else{} statements for every row are killing you.

df$User_status <- "Error"
df$User_status[ is.na(df$First_Payment_date) ] <- "User never paid"
df$User_status[ df$Payment_Date >= df$First_Payment_date ] <- "Paying user"
df$User_status[ df$Payment_Date < df$First_Payment_date ] <- "Attempt before first payment"
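To see how this behaves when First_Payment_date is missing, here is a minimal sketch on a made-up four-row data frame (the name toy and its dates are assumptions for illustration, not the asker's data). In base R, an NA in a logical index is skipped when the assigned value has length one, so the rows labelled "User never paid" are left alone by the two comparisons that follow.

```r
# Toy data (hypothetical): the first row has no first payment date.
toy <- data.frame(
  First_Payment_date = as.Date(c(NA, "2014-01-10", "2014-01-10", "2014-01-10")),
  Payment_Date       = as.Date(c("2014-01-05", "2014-01-10", "2014-02-01", "2014-01-01"))
)

# Start from the fallback label, then overwrite by logical index.
toy$User_status <- "Error"
toy$User_status[is.na(toy$First_Payment_date)] <- "User never paid"
# Both comparisons are NA where First_Payment_date is NA; those rows are skipped.
toy$User_status[toy$Payment_Date >= toy$First_Payment_date] <- "Paying user"
toy$User_status[toy$Payment_Date < toy$First_Payment_date] <- "Attempt before first payment"

print(toy$User_status)
# "User never paid" "Paying user" "Paying user" "Attempt before first payment"
```

Each of the three conditions is evaluated once over the whole column, which is where the speedup over the row-by-row loop comes from.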

I benchmarked data.frame against data.table on a relatively large dataset.

First we generate some data.

set.seed(1234)
library(data.table)
df = data.frame(First_Payment_date=sample(c(NA,1:100), 1000000, replace=TRUE),
                Payment_Date=sample(1:100, 1000000, replace=TRUE))
dt = data.table(df)

Then set up the benchmark. I am comparing @BondedDust's answer with its data.table equivalent. I have slightly modified (debugged) his code.

library(microbenchmark)

test_df = function(){
    df$User_status <- "Error"
    df$User_status[ is.na(df$First_Payment_date) ] <- "User never paid"
    df$User_status[ df$Payment_Date >= df$First_Payment_date ] <- "Paying user"
    df$User_status[ df$Payment_Date < df$First_Payment_date ] <- "Attempt before first payment"
}

test_dt = function(){
    dt[, User_status := "Error"]
    dt[is.na(First_Payment_date), User_status := "User never paid"]
    dt[Payment_Date >= First_Payment_date, User_status := "Paying user"]
    dt[Payment_Date < First_Payment_date, User_status := "Attempt before first payment"]
}

microbenchmark(test_df(), test_dt(), times=10)

The result: data.table is about 4x faster than data.frame on the generated 1-million-row data.

> microbenchmark(test_df(), test_dt(), times=10)
Unit: milliseconds
      expr       min        lq    median        uq       max neval
 test_df() 247.29182 256.69067 287.89768 319.34873 330.33915    10
 test_dt()  66.74265  69.42574  70.27826  72.93969  80.89847    10

Note

data.frame is faster than data.table for small datasets (say, 10,000 rows).
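As a side note to the data.table version above, the four update steps can also be written as a single pass with fcase(), which is available in data.table 1.13.0 and later. This is only a sketch under that version assumption, on a made-up table named toy_dt; it was not part of the benchmark, so no speed claim is attached to it.

```r
library(data.table)  # fcase() assumes data.table >= 1.13.0

# Toy data (hypothetical), using integers as in the benchmark above.
toy_dt <- data.table(
  First_Payment_date = c(NA, 10L, 10L, 10L),
  Payment_Date       = c(5L, 10L, 50L, 1L)
)

# Conditions are checked in order; the first one that is TRUE decides the label.
toy_dt[, User_status := fcase(
  is.na(First_Payment_date),          "User never paid",
  Payment_Date >= First_Payment_date, "Paying user",
  Payment_Date < First_Payment_date,  "Attempt before first payment",
  default = "Error"
)]

print(toy_dt)
```

Putting the is.na() test first means rows with a missing First_Payment_date are already labelled before the comparisons are consulted.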

I'm not certain that this will speed it up a lot, but you should see some improvement over the for loop you had before. The else's aren't really necessary under these conditions.

Also, R has functions that act as for loops, and as other types of loops. See ?apply.

Give this a shot, see how it works. I can't test it since we don't have your data.

df$User_status <- rep("Error", nrow(df))
      ## allocate a vector, fill it with "Error"

df$User_status <- sapply(seq_len(nrow(df)), function(i){
    first <- df$First_Payment_date[i]
    paid  <- df$Payment_Date[i]

    ## test for NA first: if() stops with an error on an NA condition
    if(is.na(first)) return("User never paid")
    if(is.na(paid)) return(df$User_status[i])   ## keep "Error"

    if(paid >= first) return("Paying user")

    if(paid < first) return("Attempt before first payment")

    df$User_status[i]   ## nothing matched, keep "Error"
    })

The usual way to handle this sort of thing is via ifelse.

df$User_status <- with(df,
    ifelse(is.na(First_Payment_date), "User never paid",
    ifelse(Payment_Date >= First_Payment_date, "Paying user",
    ifelse(Payment_Date < First_Payment_date, "Attempt before first payment",
    "Error"))))
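For a quick check of the nested ifelse() on known inputs, here is a small sketch; the wrapper user_status() and the toy data frame are hypothetical names introduced only for this example. It also shows one behaviour worth knowing: ifelse() returns NA where its test is NA, so a row with a missing Payment_Date comes back as NA instead of reaching the final "Error" branch.

```r
# Hypothetical helper: the same nested ifelse(), applied to any data frame
# that has First_Payment_date and Payment_Date columns.
user_status <- function(d) {
  with(d,
    ifelse(is.na(First_Payment_date), "User never paid",
    ifelse(Payment_Date >= First_Payment_date, "Paying user",
    ifelse(Payment_Date < First_Payment_date, "Attempt before first payment",
    "Error"))))
}

# Toy data (hypothetical): the last row has a missing Payment_Date.
toy <- data.frame(
  First_Payment_date = as.Date(c(NA, "2014-01-10", "2014-01-10", "2014-01-10")),
  Payment_Date       = as.Date(c("2014-01-05", "2014-02-01", "2014-01-01", NA))
)

toy$User_status <- user_status(toy)
print(toy$User_status)
# "User never paid" "Paying user" "Attempt before first payment" NA
```

If an explicit "Error" label is wanted for such rows, one option is to replace the remaining NA values afterwards, for example with toy$User_status[is.na(toy$User_status)] <- "Error".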
