简体   繁体   中英

R loop using data.table

I have a dataset of about 3 million rows. I created a small example shown below:

ex <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63020,63020,63020,63020,63020,63020,63020,63020,61112,61112))

I have another dataset of 42 rows, but have produced a smaller example:

add_on <- data.table(primary = c(63020,61112), secondary=c(63035,63445))

I need to relabel certain rows on the "trigger_cpt" column (grouped by eoc) if the trigger_cpt value happens to be one of the values in the primary column of the dataset and if there is a proc1 value that is the secondary value in the add_on dataset. If it meets the criteria, then trigger_cpt should be relabeled to the secondary code.

I was first typing everything manually, ex[,trigger_new:= if(any(trigger_cpt == '63020' & proc1 == '63035')) 63035 else trigger_cpt, eoc]

and then decided to do a for loop

for(i in 1:nrow(add_on)){
  ex[,trigger_new2 := if(any(trigger_cpt == add_on[i,1] & proc1 == add_on[i,2])) add_on[i,2] else trigger_cpt, eoc]
}

however, now that I'm trying this code on my 3 million row dataset, it's been taking a long time to run. I'm not sure if there's a better approach or if there's any modifications I could make to my current code?

any help would be greatly appreciated!

expected output:

ex_final <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63035,63035,63035,63035,63035,63020,63020,63020,61112,61112))

Here is one way which produces a data.table that sets all of trigger_cpt to the secondary value if a match is found in the grouped set:


ex2 <- add_on[ex, , on=.(primary=trigger_cpt)][ , trigger_new := fifelse( secondary %in% proc1, secondary, NA_real_ ), by=eoc ]
ex.final  <- ex2[ , trigger_cpt := fcoalesce( trigger_new, primary ) ][, .(eoc,proc1,trigger_cpt) ]

Output:


> ex.final
    eoc proc1 trigger_cpt
 1:   1 63035       63035
 2:   1 63020       63035
 3:   1 92344       63035
 4:   1 63035       63035
 5:   1 27567       63035
 6:   2 63020       63020
 7:   2  1234       63020
 8:   2 55678       63020
 9:   3 61112       61112
10:   3  1236       61112

Furthermore, if feasable (it comes at a cost), I consider using setkey unless it causes more harm than good. (initial processing might make it not worth it). It speeds up downstream operations and it might make join code a lot cleaner . data.table code can be hard enough as it is. Thus:


setkey(ex, trigger_cpt )
setkey(add_on, primary )

## can now do this:
add_on[ex]

## instead of this:
add_on[ex, , on=.(primary=trigger_cpt)]

## .. in the code above.

... Furthermore...

If you are reworking the steps of above, you will notice that add_on[ex] (which is the somewhat backwards way to do left joins in data.table), leaves you with the key column names of add_on , and not of ex . This is not important, as long as you are aware and rename columns apropriately in the end, but an alternative way of joining the data might be this:


ex2 <- merge( ex, add_on, by.x="trigger_cpt", by.y="primary" )
## and then work your way till the end with what this gives you

Based on the expected output

ex[, trigger_new := first(proc1), eoc]




ex
    eoc proc1 trigger_cpt trigger_new
 1:   1 63035       63020       63035
 2:   1 63020       63020       63035
 3:   1 92344       63020       63035
 4:   1 63035       63020       63035
 5:   1 27567       63020       63035
 6:   2 63020       63020       63020
 7:   2  1234       63020       63020
 8:   2 55678       63020       63020
 9:   3 61112       61112       61112
10:   3  1236       61112       61112

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM