data.table apply function on each column

I think I am pretty close to a solution, but I am struggling to combine lapply with data.table. I have read a lot about lapply and found some examples with data.table, but this way of thinking is new to me and it's driving me nuts...

This is my data.table

library(data.table)

cells <- c(150, 1,1980,1,1,1,0,0,0,1,2004,3,
       99 , 1,1980,1,1,1,1,0,0,0,2004,4,
       899, 1,1980,0,1,0,1,1,1,1,2007,4,
       789, 1,1982,1,1,1,0,1,1,1,2004,3 )
colname <- c("number","sex", "birthy", "2004","2005", "2006", "2007", "2008", "2009","2010","begy","SeqLen")
rowname <- c("2","3","4","5")
y <- matrix(cells, nrow=4, ncol=12, byrow=TRUE, dimnames = list(rowname,colname))
y <- data.table(y, keep.rownames = TRUE)

I want to step through a vector of column names

cols <- as.character(2004:2010)

Doing the following operation on just one column works fine!

vec <- "2005"
y[,  (vec) := ifelse((vec) < as.numeric(begy),0, ifelse( ((vec) > as.numeric(begy) + as.numeric(SeqLen) -1) ,0,1)) ]

Creating a function and stepping through the vector seems like a good solution, but how? I found this...

dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]

but where can I put my ifelse in this example?

I also read about using a for loop with the set function, like this...

for (j in cols) set(dt, j = j, value = ifelse((dt[[j]]) < as.numeric(dt[[begy]]),0, ifelse( (dt[[j]] > as.numeric(dt[[begy]]) + as.numeric(dt[[SeqLen]]) -1) ,0,1)))

but this is nonsense.

Thanks, Alina

It seems you want to set each year column to 1 if that year lies between begy and begy + SeqLen - 1 for the row, and to 0 otherwise. Here is another way to do this:

y[order(rn), 
    (grep("^20", names(y), value=TRUE)) := 
        dcast(y[, seq(begy, by=1, length.out=SeqLen), by=.(rn)], rn ~ V1, length)[,-1L]]
y

output:

   rn number sex birthy 2004 2005 2006 2007 2008 2009 2010 begy SeqLen
1:  2    150   1   1980    1    1    1    0    0    0    0 2004      3
2:  3     99   1   1980    1    1    1    1    0    0    0 2004      4
3:  4    899   1   1980    0    0    0    1    1    1    1 2007      4
4:  5    789   1   1982    1    1    1    0    0    0    0 2004      3

Explanation:

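Create the sequence of active years for each row (begy, begy + 1, ..., SeqLen values in total), then use dcast with length to turn that sequence into 0/1 indicator columns, one per year. Use the result to overwrite the year columns.

dcast returns its rows sorted by rn, so order(rn) on the left-hand side makes sure each row of the dcast result is assigned to the matching row of y.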
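
To see what the dcast step produces, it may help to run the two pieces separately. This is only a sketch: the table below keeps just rn, begy and SeqLen from the question, and the names long and wide are made up for illustration.

```r
library(data.table)

# Reduced copy of the question's table (only the columns this step needs)
y <- data.table(rn     = c("2", "3", "4", "5"),
                begy   = c(2004, 2004, 2007, 2004),
                SeqLen = c(3, 4, 4, 3))

# Step 1: one row per (rn, active year); the unnamed result column is called V1
long <- y[, seq(begy, by = 1, length.out = SeqLen), by = .(rn)]

# Step 2: spread the years into columns; length counts the rows in each cell,
# so a cell is 1 where the year is active and 0 where it is missing
wide <- dcast(long, rn ~ V1, length)
print(wide)
```

Because wide comes back sorted by rn, it lines up with y[order(rn)], and dropping its first column with [, -1L] leaves only the seven year columns.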


Frank's method is way faster:

y[, as.character(2004:2010) := 
    lapply(2004:2010, function(x) as.integer(between(x, begy, begy + SeqLen - 1)))] 
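
For completeness, the same rule can probably also be written with the for/set pattern from the question. This is a sketch rather than part of the original answers. Two things differ from the attempt in the question: column names inside [[ ]] have to be quoted ("begy", not begy), and the comparison uses the year taken from the column name instead of the values stored in that column. The helper names yr and last are made up for illustration.

```r
library(data.table)

# Example table from the question
cells <- c(150, 1, 1980, 1, 1, 1, 0, 0, 0, 1, 2004, 3,
            99, 1, 1980, 1, 1, 1, 1, 0, 0, 0, 2004, 4,
           899, 1, 1980, 0, 1, 0, 1, 1, 1, 1, 2007, 4,
           789, 1, 1982, 1, 1, 1, 0, 1, 1, 1, 2004, 3)
colname <- c("number", "sex", "birthy", "2004", "2005", "2006", "2007",
             "2008", "2009", "2010", "begy", "SeqLen")
y <- data.table(matrix(cells, nrow = 4, ncol = 12, byrow = TRUE,
                       dimnames = list(c("2", "3", "4", "5"), colname)),
                keep.rownames = TRUE)

cols <- as.character(2004:2010)

for (j in cols) {
  yr   <- as.numeric(j)                      # the year comes from the column name
  last <- y[["begy"]] + y[["SeqLen"]] - 1    # last active year of each row
  # 1 where begy <= yr <= last, 0 otherwise; set() updates y by reference
  set(y, j = j, value = as.integer(yr >= y[["begy"]] & yr <= last))
}
print(y)
```

The loop touches one column per iteration, so it reads much like the single-column version in the question while avoiding the nested ifelse.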
