简体   繁体   中英

Applying a function to the whole data table by groups

Let's suppose the following data table:

a = runif(40)
b = c(rep(NA,5), runif(5), rep(NA,3),runif(3),NA,runif(3), c(rep(NA,3), runif(7), rep(NA,4), runif(3), NA,NA, runif(1)))
c = rep(1:4,each=10)
DT = data.table(a,b,c)

I want to eliminate the rows with the first NA values in b for every unique value in c (first NAs when c==1, when c==2...), but not the rows with the NAs that come after.

I can do it by using a loop:

for(i in unique(DT$c))
{
  first_NA = which(DT$c==i)[1]
  last_NA = which(!is.na(DT[,b]) & DT$c==i)[1] - 1
  DT = DT[-c(first_NA:last_NA)]
}

But I wonder if there is any simpler way of doing this by using a function for the whole data table using groups (by in data table or groupby in dplyr), without just applying it to columns.

Thank you!

You have to mark these lines then keep those not marked.

# mark values
DT <- DT[, by=c, 
  flag := is.na(b[1]) # first value of b is NA
    & (seq_len(.N)==1) # only for first value
]

# discard marked
DT <- DT[(!flag)]

# remove flag
DT[, flag:=NULL]

or in a row

DT[, by=c, flag:=is.na(b[1]) & (seq_len(.N)==1)][(!flag)][, flag:=NULL]

You can filter out the first NA values in b through

DT[, .SD[cumsum( !is.na(b) ) != 0], by = .(c)]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM