
R data.table: detect pattern of values within each group

Say I have a data.table like this:

library(data.table)
set.seed(10)
DT <- data.table(group = rep(c("a","b","c"), each=5), date = rep(1:5,3), value = sample(c(95:105,""),15, replace=TRUE))

Within each group, I would like a simple way to check whether the value column contains a "" (empty character), or a run of empty characters, that is both preceded and followed by a value.

So this is fine: "", 95, 103, ... (the empty character comes first within the group), but the patterns below are examples of "missing data" that I would like to detect:

95, "", 103, ... (empty character in the middle)

95, "", "", 103, ... (several empty characters in the middle)

95, 103, "" (empty character at the end)

So, for the table below, I would like to get the offending row in group a; if several groups are affected, I should get all of those groups (or rows).

    group date value
 1:     a    1   105
 2:     a    2   103
 3:     a    3   104
 4:     a    4      
 5:     a    5   101
 6:     b    1   102
 7:     b    2   100
 8:     b    3   101
 9:     b    4    97
10:     b    5   102
11:     c    1   104
12:     c    2   101
13:     c    3   104
14:     c    4    96
15:     c    5   102

Edit: What I need to do is select the rows that show the wrong pattern (empty string(s) in the middle or at the end of a group), so that I can detect errors in a large dataset. For the table in my example, the desired output is the 4th row, since it has a "missing value" (an empty character between values):

   group date value
1:     a    4

(If there were more unwanted rows, of course, I would like to get all of them)
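The rule described above can be made concrete on plain character vectors. The following is only a minimal sketch: `flag_gaps` is a hypothetical helper name, and it assumes each vector is already in date order. It marks every "" that comes after at least one non-empty value, so it covers the "middle" and "end" patterns and leaves leading empty values alone.

```r
# Hypothetical helper: TRUE for every "" that follows at least one
# non-empty value in x (x is assumed to be in date order).
flag_gaps <- function(x) {
  seen_value <- cumsum(x != "") > 0  # TRUE from the first non-empty value onwards
  x == "" & seen_value
}

flag_gaps(c("", "95", "103"))       # FALSE FALSE FALSE     (leading empty is fine)
flag_gaps(c("95", "", "103"))       # FALSE  TRUE FALSE     (empty in the middle)
flag_gaps(c("95", "", "", "103"))   # FALSE  TRUE  TRUE FALSE
flag_gaps(c("95", "103", ""))       # FALSE FALSE  TRUE     (empty at the end)
```

The four calls mirror the four patterns listed in the question.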

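If your data.table is not already sorted by the 'date' column, you can number the rows within each group in date order and then select the empty values that are not first in their group:

# position of each row within its group, in date order
DT[order(date), order := seq_len(.N), by = group]
# empty values anywhere except the first position of a group
DT[value == "" & order > 1L]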

output:

   group date value order
1:     a    4           4

The data is the same as yours:

library(data.table)
set.seed(10)
DT <- data.table(group = rep(c("a","b","c"), each=5), date = rep(1:5,3), 
                 value = sample(c(95:105,""),15, replace=TRUE))
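One caveat about the position-based filter: it returns every empty value that is not the first row of its group. The sketch below uses hypothetical toy data (the names `toy`, `pos` and `by_position` are made up for illustration) and assumes a group may start with more than one empty value. In that case the second leading "" is returned as well, which may or may not be what is wanted.

```r
library(data.table)

# Hypothetical toy data: group "x" starts with two empty values,
# group "y" has a gap of two empty values in the middle.
toy <- data.table(
  group = rep(c("x", "y"), each = 4),
  date  = rep(1:4, 2),
  value = c("", "", "95", "96", "95", "", "", "103")
)

# Number the rows within each group in date order, as in the answer above.
toy[order(date), pos := seq_len(.N), by = group]

# Position-based filter: every "" that is not the first row of its group.
by_position <- toy[value == "" & pos > 1L]
print(by_position)
```

Here `by_position` has three rows: date 2 of group "x" (part of the leading run) and dates 2 and 3 of group "y". If leading runs of empty values should be tolerated, a run-based rule is the safer choice.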

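Here is another option, which uses rleid() to number the runs of empty and non-empty values within each group and keeps the empty values that are not in the first run: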

DT[, rw := rleid(value==""), group]
DT[value=="" & rw>1L]

output:

   group date value rw
1:     a    4        2

data:

library(data.table)
set.seed(10)
DT <- data.table(group = rep(c("a","b","c","d"), each=5), 
    date = rep(1:5,4), value = c(sample(c(95:105,""),15, replace=TRUE), c("",2,3,4,5)))
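To see why the `rw > 1L` condition works, it can help to look at what `rleid()` returns on plain vectors. This is only an illustrative sketch with made-up input vectors, and it assumes the values are already in date order.

```r
library(data.table)

# Leading run of empty values: the empties at the start get run id 1,
# so they are not selected by "rw > 1L"; the later "" gets run id 3.
leading <- rleid(c("", "", "95", "", "103") == "")
leading   # 1 1 2 3 4

# Group starting with a value: the values get run id 1,
# so every "" has a run id of at least 2.
middle <- rleid(c("95", "", "", "103") == "")
middle    # 1 2 2 3
```

An empty value with a run id above 1 therefore always has some non-empty value before it, whether it sits in the middle or at the end of the group.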
