
R data.table: detect pattern of values within each group

Say I have a data.table like this:

library(data.table)
set.seed(10)
DT <- data.table(group = rep(c("a","b","c"), each=5), date = rep(1:5,3), value = sample(c(95:105,""),15, replace=TRUE))

Within each group, in the value column, I would like to check (in a simple way) whether there is an empty string (""), or a run of empty strings, that is both preceded and followed by a value.

So this is fine: "", 95, 103, etc. (the empty string comes first within the group), but the patterns below are examples of "missing data" that I would like to detect:

95, "", 103, ... (empty string in the middle)

95, "", "", 103, ... (several empty strings in the middle)

95, 103, "" (empty string at the end)
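
Taken together, these patterns reduce to one rule: an empty string counts as missing data whenever at least one non-empty value comes before it in the same group. A minimal base R sketch of that rule on plain character vectors follows; the helper name flag_gaps is made up for illustration and is not part of data.table.

```r
# Hypothetical helper: TRUE for every empty string that has at least
# one non-empty value somewhere before it in the vector.
flag_gaps <- function(x) {
  seen_value <- cumsum(x != "") > 0   # has a real value appeared yet?
  x == "" & seen_value
}

flag_gaps(c("", "95", "103"))        # leading empty: nothing flagged
flag_gaps(c("95", "", "103"))        # empty in the middle: flagged
flag_gaps(c("95", "", "", "103"))    # run of empties in the middle: both flagged
flag_gaps(c("95", "103", ""))        # empty at the end: flagged
```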

So, for the output below, I would like to be able to get the row (or group) a, and if there are many such groups, I should get all of them (groups or rows).

    group date value
 1:     a    1   105
 2:     a    2   103
 3:     a    3   104
 4:     a    4      
 5:     a    5   101
 6:     b    1   102
 7:     b    2   100
 8:     b    3   101
 9:     b    4    97
10:     b    5   102
11:     c    1   104
12:     c    2   101
13:     c    3   104
14:     c    4    96
15:     c    5   102

Edit: What I need to do is select the rows that have the wrong pattern (empty string(s) in the middle or at the end), in order to detect whether there are any errors in a large dataset. So for the table in my example, the desired output would be the 4th row, as it has a "missing value" (an empty string in between values):

     group date value
1:     a    4   

(If there were more unwanted rows, of course, I would like to get all of them.)

In case your data.table is not sorted by the 'date' column, you can use the following:

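# number the rows of each group in date order (1 = earliest date)
DT[order(date), order := seq_len(.N), by = group]
# keep empty values that are not in the first row of their group
DT[value == "" & order > 1L]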

output:

   group date value order
1:     a    4           4
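
One caveat with the order > 1L test: it only exempts the first row of each group, so if a group starts with two or more empty strings, the second one is flagged even though no value precedes it. A possible variant, sketched here on a small made-up table (the names DT2, seen and bad2 and the data are assumptions for illustration), flags an empty string only once a non-empty value has appeared earlier in date order:

```r
library(data.table)

# Made-up example, deliberately not sorted by date
DT2 <- data.table(
  group = c("a", "a", "a", "a", "b", "b", "b", "b"),
  date  = c(4L, 1L, 2L, 3L, 1L, 2L, 3L, 4L),
  value = c("101", "", "", "99", "98", "", "97", "")
)

# Walk each group in date order; seen is TRUE from the first non-empty value on
DT2[order(date), seen := cumsum(value != "") > 0, by = group]

# Empty strings that come after at least one real value
bad2 <- DT2[value == "" & seen][order(group, date)]
bad2
```

In this sketch group a starts with two empty strings and is left alone, while the two empty strings in group b (dates 2 and 4) are returned.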

The data is the same as yours:

library(data.table)
set.seed(10)
DT <- data.table(group = rep(c("a","b","c"), each=5), date = rep(1:5,3), 
                 value = sample(c(95:105,""),15, replace=TRUE))

Here is an option:

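# label consecutive runs of empty / non-empty values within each group
DT[, rw := rleid(value == ""), by = group]
# an empty value in any run after the first has a value before it
DT[value == "" & rw > 1L]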

output:

   group date value rw
1:     a    4        2
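
To see why rw > 1L works, it may help to look at what rleid returns: it numbers consecutive runs of identical values, so within a group any leading empty strings form run 1 and every later empty string lands in a run numbered 2 or higher. A small sketch on a made-up vector (the names v and runs are illustrative, and the values are assumed to be in date order already):

```r
library(data.table)

# Made-up values for one group: two leading empties, a gap, a trailing empty
v <- c("", "", "95", "", "103", "")

runs <- rleid(v == "")
runs                          # 1 1 2 3 4 5

# Empties outside the first run are the ones preceded by a value
which(v == "" & runs > 1L)    # 4 6
```

If the table is not yet sorted, calling setorder(DT, group, date) first would be one way to put it in date order before computing rw.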

data:

library(data.table)
set.seed(10)
DT <- data.table(group = rep(c("a","b","c","d"), each=5), 
    date = rep(1:5,4), value = c(sample(c(95:105,""),15, replace=TRUE), c("",2,3,4,5)))
