Remove filter based on table with NA

I'm assigning rows of data to several different groups. The main issue is that there are many groups, and not every group uses the same set of fields. I would like to set up a reference table that I could loop over or pass through a function, but I don't know how to drop a field from the filter when a group doesn't need it.

Below is sample code. I've included a version of my current solution as well as an example reference table.

library(data.table)

set.seed(1)

n <- 1000

#Sample Data
ExampleData <- data.table(sample(1:3,n,replace = TRUE),
                          sample(10:12,n,replace = TRUE),
                          sample(letters[1:3],n,replace = TRUE),
                          sample(LETTERS[1:3],n,replace = TRUE))

#Current solution
ExampleData[V1 == 1 & V2 == 11 & V4 == "C", Group := "Group1"]
ExampleData[V1 == 2, Group := "Group2"]
ExampleData[V1 == 3 & V3 == "a" & V4 == "B", Group := "Group3"]


#Example reference table
ExampleRefTable <- data.table(Group = c("Group1","Group2","Group3"),
                              V1 = c(1,2,3),
                              V2 = c(11,NA,NA),
                              V3 = c(NA,NA,"a"),
                              V4 = c("C",NA,"B"))

(Thanks to @eddi:) You could iterate over the rows/groups of the ref table with by=:

ExampleRefTable[, 
  ExampleData[copy(.SD), on = names(.SD)[!is.na(.SD)], grp := .BY$Group]
, by = Group] 

For each Group, we use .SD (the rest of the Subset of the ref table's Data) in an update join, ignoring the columns of .SD that are NA. .BY contains the per-group values of by=.
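To see which columns each group's join would use, here is a minimal sketch that evaluates the same names(.SD)[!is.na(.SD)] expression per group and collects the result. The ref table is copied from the question; the names ref and joinCols are only illustrative.

```r
library(data.table)

# Reference table from the question
ref <- data.table(Group = c("Group1", "Group2", "Group3"),
                  V1 = c(1, 2, 3),
                  V2 = c(11, NA, NA),
                  V3 = c(NA, NA, "a"),
                  V4 = c("C", NA, "B"))

# Within each Group, .SD is the one-row remainder of the ref table.
# Keep only the names of its non-NA columns; these are the on= columns.
joinCols <- ref[, .(cols = paste(names(.SD)[!is.na(.SD)], collapse = ",")),
                by = Group]
print(joinCols)
```

Each row of joinCols lists the fields that take part in that group's join, so Group2 should end up joining on V1 alone.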


(My original answer:) You could split the ref table into subsets with non-NA values:

ExampleRefTable[, gNA := .GRP, by=ExampleRefTable[, !"Group"]]

RefTabs = lapply(
  split(ExampleRefTable, by="gNA", keep.by = FALSE), 
  FUN = Filter, f = function(x) !anyNA(x)
)

which looks like

$`1`
    Group V1 V2 V4
1: Group1  1 11  C

$`2`
    Group V1
1: Group2  2

$`3`
    Group V1 V3 V4
1: Group3  3  a  B

Then iterate over these tables with update joins:

ExampleData[, Group := NA_character_]
for (i in seq_along(RefTabs)){
  RTi = RefTabs[[i]]
  nmi = setdiff(names(RTi), "Group")

  ExampleData[is.na(Group), Group := 
    RTi[copy(.SD), on=names(.SD), x.Group]
  , .SDcols=nmi][]
} 

rm(RTi, nmi)

By filtering on is.na(Group), I'm assuming that the rules in the ref table are mutually exclusive.

The copy on .SD is needed due to an open issue.

This might be more efficient than @eddi's way (at the top of this answer) if many groups share the same missing/nonmissing columns.
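Note that by = ExampleRefTable[, !"Group"] groups on the field values themselves, so two rules that use the same fields with different values still get separate gNA ids. One possible variation, assuming the goal is a single sub-table per missing/nonmissing pattern, is to group on a label built from the non-NA field names. The ref table below is hypothetical (Group4 is invented for illustration), as are the names valueCols, pattern and refTabs.

```r
library(data.table)

# Hypothetical ref table: Group1 and Group4 both use V1 and V2
ref <- data.table(Group = c("Group1", "Group2", "Group4"),
                  V1 = c(1, 2, 3),
                  V2 = c(11, NA, 12))

valueCols <- setdiff(names(ref), "Group")

# Label each rule by the names of its non-NA fields, e.g. "V1,V2"
ref[, pattern := unname(apply(is.na(.SD), 1,
                              function(z) paste(valueCols[!z], collapse = ","))),
    .SDcols = valueCols]

# Rules with the same pattern share a group id and land in one sub-table
ref[, gNA := .GRP, by = pattern]
refTabs <- split(ref, by = "pattern", keep.by = FALSE)
print(refTabs)
```

With this grouping, the update-join loop would run once per pattern rather than once per rule, which is where the efficiency gain would come from.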


If you are writing your ref table by hand, I would suggest...

rbindlist(idcol = "Group", fill = TRUE, list(
  NULL = list(V1 = numeric(), V2 = numeric(), V3 = character(), V4 = character()),
  Group1 = list(V1 = 1, V2 = 11, V4 = "C"),
  Group2 = list(V1 = 2),
  Group3 = list(V1 = 3, V3 = "a", V4 = "B")
))


    Group V1 V2   V3   V4
1: Group1  1 11 <NA>    C
2: Group2  2 NA <NA> <NA>
3: Group3  3 NA    a    B

for easier reading and editing.

We can loop through the reference table and compare each of its rows to the example data, assigning the group when all the conditions hold. This works for a reference table and data of any size, although you may want to vectorize parts of it if the data has more than roughly 100k rows:

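lenC <- ncol(ExampleRefTable)
lenT <- nrow(ExampleRefTable)
lenDat <- nrow(ExampleData)

# Placeholder is the string "NA", not a missing value
ExampleData$Group <- "NA"

for (i in 1:lenT) {
  # Group label and field values for this reference row
  Group_Assign <- ExampleRefTable$Group[i]
  Vals <- ExampleRefTable[i, 2:lenC]
  for (k in 1:lenDat) {
    # NA fields in Vals compare as NA and are ignored by na.rm = TRUE
    LogicArray <- ExampleData[k, 1:4] == Vals
    if (all(LogicArray, na.rm = TRUE)) {
      set(ExampleData, i = k, j = "Group", value = Group_Assign)
    }
  }
}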

> ExampleData
      V1 V2 V3 V4  Group
   1:  1 11  c  C Group1
   2:  2 12  c  B Group2
   3:  2 11  c  A Group2
   4:  3 12  b  B     NA
   5:  1 10  a  C     NA
  ---                   
 996:  3 12  a  B Group3
 997:  2 10  a  C Group2
 998:  1 10  a  A     NA
 999:  1 10  a  B     NA
1000:  1 11  b  C Group1
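For larger data, the row-by-row comparison can be replaced by one logical mask per reference row, built column by column. This is a minimal sketch under the same assumption that NA in the reference table means "ignore this field"; the names dat, ref, valueCols and keep are illustrative, and unmatched rows get a real NA_character_ here rather than the string "NA".

```r
library(data.table)

set.seed(1)
n <- 1000

# Sample data shaped like the question's ExampleData
dat <- data.table(V1 = sample(1:3, n, replace = TRUE),
                  V2 = sample(10:12, n, replace = TRUE),
                  V3 = sample(letters[1:3], n, replace = TRUE),
                  V4 = sample(LETTERS[1:3], n, replace = TRUE))

ref <- data.table(Group = c("Group1", "Group2", "Group3"),
                  V1 = c(1, 2, 3),
                  V2 = c(11, NA, NA),
                  V3 = c(NA, NA, "a"),
                  V4 = c("C", NA, "B"))

valueCols <- setdiff(names(ref), "Group")
dat[, Group := NA_character_]

for (i in seq_len(nrow(ref))) {
  # Start with every row selected, then narrow by each non-NA field
  keep <- rep(TRUE, nrow(dat))
  for (col in valueCols) {
    target <- ref[[col]][i]
    if (!is.na(target)) keep <- keep & dat[[col]] == target
  }
  # Assign the group to all matching rows at once
  set(dat, i = which(keep), j = "Group", value = ref$Group[i])
}
```

The loops now run over reference rows and fields only, so the work per rule is a handful of vector comparisons instead of one comparison per data row.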

This example assumes that an NA in the reference data matches any value in the example data, as long as the position is correct, e.g.:

#This is assigned Group1 since NA in the ref.table matched c in pos.3
> ExampleRefTable 
   V1 V2 V3 V4  Group 
1:  1 11 NA  C Group1
> ExampleData
   V1 V2 V3 V4  Group
1:  1 11  c  C Group1 
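The wildcard behaviour comes from how R compares against NA: the comparison itself yields NA, and all(..., na.rm = TRUE) then drops it. A small sketch with made-up vectors (vals and rowVals are illustrative, not taken from the tables above):

```r
vals    <- c(1, 11, NA)   # reference values; NA marks an unused field
rowVals <- c(1, 11, 5)    # one data row

cmp <- rowVals == vals          # third element is NA, not FALSE
matched <- all(cmp, na.rm = TRUE)  # NA is dropped, the remaining TRUEs decide
print(cmp)
print(matched)
```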

If NA is supposed to match only NA values (there were none in the example data), change the inner loop to:

for (k in 1:lenDat) {
  A <- Vals
  B <- ExampleData[k, 1:4]
  LogicArray <- B == A
  # NA positions must coincide, and all non-NA fields must be equal
  if (all(is.na(A) == is.na(B)) && all(LogicArray, na.rm = TRUE)) {
    set(ExampleData, i = k, j = "Group", value = Group_Assign)
  }
}
