
Get row(s) from data.frame that satisfy a condition composed of an arbitrary amount of sub-conditions in R

I have a data.frame that can contain N columns (N defined at runtime), and I want to get the rows within the data frame that satisfy N-1 conditions. In other words, I want only the rows with a specific value in each of the first N-1 columns.

For instance, if I have a data frame with four columns (A, B, C, D) and five rows:

A B C D
1 2 3 4
9 9 9 9
1 2 9 5
4 3 2 1
1 2 3 8

I would get all the rows with A==1 & B==2 & C==3, i.e.:

A B C D
1 2 3 4
1 2 3 8

But as said, the data frame can be composed of any number of rows and columns (defined at runtime), and the values of the conditions may change.

I implemented this function (simplified):

getRows<-function(dataFrame, values) {
  conditions=rep(TRUE, dim(dataFrame)[1])
  for (k in 1:length(values)) {
    conditions=conditions&(dataFrame[,k]==values[k])
  }
  return(dataFrame[conditions,])
}

Of course, this assumes the values in the values vector are ordered to match the column order of the data frame, and that the length of the vector is N-1.
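To make the ordering assumption concrete, here is a minimal sketch that rebuilds the sample data and matches conditions by column name instead of by position. The helper name getRowsByName and the named values vector are assumptions introduced for illustration; they are not part of the original function.

```r
# Sample data from the question
df <- data.frame(A = c(1, 9, 1, 4, 1),
                 B = c(2, 9, 2, 3, 2),
                 C = c(3, 9, 9, 2, 3),
                 D = c(4, 9, 5, 1, 8))

# Hypothetical variant: `values` is a named vector, so its order
# no longer has to follow the column order of the data frame.
getRowsByName <- function(dataFrame, values) {
  stopifnot(!is.null(names(values)),
            all(names(values) %in% names(dataFrame)))
  conditions <- rep(TRUE, nrow(dataFrame))
  for (nm in names(values)) {
    # AND the running mask with the test for this column
    conditions <- conditions & (dataFrame[[nm]] == values[[nm]])
  }
  dataFrame[conditions, , drop = FALSE]
}

# Conditions given in a different order than the columns
res <- getRowsByName(df, c(C = 3, A = 1, B = 2))
print(res)
```

The logic is the same loop as above; only the lookup changes from dataFrame[,k] to dataFrame[[nm]].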

The function works, but I have the feeling that it is not really efficient to create the boolean vector, evaluate boolean expressions in this way, and so on, especially if the data frame contains a lot of data.

Another solution that I found is:

getRows<-function(dataFrame, values) {
  tmp=dataFrame
  for (k in 1:length(values)) {
    tmp=tmp[tmp[,k]==values[k],]
  }
  return(tmp)
}

Basically, this 'reduces' the data frame by filtering out all the rows that do not satisfy each condition. But I think this is even worse, because it creates a new data frame object for each condition (always a smaller one, but still...).

So my question is: is there a method to do that more efficiently?

One possibility:

# if you are only checking for equalities
f <- function(df, values){
  # values must be a list with the columns names of df as names and the conditions
  # if you 
  y <- paste(names(values), unlist(values), sep="==", collapse=" & ")
  return(df[eval(parse(text=y), envir=df),])
  }

 l <- as.vector(1:3, "list")
 names(l) <- colnames(df)[-ncol(df)]

 f(df, l)
   A B C D
 1 1 2 3 4
 5 1 2 3 8

# you can also use other conditions
f <- function(df, values){
  # values must be a list with the columns names of df as names and the conditions
  # if you 
  y <- paste(names(values), unlist(values), collapse=" & ")
  return(df[eval(parse(text=y), envir=df),])
  }

 l <- as.vector(paste0(c("==", "<=", "=="), 1:3), "list")
 names(l) <- colnames(df)[-ncol(df)]

f(df, l)
  A B C D
1 1 2 3 4
5 1 2 3 8
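If building and parsing a string feels fragile, the same conditions can be combined without eval(parse()). The following is only a sketch of that alternative, not the code from the answer above: the helper name filterCols and its ops argument are assumptions, and it relies on base R's Map, Reduce and match.fun.

```r
# Sample data from the question
df <- data.frame(A = c(1, 9, 1, 4, 1),
                 B = c(2, 9, 2, 3, 2),
                 C = c(3, 9, 9, 2, 3),
                 D = c(4, 9, 5, 1, 8))

# Hypothetical helper: `values` is a named list (names = column names),
# `ops` holds one comparison operator per condition, "==" by default.
filterCols <- function(df, values, ops = rep("==", length(values))) {
  # One logical vector per condition
  hits <- Map(function(col, op, v) match.fun(op)(col, v),
              df[names(values)], ops, values)
  # AND all conditions together, starting from an all-TRUE mask
  keep <- Reduce(`&`, hits, rep(TRUE, nrow(df)))
  df[keep, , drop = FALSE]
}

res_eq  <- filterCols(df, list(A = 1, B = 2, C = 3))
res_ops <- filterCols(df, list(A = 1, B = 2, C = 3),
                      ops = c("==", "<=", "=="))
print(res_eq)
```

Because no text is parsed, the values can be of any type that the chosen operator accepts, including character strings that would otherwise need quoting.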

Sometimes matrices are quicker than data.frames to operate on, so something along the lines of:

mat <- t(as.matrix(df[-ncol(df)]))
boolMat <- (mat==values) # if necessary use match to reorder values to match columns of df
ind <- colSums(boolMat)==nrow(boolMat)
df[ind,]

The idea is that values will get recycled along the columns of the matrix (which are the rows of the data frame). colSums is meant to be quicker than apply, so the final line should be somewhat optimised compared to apply(boolMat, 2, all).
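A small worked example of the recycling, using the sample data from the question (the data frame is rebuilt here so the snippet runs on its own). It also checks that the colSums shortcut agrees with apply(boolMat, 2, all) on this data.

```r
# Sample data from the question
df <- data.frame(A = c(1, 9, 1, 4, 1),
                 B = c(2, 9, 2, 3, 2),
                 C = c(3, 9, 9, 2, 3),
                 D = c(4, 9, 5, 1, 8))
values <- c(1, 2, 3)

# 3 x 5 matrix: each column holds A, B, C of one data-frame row
mat <- t(as.matrix(df[-ncol(df)]))

# R compares column by column, so the length-3 `values` vector is
# recycled once per matrix column, i.e. once per data-frame row
boolMat <- mat == values

# A row matches when all of its comparisons are TRUE
ind       <- colSums(boolMat) == nrow(boolMat)
ind_apply <- apply(boolMat, 2, all)

res <- df[ind, ]
print(res)
```

One caveat worth keeping in mind: if the data contains NA, colSums returns NA for that row, so ind would need extra handling (for example na.rm or an explicit is.na check) before indexing.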

The optimal solution will depend on the size and proportions of the data, whether the entries are all integers, and maybe what proportion of matches you get in the data. So, as @droopy mentions, you'll need to benchmark. My approach involves creating a copy of the data, so if your data is already approaching memory limits, it might struggle; but then maybe you could generate your data in matrix rather than data.frame format to avoid the duplication.
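A rough way to run such a benchmark with base R only is sketched below. The data size, the column contents, and the helper names loopFilter and matFilter are assumptions chosen for illustration; timings will differ with your own data, and a dedicated benchmarking package would give more reliable numbers than a single system.time call.

```r
set.seed(1)
n <- 1e5
# Synthetic data (an assumption): three small-integer columns plus an id
big <- data.frame(A = sample(1:3, n, TRUE),
                  B = sample(1:3, n, TRUE),
                  C = sample(1:3, n, TRUE),
                  D = seq_len(n))
values <- c(1L, 2L, 3L)

# Loop over conditions, ANDing one logical vector per column
loopFilter <- function(df, values) {
  keep <- rep(TRUE, nrow(df))
  for (k in seq_along(values)) keep <- keep & (df[[k]] == values[k])
  df[keep, , drop = FALSE]
}

# Transposed-matrix approach with recycling and colSums
matFilter <- function(df, values) {
  mat <- t(as.matrix(df[seq_along(values)]))
  keep <- unname(colSums(mat == values) == nrow(mat))
  df[keep, , drop = FALSE]
}

t_loop <- system.time(r1 <- loopFilter(big, values))
t_mat  <- system.time(r2 <- matFilter(big, values))

# Elapsed seconds for each approach
print(c(loop = t_loop[["elapsed"]], matrix = t_mat[["elapsed"]]))
```

Checking identical(r1, r2) before comparing timings is a cheap way to confirm that both approaches select the same rows.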
