Faster Way to Create a Subset within a Loop or Apply Function in R

I'm new to R, so apologies in advance for bad form in my code.

I'm trying to figure out the best way to go through a dataframe, row by row, and modify a value based on logic that references other columns within that row or an entirely different dataframe. The issue is that the logic I'm using requires creating and subsetting a dataframe for each row to retrieve a minimum value. My real data set has 47,000 rows and 15 columns, so creating 47,000 subsets takes a long time.

Here are sample datasets to help describe what I'm talking about.

df1 <- data.frame('A' = c(rep("Beer", 2), rep("Chip", 2)), 'B' = c(NA, 3,
       NA,9), 'C' = 5:8, 'D' = NA)
df2 <- data.frame('Q' = c(rep("Beer", 2), rep("Chip", 2)), 'R' = 6:9, 'S' = 
       c(12, 15, 4, 18), 'T' = c(23, 45, 75, 34)) 

df1:

  A    B    C    D
 Beer  NA   5    NA
 Beer  3    6    NA
 Chip  NA   7    NA
 Chip  9    8    NA

df2:

  Q    R    S    T
 Beer  6    12    23
 Beer  7    15    45
 Chip  8    4     75
 Chip  9    18    34

This loop does what I want: it checks whether the value in column B is NA. If it isn't, that value is used for column D; if it is, the minimum value of S is retrieved from a filtered subset of df2. In the real use case I have other filtering conditions.


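library(dplyr)  # provides filter()

for (i in seq_len(nrow(df1))) {
  if (!is.na(df1$B[i])) {
    df1$D[i] <- df1$B[i]
  } else {
    # Subset df2 to the rows matching this row's A, then take the minimum of S
    x <- filter(df2, df1$A[i] == df2$Q)
    df1$D[i] <- min(x$S)
  }
}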

Everyone says to avoid loops in R, so I wrote this function using apply, which also works (although it is a little more difficult to follow):

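FUNC <- function(x) {
  apply(x, 1, function(y) {
    # apply() passes each row as a character vector: y[1] = A, y[2] = B, y[4] = D
    if (!is.na(y[2])) {
      y[4] <- y[2]
    } else {
      z <- filter(df2, y[1] == df2$Q)
      y[4] <- min(z$S)
    }
    y[4]  # value returned for this row
  })
}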

df1$D <- as.numeric(FUNC(df1))

Output:

     A    B    C    D
    Beer  NA   5    12
    Beer  3    6    3
    Chip  NA   7    4
    Chip  9    8    9

Aside question: is there a way to reference items in vector y by name instead of by index position?
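On the aside: when apply() runs over the rows of a data frame, each row is passed as a vector that carries the column names, so elements can be picked by name. A minimal sketch using the sample df1 from above (the result names b_by_name and b_by_index are only illustrative):

```r
df1 <- data.frame(A = c("Beer", "Beer", "Chip", "Chip"),
                  B = c(NA, 3, NA, 9), C = 5:8, D = NA)

# Each row arrives as a named vector, so y["B"] selects the same element as y[2]
b_by_name  <- apply(df1, 1, function(y) y["B"])
b_by_index <- apply(df1, 1, function(y) y[2])

# Because df1 mixes text and numbers, the rows are coerced to character;
# convert back before doing arithmetic
b_numeric <- as.numeric(b_by_name)
```

Note that this coercion to character is one reason apply() over data frame rows tends to be awkward for mixed-type data.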

So is there a better way to do this? Right now both methods take about 5-8 minutes to run through 47,000+ rows, which seems long to me.

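library(dplyr)

df1$D <- df2 %>% 
  rename(A = Q) %>%                         # match df1's key column name
  group_by(A) %>% 
  summarise(D = min(S)) %>%                 # one minimum of S per group
  right_join(df1, by = "A") %>%             # the summary's D becomes D.x, df1's D becomes D.y
  mutate(D = ifelse(is.na(B), D.x, B)) %>%  # keep B where present, else the group minimum
  pull(D)  # assumed final step (the source is cut off here); relies on the join keeping df1's row order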
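The idea behind the pipeline above, computing each group's minimum once instead of subsetting df2 for every row, can also be sketched in base R. This is only an illustration on the sample data, not the answerer's code; min_s and fallback are names introduced here, and any extra filtering conditions from the real use case would need to be applied to df2 before the minima are computed.

```r
# Sample data from the question
df1 <- data.frame(A = c("Beer", "Beer", "Chip", "Chip"),
                  B = c(NA, 3, NA, 9), C = 5:8, D = NA)
df2 <- data.frame(Q = c("Beer", "Beer", "Chip", "Chip"),
                  R = 6:9, S = c(12, 15, 4, 18), T = c(23, 45, 75, 34))

# One pass over df2: minimum of S for each value of Q, named by group
min_s <- tapply(df2$S, df2$Q, min)

# Look up each row's group minimum by name; this keeps df1's row order
fallback <- as.vector(min_s[as.character(df1$A)])

# Use B where it is present, otherwise the group minimum
df1$D <- ifelse(is.na(df1$B), fallback, df1$B)
```

Because the lookup is indexed by df1$A, the result lines up with df1 row by row regardless of how df1 is sorted. A value of A with no match in df2 would yield NA in D.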

