Avoid FOR loop in R programming

I have the 2 data frames below:

col1_x <- c(0123,123,234,4567,77789,4578,45588,669887,7887,5547)
col2_x <- c('X1','X8','X2','X55','C12','B11','Z1','SS12','D9','F55')
a    <- c(10,9,8,7,6,5,4,3,2,1)
DF1 <- cbind(col1_x,col2_x,a)
DF1 <- as.data.frame(DF1, stringsAsFactors = F)

col1_y <- c(012,123,56,55,78,5547)
col2_y <- c('X1','X8','S2','ER4','KL1','F55')
b    <- c(111,222,NA,NA,555,666)
DF2 <- cbind(col1_y,col2_y,b)
DF2 <- as.data.frame(DF2, stringsAsFactors = F)

Below is the code I wrote for this.

# code1
for (i in 1:nrow(DF2)) { 
  if(is.na(DF2$b[i])) {} else {
    DF1 <-mutate(DF1, 
                 a = ifelse(col1_x == DF2$col1_y[i] & col2_x == DF2$col2_y[i],
                            DF2$b[i],a) )
  }
}

# code2
if(is.na(DF2$b)) {} else {
  DF1$a <- ifelse(DF1$col1_x == DF2$col1_y & DF1$col2_x == DF2$col2_y, DF2$b, DF1$a)
}

I get the warnings below when I run code2:

Warning messages:
1: In if (is.na(Y$b)) { :
  the condition has length > 1 and only the first element will be used
2: In X$col1 == Y$col1 :
  longer object length is not a multiple of shorter object length
3: In X$col2 == Y$col2 :
  longer object length is not a multiple of shorter object length

Kindly help me fix this without using a FOR loop, as the iterations take a lot of time.

Note: code1 satisfies my requirement.

This accomplishes what your code1 does, without the warnings.

library(dplyr)  # provides left_join(), mutate(), coalesce(), select() and %>%

left_join(DF1, DF2, by = c("col1_x" = "col1_y", "col2_x" = "col2_y")) %>%
  mutate(a = coalesce(b, a)) %>%
  select(-b)
#    col1_x col2_x   a
# 1     123     X1  10
# 2     123     X8 222
# 3     234     X2   8
# 4    4567    X55   7
# 5   77789    C12   6
# 6    4578    B11   5
# 7   45588     Z1   4
# 8  669887   SS12   3
# 9    7887     D9   2
# 10   5547    F55 666

If I have correctly interpreted the results that you need, then this is far faster, more efficient, and safer than any implementation with for loops and base::ifelse (which can be problematic on its own).

To learn more about merges and joins like this, see How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272 . Really, part of data-science-y tasks is knowing how to deal with data consistently, safely, quickly, efficiently, and... safely. Yes, I said it twice. If there is anything in your code that might, just might, confuse one observation with another, all of your results and inferences are at best questionable, if not completely corrupted. (I'll get off my </soapbox> now.)
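For readers without dplyr, the same keyed update can be sketched in base R with match(). This is only a sketch under stated assumptions, not the method used in the answer above: the data frames are rebuilt with data.frame() so the numeric columns stay numeric (the question's cbind() turns everything into character), the "\r" separator is an arbitrary choice assumed not to occur in the data, and each (col1_y, col2_y) pair is assumed to appear at most once in DF2.

```r
# Rebuild the question's data (numeric columns kept numeric; an assumption).
DF1 <- data.frame(
  col1_x = c(123, 123, 234, 4567, 77789, 4578, 45588, 669887, 7887, 5547),
  col2_x = c("X1", "X8", "X2", "X55", "C12", "B11", "Z1", "SS12", "D9", "F55"),
  a      = c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  col1_y = c(12, 123, 56, 55, 78, 5547),
  col2_y = c("X1", "X8", "S2", "ER4", "KL1", "F55"),
  b      = c(111, 222, NA, NA, 555, 666),
  stringsAsFactors = FALSE
)

# One composite key per row, so both columns must match on the same row.
key1 <- paste(DF1$col1_x, DF1$col2_x, sep = "\r")
key2 <- paste(DF2$col1_y, DF2$col2_y, sep = "\r")

# For each DF1 row: index of the first matching DF2 row, or NA if none.
idx <- match(key1, key2)

# Look up b for every DF1 row; keep the old a wherever b is missing.
new_b <- DF2$b[idx]
DF1$a <- ifelse(is.na(new_b), DF1$a, new_b)

print(DF1)
```

Because match() returns only the first hit, duplicated keys in DF2 would behave differently from the original loop, which lets later rows overwrite earlier ones.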


As for your warnings:

  1. condition has length > 1...

    if statements require a length-1 conditional, period. Not length 0, not length 2 or more. Length 1. Since your Y frame (actually DF2 now) has more than 1 row, this is broken.

    Think of it this way: if (true) then do task 1 makes sense. if (true, false, false, true, true, true, true, false) do task 1 does not make sense. What should happen?

    One of two things is needed here:

    • You need if, so you should be looking at one of:

      • any(is.na(Y$b));
      • all(is.na(Y$b)); or
      • a specific one of them, such as is.na(Y$b[17]) (if there were at least 17 of them)
    • You need ifelse, which would work on a vector of logicals. (I don't think it's this one.)
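The options above can be sketched on the b column from the question. The vector is copied out of DF2 so the snippet runs on its own, and the fill value 0 in the ifelse() line is purely illustrative. Note that since R 4.2.0, passing a condition of length greater than 1 to if is an error rather than a warning.

```r
b <- c(111, 222, NA, NA, 555, 666)  # the b column from the question's DF2

is.na(b)        # a logical vector, one element per value of b
any(is.na(b))   # a single logical: is at least one value missing?
all(is.na(b))   # a single logical: is every value missing?
is.na(b[3])     # a single logical: is the third value missing?

# A length-1 condition is what if() expects.
if (any(is.na(b))) {
  message("b has at least one missing value")
}

# ifelse() is the vectorized form: one result per element of the test.
filled <- ifelse(is.na(b), 0, b)
print(filled)
```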

  2. longer object length is not a multiple of shorter object length

    This seems clear, but you don't understand why it's happening.

    Consider these questions:

    • c(1,2) == c(1,2) is really asking c(1==1, 2==2), right? Good.
    • c(1,2) == 1 is really asking c(1==1, 2==1). Good.

    (Neither of those would go in an if statement, btw :-)

    • c(1,2) == c(1,2,3,4) is confusingly not an error in R due to argument-recycling. I really think it should be an error, because many of the times it is used/relied-on, it is a mistake, and the results are corrupted/incorrect. However, this is really producing c(1==1, 2==2, 1==3, 2==4). Yup, recycling. And while it raises no warning/error, this might be useful but is often a silent mistake. This only works silently, though, when the length of one vector is an exact multiple of the length of the other vector.

    • c(1,2,9) == c(1,2,3,4,5) will try to recycle as c(1==1, 2==2, 9==3, 1==4, 2==5) (and will give results for that), but... doesn't that seem just a bit odd to you? Well, it might be okay to you, and while there might be legitimate uses of this type of recycling, more often than not (in my experience) it is a mistake in code. If you really mean this and you really know that this type of arbitrary comparison is what you want, then wrap it in suppressWarnings and don't come to me when your data results are seemingly inconsistent with the inputs.
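A small sketch of the recycling described in the bullets above. rep_len() is used here only to spell out what R does implicitly, and tryCatch() captures the warning text so the snippet prints it instead of emitting it; the variable names are illustrative.

```r
# Exact multiple (2 into 4): recycled silently, no warning.
even <- c(1, 2) == c(1, 2, 3, 4)
print(even)  # TRUE TRUE FALSE FALSE

# Spelling out the recycling by hand gives the same comparison.
explicit_even <- rep_len(c(1, 2), 4) == c(1, 2, 3, 4)

# Not a multiple (3 into 5): R still recycles, but raises a warning.
msg <- tryCatch(
  {
    c(1, 2, 9) == c(1, 2, 3, 4, 5)
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
print(msg)

# What that comparison actually computes, written without implicit recycling:
# c(1==1, 2==2, 9==3, 1==4, 2==5)
explicit_odd <- rep_len(c(1, 2, 9), 5) == c(1, 2, 3, 4, 5)
print(explicit_odd)  # TRUE TRUE FALSE FALSE FALSE
```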

    More often than not when questions pop up with this, instead of ==, people should be thinking "set operations", where they need %in%. Now, think of these:

    • c(1,2,9) %in% c(1,2,3,4,5) yields c(TRUE, TRUE, FALSE). (Length 3, not length 5.) You're asking c("is 1 in 1:5?", "is 2 in 1:5?", "is 9 in 1:5?").
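As a quick check of the point about lengths, the result of %in% always has the length of its left-hand side, whichever way round the vectors go. The variable names below are illustrative only.

```r
small <- c(1, 2, 9)
big   <- c(1, 2, 3, 4, 5)

in_big   <- small %in% big   # one answer per element of small
in_small <- big %in% small   # one answer per element of big

print(in_big)    # TRUE TRUE FALSE
print(in_small)  # TRUE TRUE FALSE FALSE FALSE
```

No recycling is involved in either direction, so no length warning can arise.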
