Efficiency of iterating through a data frame in R

Question

I am working with a large data set, let's call it data, and want to create a new column, let's call it data$results, based on some column data$input. The results are based on some conditional if/then logic, so my original approach was something like:

# Row-by-row version: one scalar test and one assignment per row
for (i in seq_len(nrow(data))) {
    data$results[i] <- if (data$input[i] == "1" || data$input[i] == "2") {
        trueAnswer
    } else {
        falseAnswer
    }
}

With large data frames, this process might take several hours to run. However, if I subset the data into one data frame containing only the entries where data$input is 1 or 2 and another containing the entries where that is not true, I can just apply trueAnswer to one data frame and falseAnswer to the other. Then I can rbind the data frames back together. This approach only takes a couple of minutes.
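The subset-and-rbind approach described above might look roughly like the sketch below. The sample data, the `id` column, and the values of `trueAnswer` and `falseAnswer` are placeholders invented for illustration, not taken from the original data set.

```r
# Hypothetical sample data; `input` stands in for the real column
data <- data.frame(id = 1:6,
                   input = c("1", "3", "2", "5", "1", "4"),
                   stringsAsFactors = FALSE)
trueAnswer  <- "yes"   # placeholder values
falseAnswer <- "no"

# One vectorised comparison decides every row at once
is_match <- data$input == "1" | data$input == "2"

matched   <- data[is_match, ]
unmatched <- data[!is_match, ]

# A single scalar is recycled to every row of each subset
matched$results   <- trueAnswer
unmatched$results <- falseAnswer

# Stack the two pieces and restore the original row order
combined <- rbind(matched, unmatched)
combined <- combined[order(combined$id), ]
```

Note that the only per-row work here is done inside vectorised operations (the comparison, the logical indexing, and the recycled assignment), which is likely where the speed-up comes from.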

Why is the latter approach using subsetting so much quicker? This process is applied over many different data sets, so the former method is too slow to be practical. I am just trying to understand what is causing the lack of efficiency in the first approach.

Answer 1

It is always advisable to provide a fully reproducible, minimal example with sample data. That way we can provide specific help based on your sample data.

In a lot of cases, explicit for loops can be avoided in R, and instead we can make use of optimised vectorised operations. For example, ifelse is such a vectorised function.

Generally the dplyr syntax would be something like this:

library(dplyr)   # also provides the %>% pipe

# Assign the result back so the new column is kept
data <- data %>%
    mutate(results = ifelse(input == 1 | input == 2, "1 or 2", "Neither 1 nor 2"))
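To get a feel for the difference on a concrete case, here is a rough base R timing sketch comparing a row-by-row loop with a single vectorised ifelse call. The data are randomly generated, the size `n` is an arbitrary assumption, and the column names `results_loop` and `vectorised` are made up for this example; actual timings will vary by machine.

```r
set.seed(1)
n <- 10000   # hypothetical size; increase to see the gap grow
data <- data.frame(input = sample(1:5, n, replace = TRUE))

# Row-by-row: a separate scalar test and data frame assignment per row
data$results_loop <- NA_character_
loop_time <- system.time({
    for (i in seq_len(n)) {
        data$results_loop[i] <- if (data$input[i] == 1 || data$input[i] == 2) {
            "1 or 2"
        } else {
            "Neither 1 nor 2"
        }
    }
})

# Vectorised: one call handles the whole column
vec_time <- system.time({
    vectorised <- ifelse(data$input == 1 | data$input == 2,
                         "1 or 2", "Neither 1 nor 2")
})

# Both approaches produce the same labels
same_result <- identical(data$results_loop, vectorised)

# Elapsed seconds for each approach
print(c(loop = loop_time[["elapsed"]], vectorised = vec_time[["elapsed"]]))
```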

Update

To see how ifelse is vectorised, take a look at ?ifelse.

Value:

A vector of the same length and attributes (including dimensions and '"class"') as 'test' and data values from the values of 'yes' or 'no'. [...]

In other words, if ifelse evaluates 100 conditions, the return object will have length 100.

This may lead to the following, perhaps surprising, results:

ifelse(c(TRUE), c(100, 200), c(300, 400))
#[1] 100

The return object is element 1 of c(100, 200) because the logical condition has length 1.

ifelse(c(TRUE, TRUE, TRUE), c(100, 200), c(300, 400))
#[1] 100 200 100

The return object has length 3 because the logical condition has length 3; since c(100, 200) only has two elements, R needs to recycle entries.
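The same recycling rule is what makes the usual data frame case work. As a small sketch (the vector `input` and the result names are made up for illustration), scalar answers are stretched to the length of the test, while full-length answers are picked element by element:

```r
input <- c(1, 3, 2, 5)

# Scalar `yes`/`no` values are recycled to the length of the test
labels <- ifelse(input %in% 1:2, "1 or 2", "Neither 1 nor 2")

# Full-length `yes`/`no` vectors are picked element by element
values <- ifelse(input %in% 1:2, input * 10, -input)
values
#[1] 10 -3 20 -5
```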

Answer 2

R's efficiency is designed around vectors, not loops. It is very rare (although it does happen) that a for or while loop is the best way to tackle a problem. In your case, you would do better to use the vectorized version of if/else: ifelse. It takes a vector of tests (e.g. data$input %in% 1:2) and two vectors of possible responses, one for each test outcome. The result has the same length as the test vector. When you give an answer of length 1, it is extended to the proper length; longer answers that are still shorter than the test are recycled silently rather than raising an error. Here, it would look like this:

data$results <- ifelse(data$input %in% 1:2, trueAnswer, falseAnswer)
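One practical difference between `%in%` and the `==` comparison used in the question shows up with missing values. The sketch below uses made-up sample data and placeholder answers, and the column names `results_in` and `results_eq` are hypothetical:

```r
# Hypothetical sample data, including a missing value
data <- data.frame(input = c("1", "3", NA, "2"), stringsAsFactors = FALSE)
trueAnswer  <- "match"      # placeholder values
falseAnswer <- "no match"

# %in% never returns NA, so the missing input gets falseAnswer
data$results_in <- ifelse(data$input %in% c("1", "2"), trueAnswer, falseAnswer)

# == propagates NA, so the missing input stays NA in the result
data$results_eq <- ifelse(data$input == "1" | data$input == "2",
                          trueAnswer, falseAnswer)

print(data)
```

Which behaviour is preferable depends on how missing inputs should be treated in the real data.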

Answer 1 posted 2018-04-15 23:10:25
Answer 2 posted 2018-04-15 23:13:26
