
Creating a new column based on two old columns in a data frame

data <- data.frame(foo = c(0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1),
                   bar = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0))

Hi, I have a data frame with two columns, foo and bar. I want to create a new column, complete, based on the values of foo and bar.

  • If foo and bar are both 0, then complete should be 0.
  • If foo is 1 and bar is 0, then complete should be 1.
  • If bar is 1 and foo is 0, then complete should be 2.

For example:

foo   bar complete
0     0   0
1     0   1
0     1   2

Edit:

If foo == 1 and bar == 1, then complete should be NA.

Following suit and using NA when both columns are 1: start with the row sums. If any of them equals 2 (the number of columns), replace it with NA. Then multiply the result by the max.col() value, which is the index of the column holding the row maximum.

rs <- rowSums(data)
cbind(data, complete = max.col(data) * replace(rs, rs == 2, NA))
#    foo bar complete
# 1    0   1        2
# 2    1   0        1
# 3    0   0        0
# 4    0   0        0
# 5    1   1       NA
# 6    0   0        0
# 7    0   1        2
# 8    0   0        0
# 9    1   0        1
# 10   1   1       NA
# 11   1   0        1
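To see why the product gives the right codes, it may help to look at the intermediate vectors for the four possible foo/bar combinations. This is only an illustrative sketch: the `combos` data frame is made up for the purpose, and `ties.method = "first"` is my own addition so that the intermediate `idx` values are reproducible (the default breaks ties at random, which does not affect the final result because tied rows are multiplied by 0 or NA).

```r
# Hypothetical four-row data frame covering every foo/bar combination
combos <- data.frame(foo = c(0, 1, 0, 1), bar = c(0, 0, 1, 1))

rs     <- rowSums(combos)                          # 0 1 1 2
idx    <- max.col(combos, ties.method = "first")   # 1 1 2 1
masked <- replace(rs, rs == 2, NA)                 # 0 1 1 NA

# Rows with no 1 are zeroed out, rows with two 1s become NA,
# and the remaining rows keep the index of the column that holds the 1
complete <- unname(idx * masked)
complete
# [1]  0  1  2 NA
```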

If you don't wish to assign new objects, you can use a local environment or wrap it up into a function:

local({
    rs <- rowSums(data)
    max.col(data) * replace(rs, rs == 2, NA)
})
# [1]  2  1  0  0 NA  0  2  0  1 NA  1
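The function variant is mentioned but not shown, so here is one way it might look. The name `complete_from_pair` is hypothetical, and the sketch assumes the data frame passed in holds exactly the two 0/1 columns.

```r
data <- data.frame(foo = c(0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1),
                   bar = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0))

# Hypothetical helper: expects a data frame with exactly two 0/1 columns
complete_from_pair <- function(df) {
  rs <- rowSums(df)                          # rs stays local to the function
  max.col(df) * replace(rs, rs == 2, NA)
}

result <- complete_from_pair(data)
result
```

If the data frame later gains other columns, passing `data[c("foo", "bar")]` keeps the helper restricted to the two inputs.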

If an algebraic approach is sought, we can try one of the lines below:

with(data, 2L * bar + foo + 0L * NA^(bar & foo))
with(data, 2L * bar + foo + NA^(bar & foo) - 1L)
with(data, (2L * bar + foo) * NA^(bar & foo))

All return:

 [1] 2 1 0 0 NA 0 2 0 1 NA 1 

Explanation

The expression 2L * bar + foo treats bar and foo as the digits of a binary number. The difficulty is to return NA in the case of foo == 1 & bar == 1. For that, bar and foo are treated as logical values. If both are 1, i.e., TRUE, then NA^(bar & foo) returns NA, otherwise 1.

If one operand of an arithmetic expression is NA, so is the overall expression. So, there are several possibilities to combine NA^(bar & foo) with 2L * bar + foo. I wonder which is the fastest.
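The pieces of the explanation can be checked on the four possible input combinations. This is a small sketch with a made-up `combos` data frame; `code` and `mask` are names introduced here only for illustration.

```r
# Hypothetical four-row data frame covering every foo/bar combination
combos <- data.frame(foo = c(0, 1, 0, 1), bar = c(0, 0, 1, 1))

NA^c(FALSE, TRUE)                      # 1 NA: NA^0 is 1, NA^1 is NA

code <- with(combos, 2L * bar + foo)   # 0 1 2 3  (binary digits bar, foo)
mask <- with(combos, NA^(bar & foo))   # 1 1 1 NA (NA only where both are 1)

# Three ways of letting the NA in `mask` propagate without changing other values
v1 <- code + 0L * mask
v2 <- code + mask - 1L
v3 <- code * mask
v1
# [1]  0  1  2 NA
```

All three variants give the same vector here; they differ only in how the mask is folded into the arithmetic.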

Benchmark

So far, 7 different approaches have been posted. The benchmark code below labels each one by its author.

The OP has supplied the sample data as type double. As I have seen remarkably different timings for integer and double values on other occasions, the benchmark runs are repeated for each type to investigate the impact of data type on the different approaches.

Benchmark data

The benchmark data will consist of 1 million rows:

n_row <- 1e6L
set.seed(1234L)
data_int <- data.frame(foo = sample(0:1, n_row, replace = TRUE),
                       bar = sample(0:1, n_row, replace = TRUE))
with(data_int, table(foo, bar))
   bar
foo      0      1
  0 249978 250330
  1 249892 249800
data_dbl <- data.frame(foo = as.double(data_int$foo),
                       bar = as.double(data_int$bar))
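As a quick sanity check on a much smaller sample, the two data frames built this way should hold the same values and differ only in storage type. The names `small_int` and `small_dbl` and the sample size of 10 are assumptions made for this sketch.

```r
set.seed(1234L)
# Small stand-ins for data_int and data_dbl (10 rows instead of 1 million)
small_int <- data.frame(foo = sample(0:1, 10L, replace = TRUE),
                        bar = sample(0:1, 10L, replace = TRUE))
small_dbl <- data.frame(foo = as.double(small_int$foo),
                        bar = as.double(small_int$bar))

types_int   <- sapply(small_int, typeof)   # both columns "integer"
types_dbl   <- sapply(small_dbl, typeof)   # both columns "double"
same_values <- all(small_int == small_dbl) # TRUE: only the storage type differs
```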

Benchmark code

For benchmarking, the microbenchmark package is used.

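# define check function to compare results:
# TRUE only if every result equals the first one (within all.equal() tolerance);
# isTRUE() is needed because all.equal() returns a character vector on mismatch
check <- function(values) {
  all(sapply(values[-1], function(x) isTRUE(all.equal(values[[1]], x))))
}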

library(dplyr)
data <- data_dbl
microbenchmark::microbenchmark(
  d.b = {
    vect = c("0 0" = 0, "1 0" = 1, "0 1" = 2)
    unname(vect[match(with(data, paste(foo, bar)), names(vect))])
  },
  Balter = with(data,ifelse(foo == 0 & bar == 0, 0,
                            ifelse(foo == 1 & bar == 0, 1,
                                   ifelse(foo == 0 & bar == 1, 2, NA)))),
  PoGibas = with(data, case_when(foo == 0 & bar == 0 ~ 0,
                                   foo == 1 & bar == 0 ~ 1,
                                   foo == 0 & bar == 1 ~ 2)),
  Rich = local({rs = rowSums(data);  max.col(data) * replace(rs, rs == 2, NA)}),
  Frank = with(data, ifelse(xor(foo, bar), max.col(data), 0*NA^foo)),
  user20650 = with(data, c(0, 1, 2, NA)[c(2*bar + foo + 1)]),
  uwe1i = with(data, 2L * bar + foo + 0L * NA^(bar & foo)),
  uwe1d = with(data, 2  * bar + foo + 0  * NA^(bar & foo)),
  uwe2i = with(data, 2L * bar + foo + NA^(bar & foo) - 1L),
  uwe2d = with(data, 2  * bar + foo + NA^(bar & foo) - 1),
  uwe3i = with(data, (2L * bar + foo) * NA^(bar & foo)),
  uwe3d = with(data, (2  * bar + foo) * NA^(bar & foo)),
  times = 11L,
  check = check)

Note that only the result vector is created, without adding a new column to data. The approach of PoGibas was modified accordingly.

As mentioned above, there might be speed differences between integer and double values. Therefore, I also wanted to test the effect of using integer constants, e.g., 0L, 1L, versus double constants 0, 1.
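When comparing the two kinds of constants, it can be useful to check which storage type each sub-expression actually ends up with. The sketch below uses short made-up integer vectors `foo` and `bar`; it only inspects types and makes no claim about timing.

```r
# Hypothetical integer inputs covering every foo/bar combination
foo <- c(0L, 1L, 0L, 1L)
bar <- c(0L, 0L, 1L, 1L)

typeof(2L * bar + foo)   # "integer": integer constant with integer input
typeof(2  * bar + foo)   # "double":  a double constant promotes the result
typeof(NA^(bar & foo))   # "double":  `^` returns a double

res_int_const <- 2L * bar + foo + 0L * NA^(bar & foo)
res_dbl_const <- 2  * bar + foo + 0  * NA^(bar & foo)

typeof(res_int_const)                     # "double"
identical(res_int_const, res_dbl_const)   # TRUE
```

Because the NA^(bar & foo) term is a double, the final result is a double whichever kind of constant is used, so the two variants differ only in the intermediate steps.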

Benchmark results

First, for input data of type double:

Unit: milliseconds
      expr        min         lq       mean     median         uq        max neval cld
       d.b 1687.05063 1700.52197 1707.72896 1706.48511 1715.46814 1730.62160    11   e
    Balter  287.89649  377.42284  412.59764  452.75668  458.21178  472.92971    11   d
   PoGibas  152.90900  154.82164  176.09522  158.23214  165.73524  333.48223    11   c
      Rich   67.43862   68.68331   76.42759   77.10620   82.42179   89.90016    11   b
     Frank  170.78293  174.66258  192.85203  179.69422  184.55237  333.74578    11   c
 user20650   20.11790   20.29744   22.32541   20.81453   21.11509   34.45654    11   a
     uwe1i   24.86296   25.13935   28.38634   25.60604   28.79395   45.53514    11   a
     uwe1d   24.90034   25.05439   28.62943   25.41460   29.47379   41.08459    11   a
     uwe2i   25.21222   25.59754   30.15579   26.29135   33.00361   47.13382    11   a
     uwe2d   24.38305   25.09385   29.46715   25.41951   29.11112   45.05486    11   a
     uwe3i   23.27334   23.95714   27.12474   24.28073   25.86336   44.40467    11   a
     uwe3d   23.23332   23.65073   27.60330   23.96620   29.53911   40.41175    11   a

Now, for input data of type integer:

Unit: milliseconds
      expr        min         lq       mean     median         uq        max neval cld
       d.b  591.71859  596.31904  607.51452  601.24232  617.13886  636.51405    11   e
    Balter  284.08896  297.06170  374.42691  303.14888  465.27859  488.19606    11   d
   PoGibas  151.75851  155.28304  174.31369  159.18364  163.50864  329.00412    11   c
      Rich   67.79770   71.22311   78.38562   77.46642   84.56777   96.55540    11   b
     Frank  166.60802  170.34078  192.19833  180.09257  182.43584  350.86681    11   c
 user20650   19.79204   20.06220   21.95963   20.18624   20.42393   30.13135    11   a
     uwe1i   27.54680   27.83169   32.36917   28.08939   37.82286   45.21722    11  ab
     uwe1d   22.60162   22.89350   25.94329   23.10419   23.74173   47.39435    11   a
     uwe2i   27.05104   27.57607   27.80843   27.68122   28.02048   28.88193    11   a
     uwe2d   22.83384   22.93522   23.22148   23.12231   23.41210   24.18633    11   a
     uwe3i   25.17371   26.44427   29.34889   26.68290   27.08276   47.71379    11   a
     uwe3d   21.68712   21.83060   26.16276   22.37659   28.40750   43.33989    11   a

For both integer and double input values, the approach by user20650 is the fastest. Next are my algebraic approaches. Third is Rich's solution, but it is about three times slower than the second.

The type of input data has the strongest impact on d.b's solution and, to a lesser extent, on Balter's. The other solutions seem to be rather invariant.

Interestingly, there seems to be no remarkable difference between using integer or double constants in my algebraic solutions.
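The fastest entry, labelled user20650 in the benchmark code, is not explained elsewhere in the thread, so here is a sketch of how I read it: 2 * bar + foo + 1 turns each foo/bar pair into a position between 1 and 4, and that position indexes a four-element lookup vector whose last slot is NA. The `combos` data frame and the names `lookup` and `idx` are made up for this illustration.

```r
# Hypothetical four-row data frame covering every foo/bar combination
combos <- data.frame(foo = c(0, 1, 0, 1), bar = c(0, 0, 1, 1))

lookup <- c(0, 1, 2, NA)                    # results for positions 1 to 4
idx    <- with(combos, 2 * bar + foo + 1)   # 1 2 3 4

complete <- lookup[idx]
complete
# [1]  0  1  2 NA
```

A single vectorised index like this avoids any branching, which is consistent with the timings above, though the sketch itself does not measure anything.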

You can create a named vector (vect in this example) and look up values from that vector using match:

vect = c("0 0" = 0, "1 0" = 1, "0 1" = 2)
unname(vect[match(with(data, paste(foo, bar)), names(vect))])
# [1]  2  1  0  0 NA  0  2  0  1 NA  1
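Broken into steps on the four possible combinations, the lookup works roughly as follows. The `combos` data frame and the intermediate names `keys` and `pos` are introduced only for this sketch.

```r
# Hypothetical four-row data frame covering every foo/bar combination
combos <- data.frame(foo = c(0, 1, 0, 1), bar = c(0, 0, 1, 1))

vect <- c("0 0" = 0, "1 0" = 1, "0 1" = 2)

keys <- with(combos, paste(foo, bar))   # "0 0" "1 0" "0 1" "1 1"
pos  <- match(keys, names(vect))        # 1 2 3 NA ("1 1" has no entry)

complete <- unname(vect[pos])           # indexing with NA yields NA
complete
# [1]  0  1  2 NA
```

The pair with no entry in vect falls through to NA automatically, so the NA rule from the question's edit needs no special handling.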

There are many ways to do this, some more efficient than others depending on how many conditions you have. But a basic way is:

data$New_Column <- with(data,ifelse(foo == 0 & bar == 0, 0,
                         ifelse(foo == 1 & bar == 0, 1,
                         ifelse(foo == 0 & bar == 1, 2, NA))))

#   foo bar New_Column
#1    0   1          2
#2    1   0          1
#3    0   0          0
#4    0   0          0
#5    1   1         NA
#6    0   0          0
#7    0   1          2
#8    0   0          0
#9    1   0          1
#10   1   1         NA
#11   1   0          1
