
Conditional row sum in R

    a   avalue  b   bvalue
1  12   yes     3   no
2  13   yes     3   yes
3  14   no      2   no
4  NA   no      1   no
5  16   NA      1   yes

I'm trying to total, for each row, the values flagged yes (a where avalue is yes, b where bvalue is yes), so the output would look like this:

   Count
1  12
2  16
3  0
4  0
5  1

Here is my solution, which is not working:

df$count <- rowSums(data[data(3) | data(5) == 'yes',c(2,4)], na.rm=TRUE)
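A likely reason the attempt fails, sketched on a small hypothetical data frame (the name `d` and its three rows are assumptions for illustration, not the asker's data): `data(3)` is a function call rather than column indexing, `==` binds more tightly than `|`, and a logical row index keeps or drops whole rows, so it cannot exclude a single value inside a row.

```r
# Hypothetical three-row stand-in for the question's data
d <- data.frame(
  a = c(12, 13, 14), avalue = c("yes", "yes", "no"),
  b = c(3, 3, 2),    bvalue = c("no", "yes", "no")
)

# Row filter: keeps every row that has at least one "yes" ...
keep <- d$avalue == "yes" | d$bvalue == "yes"
# ... but then sums BOTH values in each kept row, flagged or not
row_filtered <- unname(rowSums(d[keep, c("a", "b")]))

# Cell mask: each value is multiplied by its own flag (TRUE = 1, FALSE = 0)
masked <- d$a * (d$avalue == "yes") + d$b * (d$bvalue == "yes")
```

In this sketch the row filter returns a sum for only two rows and counts the unflagged 3 in the first row, whereas the cell mask returns one total per row.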

Edit:

OP has edited the post to include headers on the input data and, judging from the comments, wants the solution to scale to multiple column pairs. Here's a solution in base R that should do that:

raw <- "
   a   avalue  b   bvalue
1  12   yes     3   no
2  13   yes     3   yes
3  14   no      2   no
4  NA   no      1   no
5  16   NA      1   yes "

df <- read.table(text = raw, header = TRUE)

use <- endsWith(colnames(df), "value")
df[use] <- ifelse(df[use] == "yes", TRUE, FALSE)
df[is.na(df)] <- 0
rowSums(df[use] * df[!use])
#>  1  2  3  4  5 
#> 12 16  0  0  1

Created on 2021-02-20 by the reprex package (v0.3.0)
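To illustrate the scaling claim, here is a minimal sketch of the same idea on a hypothetical input with a third pair. The columns `c` and `cvalue` and their contents are invented for illustration. It is a variation on the code above rather than a copy: it builds a separate logical matrix instead of overwriting the flag columns.

```r
# Hypothetical input: the original data plus an invented third pair (c / cvalue)
raw3 <- "
   a   avalue  b   bvalue  c   cvalue
1  12   yes     3   no     5   yes
2  13   yes     3   yes    NA  yes
3  14   no      2   no     7   no
4  NA   no      1   no     2   yes
5  16   NA      1   yes    4   NA"

df3 <- read.table(text = raw3, header = TRUE)

use <- endsWith(colnames(df3), "value")          # TRUE for the flag columns
flags <- !is.na(df3[use]) & df3[use] == "yes"    # logical matrix; NA flags become FALSE
vals <- as.matrix(df3[!use])                     # numeric matrix of a, b, c
vals[is.na(vals)] <- 0                           # missing values contribute nothing
totals <- rowSums(vals * flags)
```

This relies on every flag column name ending in "value" and on the value and flag columns appearing in the same relative order.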

Original post:

Another take:

raw <- "1  12   yes     3   no
2  13   yes     3   yes
3  14   no      2   no
4  NA   no      1   no
5  16   NA      1   yes"

df <- read.table(text = raw)

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
})

df %>%
  setNames(c("row", "value_first", "use_first", "value_second", "use_second")) %>%
  pivot_longer(!row, names_to = c(".value", "column"), names_sep = "_") %>%
  replace_na(list(value = 0, use = "no")) %>%
  group_by(row) %>%
  summarise(total = sum(value * (use == "yes")))
#> # A tibble: 5 x 2
#>     row total
#> * <int> <dbl>
#> 1     1    12
#> 2     2    16
#> 3     3     0
#> 4     4     0
#> 5     5     1

Created on 2021-02-18 by the reprex package (v0.3.0)

Or, using base R, you can multiply the value columns element-wise by a logical mask of the cells whose flag column satisfies your condition, and then apply rowSums():

raw <- "1  12   yes     3   no
2  13   yes     3   yes
3  14   no      2   no
4  NA   no      1   no
5  16   NA      1   yes"

df <- read.table(text = raw)

rowSums((!is.na(df[,c(3,5)])&df[,c(3,5)]=="yes") * df[,c(2,4)], na.rm=TRUE)
#> [1] 12 16  0  0  1

## Explanation:
# 1) Build a logical mask of the cells whose flag is "yes"
(rows_select <- !is.na(df[,c(3,5)])&df[,c(3,5)]=="yes")
#>         V3    V5
#> [1,]  TRUE FALSE
#> [2,]  TRUE  TRUE
#> [3,] FALSE FALSE
#> [4,] FALSE FALSE
#> [5,] FALSE  TRUE

# 2) multiply by the columns with the data:
(rows_sel_val <- rows_select * df[,c(2,4)])
#>   V2 V4
#> 1 12  0
#> 2 13  3
#> 3  0  0
#> 4 NA  0
#> 5  0  1

# 3) Apply rowSums
rowSums(rows_sel_val, na.rm=TRUE)
#> [1] 12 16  0  0  1

Created on 2021-02-18 by the reprex package (v1.0.0)
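A short sketch of why `na.rm = TRUE` is needed in the last step, using the same five rows. The variable names `flags`, `values`, `unguarded` and `guarded` are introduced here for illustration. In R, `NA * FALSE` is still `NA`, so a missing value or a missing flag turns the whole row sum into `NA` unless it is dropped.

```r
df <- read.table(text = "1  12   yes     3   no
2  13   yes     3   yes
3  14   no      2   no
4  NA   no      1   no
5  16   NA      1   yes")

flags  <- df[, c(3, 5)] == "yes"     # logical matrix; NA where the flag is missing
values <- as.matrix(df[, c(2, 4)])   # numeric matrix; NA where the value is missing

unguarded <- rowSums(values * flags)                 # NA propagates into rows 4 and 5
guarded   <- rowSums(values * flags, na.rm = TRUE)   # NA products are dropped
```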

1) Create a new data frame df0 that has 0 wherever df has an NA, and then use the indicated formula on it. No packages are used.

df0 <- replace(df, is.na(df), 0)
transform(df, count = with(df0, a * (avalue == "yes") + b * (bvalue == "yes")))

giving:

   a avalue b bvalue count
1 12    yes 3     no    12
2 13    yes 3    yes    16
3 14     no 2     no     0
4 NA     no 1     no     0
5 16   <NA> 1    yes     1

2) Or, if there are more columns than just a and b, this gives the same result but handles any number of columns. ok picks out the a, b, etc. columns and !ok picks out the avalue, bvalue, etc. columns. Note that R will automatically recycle ok and !ok to a length equal to the number of columns.

ok <- c(TRUE, FALSE)
transform(df, count = rowSums(df[ok] * (df[!ok] == "yes"), na.rm = TRUE))
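The recycling can be seen in a minimal sketch on a hypothetical six-column frame (the frame `d` and its values are invented for illustration). The length-2 index `c(TRUE, FALSE)` is repeated across all six columns, so it selects columns 1, 3 and 5, and its negation selects 2, 4 and 6.

```r
# Hypothetical frame: three value/flag pairs in strictly alternating order
d <- data.frame(
  a = c(1, 2),     avalue = c("yes", "no"),
  b = c(10, 20),   bvalue = c("no", "yes"),
  c = c(100, 200), cvalue = c("yes", "yes")
)

ok <- c(TRUE, FALSE)            # recycled to TRUE FALSE TRUE FALSE TRUE FALSE
value_cols <- names(d[ok])      # the value columns
flag_cols  <- names(d[!ok])     # the flag columns

count <- rowSums(d[ok] * (d[!ok] == "yes"), na.rm = TRUE)
```

The approach depends on column position only: it assumes the columns alternate value, flag, value, flag and start with a value column.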

2a) Using the collapse package, a variation on (2) is to use num_vars and cat_vars, which pick out the numeric and categorical columns.

Note that if any of the numeric columns is all NA, it must be set using NA_real_ or NA_integer_ and not just NA, since num_vars extracts columns by type. This can be checked by ensuring that logi_vars(df) has no columns (since an ordinary NA is logical); alternatively, just use (2) if any column might be all NA.

library(collapse)

transform(df, count = rowSums(num_vars(df0) * (cat_vars(df0) == "yes"), na.rm = TRUE))
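The type caveat can be checked in base R without collapse. In this minimal sketch the frame `d` and its column names are invented for illustration: a column built from bare `NA` is logical, while `NA_real_` and `NA_integer_` give numeric columns.

```r
# Three all-NA columns that differ only in the type of NA used
d <- data.frame(
  plain = c(NA, NA),
  real  = c(NA_real_, NA_real_),
  int   = c(NA_integer_, NA_integer_)
)

col_classes <- sapply(d, class)       # storage class of each column
is_num      <- sapply(d, is.numeric)  # which columns a type-based selector would treat as numeric
```

Here `plain` is reported as logical and not numeric, which is why a selection by type would skip it.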

Note

The input in reproducible form is:

Lines <- "
    a   avalue  b   bvalue
1  12   yes     3   no
2  13   yes     3   yes
3  14   no      2   no
4  NA   no      1   no
5  16   NA      1   yes"
df <- read.table(text = Lines)

I'd tidy my data and calculate the sum by group, using the tidyverse:

library(tidyverse)
df<-read.table(text = "1  12   yes     3   no
2  13   yes     3   yes
3  14   no      2   no
4  NA   no      1   no
5  16   NA      1   yes")

bind_rows(df[1:3], setNames(df[c(1,4:5)], paste0("V",1:3))) %>%
group_by(V1, V3) %>%
summarise(sum(V2, na.rm = TRUE))

#> Groups:   V1 [5]
#>     V1 V3    `sum(V2, na.rm = TRUE)`
#>  <int> <chr>                   <int>
#>1     1 no                          3
#>2     1 yes                        12
#>3     2 yes                        16
#>4     3 no                         16
#>5     4 no                          1
#>6     5 yes                         1
#>7     5 <NA>                       16
