
Sum a column based on condition in another column in R

I have a dataset with 150+ columns and thousands of rows. The dataset flags each item against various categories, one category per column. One of the columns holds the total usage for each item across the categories. Below is a sample of the dataset:

Values   A B C
1        Y   
2          Y
3        Y   Y 
4            Y 

I want to use R to do calculations such that I get the following results:

     Count  Sum
A      2     4
B      1     2
C      2     7

Basically, I want the Count column to give me the number of "Y" flags in each of columns A, B and C, and the Sum column to give me the sum of the usage column (Values in the sample) over the rows where that column contains a "Y".

Step 2 - I have similar column values in 200+ files. I have brought all the files into one folder. What I would like to do is take the calculation above, apply it to each file, and then have the answer grouped by file and category. For example:

File 1 Count A Sum A Count B Sum B Count C Sum C

File 2 Count A Sum A Count B Sum B Count C Sum C

and so on

Here's one simple (step-by-step) solution:

# First, reading your data (blank cells entered as NA)
> df <- read.table(text="Values   A  B  C
+ 1        Y  NA NA
+ 2        NA Y  NA
+ 3        Y  NA Y 
+ 4        NA NA Y ", header=TRUE)
> 
> Count <- colSums(!is.na(df[, -1]))
> Sum <- apply(!is.na(df[,-1]), 2, function(x) sum(df$Values[x]))
> data.frame(Count, Sum)
  Count Sum
A     2   4
B     1   2
C     2   7
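The solution above relies on blank cells already being NA. If the raw file stores blanks as empty strings instead, one option is to convert them while reading. The following is a minimal sketch, assuming the data is comma-separated; the `csv_text` variable is a made-up stand-in for the real file.

```r
# Hypothetical CSV version of the sample data, with empty cells for "no flag".
csv_text <- "Values,A,B,C
1,Y,,
2,,Y,
3,Y,,Y
4,,,Y"

# na.strings turns empty fields into NA so the !is.na() logic applies unchanged.
df <- read.csv(text = csv_text, na.strings = c("", "NA"),
               stringsAsFactors = FALSE)

Count <- colSums(!is.na(df[, -1]))                       # number of flags per column
Sum <- apply(!is.na(df[, -1]), 2,
             function(x) sum(df$Values[x]))              # usage total per column
result <- data.frame(Count, Sum)
print(result)
```

Note that this treats any non-missing entry as a flag, so it only suits columns that contain "Y" or nothing.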

An alternative using tidyr and dplyr:

library(tidyr)
library(dplyr)
df %>% gather(id, vals, -Values) %>% group_by(id) %>%
        summarise(Count = sum(vals=="Y"), 
                  Sum = sum(Values[vals=="Y"]))
#      id Count   Sum
#  (fctr) (int) (int)
#1      A     2     4
#2      B     1     2
#3      C     2     7

Data

df <- structure(list(Values = 1:4, A = structure(c(2L, 1L, 2L, 1L), .Label = c("", 
"Y"), class = "factor"), B = structure(c(1L, 2L, 1L, 1L), .Label = c("", 
"Y"), class = "factor"), C = structure(c(1L, 1L, 2L, 2L), .Label = c("", 
"Y"), class = "factor")), .Names = c("Values", "A", "B", "C"), class = "data.frame", row.names = c(NA, 
-4L))

Here is a data.table approach. Convert the 'data.frame' to a 'data.table' ( setDT(df1) , where df1 is the input data), melt it to 'long' format and group by 'id'. Within each group, sum the logical vector value == "Y" to get the 'Count', then subset the 'Values' that correspond to the "Y" elements in 'value' and sum them to get the 'Sum'.

library(data.table)
melt(setDT(df1), id.vars="Values", variable.name="id")[, {
           i1 <- value == "Y"
           .(Count = sum(i1), Sum = sum(Values[i1]))
           } ,  by = id]
#   id Count Sum
#1:  A     2   4
#2:  B     1   2
#3:  C     2   7
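The same melt-then-group idea can be written without extra packages. This is only a base R sketch of the technique, not the data.table code itself; it assumes blanks are empty strings, and the `long`, `hit` and `res` names are illustrative.

```r
# Sample data with empty strings for "no flag" (an assumption about the input).
df1 <- data.frame(Values = 1:4,
                  A = c("Y", "", "Y", ""),
                  B = c("", "Y", "", ""),
                  C = c("", "", "Y", "Y"),
                  stringsAsFactors = FALSE)

# Long format: stack() returns columns 'values' (cell content) and 'ind' (column name).
long <- data.frame(Values = rep(df1$Values, times = ncol(df1) - 1),
                   stack(df1[, -1]))
names(long) <- c("Values", "value", "id")

# TRUE where the cell is flagged; multiplying by it zeroes out unflagged usage.
hit <- long$value == "Y"
res <- data.frame(Count = as.vector(tapply(hit, long$id, sum)),
                  Sum = as.vector(tapply(long$Values * hit, long$id, sum)),
                  row.names = levels(long$id))
print(res)
```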

Sometimes it's easiest to just build a new data.frame from calculations on the old one:

# read in data
df <- read.table(text = 'Values   A B C
                         1        Y N N 
                         2        N Y N
                         3        Y N Y 
                         4        N N Y', header = TRUE)

data.frame(Count = colSums(df[,-1] == 'Y'),    # count of "Y"s in each column
           # sum of Values column where A/B/C is "Y"
           Sum = sapply(df[,-1], function(x){sum(df$Values[x == 'Y'])}))

#   Count Sum
# A     2   4
# B     1   2
# C     2   7
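None of the answers above covers Step 2 of the question (one row per file). A possible sketch in base R follows, assuming every file is a CSV with a usage column named Values and "Y" flags elsewhere. The `summarise_flags` helper, the temporary folder and the two demo files are all hypothetical stand-ins for the real 200+ files.

```r
# Hypothetical helper: returns one named row (Count.A, Sum.A, Count.B, ...)
# for a single data frame. Column name "Values" and flag "Y" are assumptions.
summarise_flags <- function(df, value_col = "Values", flag = "Y") {
  flag_cols <- setdiff(names(df), value_col)
  out <- lapply(flag_cols, function(col) {
    hit <- !is.na(df[[col]]) & df[[col]] == flag   # flagged rows, NA-safe
    c(sum(hit), sum(df[[value_col]][hit]))         # count, then usage total
  })
  row <- unlist(out)
  names(row) <- paste(c("Count", "Sum"), rep(flag_cols, each = 2), sep = ".")
  row
}

# Demo only: write two small files to a temporary folder.
dir <- tempfile("flags_")
dir.create(dir)
sample1 <- data.frame(Values = 1:4, A = c("Y", "", "Y", ""),
                      B = c("", "Y", "", ""), C = c("", "", "Y", "Y"))
sample2 <- data.frame(Values = c(10, 20), A = c("Y", "Y"),
                      B = c("", "Y"), C = c("", ""))
write.csv(sample1, file.path(dir, "file1.csv"), row.names = FALSE)
write.csv(sample2, file.path(dir, "file2.csv"), row.names = FALSE)

# Apply the helper to every CSV in the folder and bind the rows together.
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
rows <- lapply(files, function(f) {
  summarise_flags(read.csv(f, stringsAsFactors = FALSE))
})
result <- data.frame(File = basename(files), do.call(rbind, rows))
print(result)
```

For the real data, point `list.files()` at the folder holding the files; binding rows this way assumes every file has the same flag columns in the same order.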
