
Adding values using rowSums and tidyverse

I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset, so I am reproducing a sample of fake data.

Here's what the data looks like (the real data has 800 columns).

library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset

What I want to do is sum the columns in buckets of 100 columns. For example: all the values in the first row between column 1 and column 100, all the values in the first row between column 101 and column 200, all the values in the second row between column 1 and column 100, etc.

Using the sample data, I've come up with this solution using rowSums.

dataset %>%
  mutate_if(~!is.numeric(.x), as.numeric) %>%
  mutate_all(funs(replace_na(., 0)))  %>%
  mutate(sum = rowSums(.[,paste("a", 1:3, sep="")])) %>%
  mutate(sum1 = rowSums(.[,paste("a", 4:5, sep="")])) %>%
  mutate(sum2 = rowSums(.[,paste("a", 6:8, sep="")]))

but I am getting the following error:

Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected

as the data does not include column a7.

The original data is missing a bunch of columns between a1 and a800, so solving this is key to making it work.

What would be the best way to approach and solve this error?

Also, I have a few more questions regarding the code I've written:

  • Is there a smarter way to select the columns a1 to a100 than this approach: .[,paste("a", 1:3, sep="")]? I am interested in selecting the columns by name. I do not want to select them by position, because a100 is not necessarily the 100th column.

  • Also, I am converting the NAs and NaNs to 0 in order to be able to sum the rows. I am doing it with mutate_all(funs(replace_na(., 0))), which loses my first column, the one that contains the names. What would be the best way to replace NA and NaN without turning the string values of that column into 0?

  • The columns I am adding are integers, as I converted them beforehand with mutate_if(~!is.numeric(.x), as.numeric). Should I follow the same approach if I have dbl columns?

Thank you!

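Here is one way to do this: after transforming the data to longer format, for each name we create groups of n rows and take the sum of each group. Note that this buckets the columns by their position, not by the number in their name.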

library(dplyr)
library(tidyr)

n <- 2 # Number of columns per bucket. Change this to 100 for your case.

dataset %>%
  # One row per (name, column) pair
  pivot_longer(cols = -name, names_to = 'col') %>%
  group_by(name) %>%
  # Within each name, number consecutive runs of n rows.
  # `.add` is the dplyr >= 1.0.0 spelling of the older `add` argument.
  group_by(grp = rep(seq_len(n()), each = n, length.out = n()), .add = TRUE) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  # If needed in wider format again
  pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')

#  name   col1  col2  col3  col4
#  <chr> <dbl> <dbl> <dbl> <dbl>
#1 A         2     2     2     1
#2 B         4     4     4     2
#3 C         3     6     3     3
#4 D         9     8     9     4
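
If the buckets should follow the number in the column name rather than the column position (so that a missing a7 simply leaves its bucket smaller), something along these lines might work. This is only a sketch in base R, not part of the answer above: it assumes a plain data.frame (call as.data.frame() on a data.table first), assumes every value column is named "a" followed by a number, and uses a hypothetical bucket_size of 3 in place of 100. Passing na.rm = TRUE to rowSums skips both NA and NaN, so the name column never needs to be touched.

```r
# Sample data mirroring the question; a7 is deliberately absent.
dataset <- data.frame(
  name = c("A", "B", "C", "D"),
  a1 = 1:4, a2 = c(1, 2, NaN, 5), a3 = 1:4, a4 = 1:4,
  a5 = c(1, 2, NA, 5), a6 = 1:4, a8 = 1:4
)

bucket_size <- 3  # assumption for the sample; use 100 for the real data

# Keep only columns named "a<number>" and read the number from the name.
value_cols <- grep("^a[0-9]+$", names(dataset), value = TRUE)
col_number <- as.integer(sub("^a", "", value_cols))

# a1..a3 -> bucket 1, a4..a6 -> bucket 2, a7..a9 -> bucket 3, and so on.
bucket <- (col_number - 1) %/% bucket_size + 1

# Sum each bucket row-wise; na.rm = TRUE ignores NA and NaN.
sums <- sapply(split(value_cols, bucket), function(cols) {
  rowSums(dataset[, cols, drop = FALSE], na.rm = TRUE)
})
colnames(sums) <- paste0("sum", colnames(sums))

# Attach the untouched name column to the bucket sums.
result <- cbind(dataset["name"], sums)
result
```

Because only columns that actually exist are selected, the "undefined columns selected" error cannot occur here: with the sample data, bucket 3 contains just a8. Integer and double columns can be mixed freely, since rowSums returns doubles either way.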
