
Adding values using rowSums and tidyverse

I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset, so I am reproducing a sample of fake data.

Here's what the data looks like (I have 800 columns).

library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset

What I want to do is sum the columns in buckets of 100 columns. For example: all the values in the first row between the first column and column 100, all the values in the first row between column 1 and column 200, all the values in the second row between the first column and column 100, etc.

Using the sample data, I've come up with this solution using rowSums.

dataset %>%
  mutate_if(~!is.numeric(.x), as.numeric) %>%
  mutate_all(funs(replace_na(., 0)))  %>%
  mutate(sum = rowSums(.[,paste("a", 1:3, sep="")])) %>%
  mutate(sum1 = rowSums(.[,paste("a", 4:5, sep="")])) %>%
  mutate(sum2 = rowSums(.[,paste("a", 6:8, sep="")]))

but I am getting the following error:

Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected

as the data does not include column a7.

The original data is missing a bunch of columns between a1 and a800, so solving this is key to making it work.

What would be the best way to approach and solve this error?

Also, I have a few more questions regarding the code I've written:

  • Is there a smarter way to select the columns from a1 to a100 instead of using this approach: .[,paste("a", 1:3, sep="")]? I am interested in selecting the columns by name. I do not want to select them by position, because a100 is not always the 100th column.

  • Also, I am converting the NAs and NaNs to 0 in order to be able to sum the rows. I am doing it this way: mutate_all(funs(replace_na(., 0))), which loses my first column, the one that contains the names of the values. What would be the best way to replace NA and NaN without changing the string values of the first column to 0?

  • The type of the columns I am adding is integer, as I converted them beforehand with mutate_if(~!is.numeric(.x), as.numeric). Should I follow the same approach in case I have dbl?

Thank you!

Here is one way to do this after transforming the data to a longer format: for each name, we create groups of n rows and take the sum of each group.

library(dplyr)
library(tidyr)

n <- 2 #No of columns to bucket. Change this to 100 for your case.

dataset %>%
  pivot_longer(cols = -name, names_to = 'col') %>%
  group_by(name) %>%
  # .add = TRUE keeps the existing grouping by name (dplyr >= 1.0.0; older versions use add = TRUE)
  group_by(grp = rep(seq_len(n()), each = n, length.out = n()), .add = TRUE) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  #If needed in wider format again
  pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')

#  name   col1  col2  col3  col4
#  <chr> <dbl> <dbl> <dbl> <dbl>
#1 A         2     2     2     1
#2 B         4     4     4     2
#3 C         3     6     3     3
#4 D         9     8     9     4
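
Note that the approach above buckets columns by position. If the buckets should follow the column names instead (so that a missing a7 simply leaves a smaller bucket), something along these lines might work. This is only a sketch in base R: it assumes every value column is named "a" followed by a number, and bucket_size, value_cols, col_num, bucket, sums and result are illustrative names that do not appear in the question. Using na.rm = TRUE in rowSums also avoids replacing NA and NaN beforehand, so the name column is left untouched.

```r
# Sample data from the question, built as a plain data.frame
dataset <- data.frame(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1, 2, NaN, 5),
                      a3 = 1:4, a4 = 1:4, a5 = c(1, 2, NA, 5), a6 = 1:4, a8 = 1:4)

# No-op for a data.frame; matters if dataset is a data.table,
# whose `[` does not accept a character vector of names in j by default
df <- as.data.frame(dataset)

bucket_size <- 3  # assumption for the sample; would be 100 for a1..a800

# Value columns and their numeric suffix, e.g. "a8" -> 8
value_cols <- grep("^a[0-9]+$", names(df), value = TRUE)
col_num <- as.integer(sub("^a", "", value_cols))

# Bucket id derived from the name, not the position:
# a1..a3 -> 1, a4..a6 -> 2, a7..a9 -> 3
bucket <- (col_num - 1) %/% bucket_size + 1

# One row sum per bucket, using only the columns that actually exist;
# na.rm = TRUE skips both NA and NaN
sums <- sapply(split(value_cols, bucket), function(cols) {
  rowSums(df[, cols, drop = FALSE], na.rm = TRUE)
})
colnames(sums) <- paste0("sum", colnames(sums))

result <- cbind(df["name"], sums)
result
```

With the sample data, the third bucket contains only a8, since a7 and a9 do not exist, and no "undefined columns selected" error is raised.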
