Mutating column in `dplyr` using `rowSums`

Recently I stumbled upon a strange behaviour of dplyr, and I would be happy if somebody could provide some insight.

Assume I have a data frame in which some columns contain numerical values. In a simple scenario I would like to compute rowSums. Although there are many ways to do it, here are two examples:

df <- data.frame(matrix(rnorm(20), 10, 2),
                 ids = paste("i", 1:20, sep = ""),
                 stringsAsFactors = FALSE)

# works
dplyr::select(df, - ids) %>% {rowSums(.)}

# does not work
# Error: invalid argument to unary operator
df %>%
  dplyr::mutate(blubb = dplyr::select(df, - ids) %>% {rowSums(.)})

# does not work
# Error: invalid argument to unary operator
df %>%
  dplyr::mutate(blubb = dplyr::select(., - ids) %>% {rowSums(.)})

# workaround:
tmp <- dplyr::select(df, - ids) %>% {rowSums(.)}
df %>%
  dplyr::mutate(blubb = tmp)

# works
rowSums(dplyr::select(df, - ids))

# does not work
# Error: invalid argument to unary operator
df %>%
  dplyr::mutate(blubb = rowSums(dplyr::select(df, - ids)))

# workaround
tmp <- rowSums(dplyr::select(df, - ids))
df %>%
  dplyr::mutate(blubb = tmp)

First, I don't really understand what is causing the error, and second, I would like to know how to actually compute over some (viable) columns in a tidy way.

Edit

The question "mutate and rowSums exclude columns", although related, focuses on using rowSums for computation. Here I'm eager to understand why the examples above do not work. It is not so much about how to solve the problem (see the workarounds) as about understanding what happens when the naive approach is applied.

The examples do not work because you are nesting select in mutate and using bare variable names. In this case, select is trying to do something like

> -df$ids
Error in -df$ids : invalid argument to unary operator

which fails because you can't negate a character string (i.e. -"i1" or -"i2" makes no sense). Either of the formulations below works:

df %>% mutate(blubb = rowSums(select_(., "X1", "X2")))
df %>% mutate(blubb = rowSums(select(., -3)))

or

df %>% mutate(blubb = rowSums(select_(., "-ids")))

as suggested by @Haboryme.
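The underlying failure can be reproduced in base R without dplyr. This is only a minimal sketch: the vector `ids` and the variable `res` are illustrative names, and `tryCatch` is used just to capture the error message instead of stopping the script.

```r
# Applying unary minus to a character vector fails in base R,
# independently of dplyr.
ids <- c("i1", "i2")

# Capture the error message rather than aborting.
res <- tryCatch(-ids, error = function(e) conditionMessage(e))
print(res)
```

The captured message is the same "invalid argument to unary operator" error reported in the question.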

select_ is deprecated. You can use:

library(dplyr)
df <- data.frame(matrix(rnorm(20), 10, 2),
                 ids = paste("i", 1:20, sep = ""),
                 stringsAsFactors = FALSE)
df %>% 
  mutate(blubb = rowSums(select(., .dots = c("X1", "X2"))))

# Or more generally:
desired_columns <- c("X1", "X2")
df %>% 
  mutate(blubb = rowSums(select(., .dots = all_of(desired_columns))))

select can now accept bare column names, so there is no need to use .dots or select_, which has been deprecated.

Here are a few of the approaches that work now.

library(dplyr)

#sum all the columns except `ids`.
df %>% mutate(blubb = rowSums(select(., -ids), na.rm = TRUE))

#sum X1 and X2 columns
df %>% mutate(blubb = rowSums(select(., X1, X2), na.rm = TRUE))

#sum all the columns that start with 'X'
df %>% mutate(blubb = rowSums(select(., starts_with('X')), na.rm = TRUE))

#sum all the numeric columns
df %>% mutate(blubb = rowSums(select(., where(is.numeric))))
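As a possible alternative that avoids the `.` placeholder, `across()` (available in dplyr 1.0.0 and later) returns the selected columns as a data frame inside `mutate()`, so it can be passed straight to `rowSums()`. The sketch below uses a small made-up data frame with the same column names as above. The name `out` is only illustrative.

```r
library(dplyr)

set.seed(1)
df <- data.frame(X1 = rnorm(5), X2 = rnorm(5),
                 ids = paste0("i", 1:5),
                 stringsAsFactors = FALSE)

# across() hands rowSums() a data frame of the selected columns,
# so no explicit reference to the piped data is needed.
out <- df %>% mutate(blubb = rowSums(across(where(is.numeric))))
print(out)
```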

Adding to this old thread because I searched on this question and then realized I was asking the wrong question. Also, I detect some yearning in this and related questions for a proper pipe-steps way to do this.

The answers here are somewhat non-intuitive because they are trying to use the dplyr vernacular with non-"tidy" data. If you want to do it the dplyr way, make the data tidy first using gather(), and then use summarise().

library(tidyverse)

df <- data.frame(matrix(rnorm(20), 10, 2),
                 ids = paste("i", 1:20, sep = ""),
                 stringsAsFactors = FALSE)

df %>% gather(key=Xn,value="value",-ids) %>% 
  group_by(ids) %>% 
  summarise(rowsum=sum(value))

#> # A tibble: 20 x 2
#>    ids   rowsum
#>    <chr>       <dbl>
#>  1 i1          0.942
#>  2 i10        -0.330
#>  3 i11         0.942
#>  4 i12        -0.721
#>  5 i13         2.50 
#>  6 i14        -0.611
#>  7 i15        -0.799
#>  8 i16         1.84 
#>  9 i17        -0.629
#> 10 i18        -1.39 
#> 11 i19         1.44 
#> 12 i2         -0.721
#> 13 i20        -0.330
#> 14 i3          2.50 
#> 15 i4         -0.611
#> 16 i5         -0.799
#> 17 i6          1.84 
#> 18 i7         -0.629
#> 19 i8         -1.39 
#> 20 i9          1.44

If you care about the order of the ids when they are not sortable using arrange(), make that column a factor first.

df %>%
  mutate(ids=as_factor(ids)) %>% 
  gather(key=Xn,value="value",-ids) %>% 
  group_by(ids) %>% 
  summarise(rowsum=sum(value))
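In current tidyr, `gather()` is superseded by `pivot_longer()`. The same reshape-then-summarise idea might look like the sketch below, which uses a small made-up data frame with unique `ids`. The name `long_sums` is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(X1 = c(1, 2, 3),
                 X2 = c(10, 20, 30),
                 ids = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Reshape to one row per (ids, column) pair, then sum within each id.
long_sums <- df %>%
  pivot_longer(cols = -ids, names_to = "Xn", values_to = "value") %>%
  group_by(ids) %>%
  summarise(rowsum = sum(value))
print(long_sums)
```

Note that this approach groups by `ids`, so rows sharing an id would be collapsed into a single sum.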

Why do you want to use the pipe operator? Just write an expression such as:

rowSums(df[,sapply(df, is.numeric)])

i.e. calculate the row sums over all the numeric columns, with the advantage of not needing to specify ids.
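One caveat with the base R expression: if only a single column is numeric, `df[, ...]` drops to a plain vector and `rowSums()` fails, so adding `drop = FALSE` is a safer variant. A minimal sketch, using a made-up one-column example (`df1`, `num_cols` and `sums` are illustrative names):

```r
# Only one numeric column: without drop = FALSE the subset would be a vector.
df1 <- data.frame(X1 = c(1, 2, 3),
                  ids = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

num_cols <- sapply(df1, is.numeric)

# drop = FALSE keeps the result a data frame, which rowSums() accepts.
sums <- rowSums(df1[, num_cols, drop = FALSE])
print(sums)
```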

If you want to save your results as a column within the data, you can use data.table syntax like this:

library(data.table)
dt <- as.data.table(df)
dt[, x3 := rowSums(.SD, na.rm = TRUE), .SDcols = which(sapply(dt, is.numeric))]
