Applying a function to every row of a table using dplyr?

When working with plyr I often found it useful to use adply for scalar functions that I have to apply to each and every row.

E.g.:

data(iris)
library(plyr)
head(
     adply(iris, 1, transform , Max.Len= max(Sepal.Length,Petal.Length))
    )
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
1          5.1         3.5          1.4         0.2  setosa     5.1
2          4.9         3.0          1.4         0.2  setosa     4.9
3          4.7         3.2          1.3         0.2  setosa     4.7
4          4.6         3.1          1.5         0.2  setosa     4.6
5          5.0         3.6          1.4         0.2  setosa     5.0
6          5.4         3.9          1.7         0.4  setosa     5.4

Now that I'm using dplyr more, I'm wondering if there is a tidy/natural way to do this? This is NOT what I want:

library(dplyr)
head(
     mutate(iris, Max.Len= max(Sepal.Length,Petal.Length))
    )
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
1          5.1         3.5          1.4         0.2  setosa     7.9
2          4.9         3.0          1.4         0.2  setosa     7.9
3          4.7         3.2          1.3         0.2  setosa     7.9
4          4.6         3.1          1.5         0.2  setosa     7.9
5          5.0         3.6          1.4         0.2  setosa     7.9
6          5.4         3.9          1.7         0.4  setosa     7.9

As of dplyr 0.2 (I think) rowwise() is implemented, so the answer to this problem becomes:

iris %>% 
  rowwise() %>% 
  mutate(Max.Len= max(Sepal.Length,Petal.Length))
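
For context, the plain mutate() call in the question returns 7.9 on every row because max() collapses all of its arguments into a single value, which is then recycled down the column. A minimal base R sketch of the difference, using made-up vectors x and y (not the iris data):

```r
# Hypothetical toy vectors standing in for two data frame columns
x <- c(5.1, 4.9, 7.9)
y <- c(1.4, 6.9, 1.3)

# max() pools every value from both vectors and returns one number
overall_max <- max(x, y)

# Calling max() once per pair of elements gives one result per "row"
per_row_max <- mapply(max, x, y)

print(overall_max)
print(per_row_max)
```

rowwise() makes mutate() behave like the second call: the expression is evaluated once per row rather than once per column.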

Non-rowwise alternative

Five years (!) later this answer still gets a lot of traffic. Since it was given, rowwise is increasingly not recommended, although lots of people seem to find it intuitive. Do yourself a favour and go through Jenny Bryan's "Row-oriented workflows in R with the tidyverse" material to get a good handle on this topic.

The most straightforward way I have found is based on one of Hadley's examples using pmap:

iris %>% 
  mutate(Max.Len= purrr::pmap_dbl(list(Sepal.Length, Petal.Length), max))

Using this approach, you can give an arbitrary number of arguments to the function (.f) inside pmap.

pmap is a good conceptual approach because it reflects the fact that when you're doing row-wise operations you're actually working with tuples from a list of vectors (the columns in a data frame).

The idiomatic approach is to create an appropriately vectorised function.

R provides pmax, which is suitable here; it also provides Vectorize as a wrapper for mapply, which lets you create a vectorised version of an arbitrary function.
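
As a sketch of the "arbitrary number of arguments" point, the same pattern extends to three columns, or to a function with named arguments. The column names Max.Any and Len.Diff below are made up for illustration, and the sketch assumes dplyr and purrr are installed:

```r
library(dplyr)
library(purrr)

res <- iris %>%
  mutate(
    # Three columns passed positionally: max() is called once per row
    Max.Any = pmap_dbl(list(Sepal.Length, Sepal.Width, Petal.Length), max),
    # Named list elements are matched to the function's argument names
    Len.Diff = pmap_dbl(
      list(s = Sepal.Length, p = Petal.Length),
      function(s, p) s - p
    )
  )

head(res)
```

pmap_dbl() is used rather than pmap() so that the result is a plain numeric column instead of a list column.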

library(dplyr)
# use base R pmax (vectorized in C)
iris %>% mutate(max.len = pmax(Sepal.Length, Petal.Length))
# use vectorize to create your own function
# for example, a horribly inefficient get first non-Na value function
# a version that is not vectorized
coalesce <- function(a,b) {r <- c(a[1],b[1]); r[!is.na(r)][1]}
# a vectorized version
Coalesce <- Vectorize(coalesce, vectorize.args = c('a','b'))
# some example data
df <- data.frame(a = c(1:5,NA,7:10), b = c(1:3,NA,NA,6,NA,10:8))
df %>% mutate(ab =Coalesce(a,b))

Note that implementing the vectorization in C/C++ will be faster, but there isn't a magicPony package that will write the function for you.

You need to group by row:
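
To make the effect of Vectorize concrete, here is a small base R sketch that compares the scalar function with its vectorised wrapper on made-up inputs. The names first_non_na and first_non_na_v are hypothetical; they are used instead of coalesce to avoid masking dplyr::coalesce(), which already provides a vectorised version of this operation:

```r
# Scalar version: only looks at the first element of each argument
first_non_na <- function(a, b) {
  r <- c(a[1], b[1])
  r[!is.na(r)][1]
}

# Vectorize() wraps the function in mapply(), so it is called once per pair
first_non_na_v <- Vectorize(first_non_na, vectorize.args = c("a", "b"))

a <- c(1, NA, 3, NA)
b <- c(NA, 2, 30, NA)

scalar_result <- first_non_na(a, b)    # a single value, from a[1] and b[1] only
vector_result <- first_non_na_v(a, b)  # one value per position

print(scalar_result)
print(vector_result)
```

The wrapper still calls the R function once per element, which is why it is convenient but not fast.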

iris %>% group_by(1:n()) %>% mutate(Max.Len= max(Sepal.Length,Petal.Length))

This is what the 1 did in adply.

Update 2017-08-03

After writing this, Hadley changed some stuff again. The functions that used to be in purrr are now in a new mixed package called purrrlyr, described as:
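
A variant of the same idea, sketched under the assumption that you do not want the helper grouping column left in the result: give the row index an explicit name (row_id here is a made-up name), group by it, then ungroup and drop it.

```r
library(dplyr)

result <- iris %>%
  mutate(row_id = row_number()) %>%   # explicit, named row index
  group_by(row_id) %>%                # one group per row
  mutate(Max.Len = max(Sepal.Length, Petal.Length)) %>%
  ungroup() %>%                       # stop treating each row as a group
  select(-row_id)                     # drop the helper column

head(result)
```

With group_by(1:n()) the grouping column is kept in the output under the name `1:n()`, and the result stays grouped until ungroup() is called.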

purrrlyr contains some functions that lie at the intersection of purrr and dplyr. They have been removed from purrr in order to make the package lighter and because they have been replaced by other solutions in the tidyverse.

So, you will need to install and load that package to make the code below work.

Original post

Hadley frequently changes his mind about what we should use, but I think we are supposed to switch to the functions in purrr to get the by-row functionality. At least, they offer the same functionality and have almost the same interface as adply from plyr.

There are two related functions, by_row and invoke_rows. My understanding is that you use by_row when you want to loop over rows and add the results to the data.frame. invoke_rows is used when you loop over rows of a data.frame and pass each column as an argument to a function. We will only use the first.

Examples

library(tidyverse)
library(purrrlyr)  # by_row() now lives here (see the update above)

iris %>% 
  by_row(..f = function(this_row) {
    browser()
  })

This lets us see the internals (so we can see what we are doing), which is the same as doing it with adply.

Called from: ..f(.d[[i]], ...)
Browse[1]> this_row
# A tibble: 1 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1          5.1         3.5          1.4         0.2  setosa
Browse[1]> Q

By default, by_row adds a list column based on the output:

iris %>% 
  by_row(..f = function(this_row) {
      this_row[1:4] %>% unlist %>% mean
  })

gives:

# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species      .out
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>    <list>
1           5.1         3.5          1.4         0.2  setosa <dbl [1]>
2           4.9         3.0          1.4         0.2  setosa <dbl [1]>
3           4.7         3.2          1.3         0.2  setosa <dbl [1]>
4           4.6         3.1          1.5         0.2  setosa <dbl [1]>
5           5.0         3.6          1.4         0.2  setosa <dbl [1]>
6           5.4         3.9          1.7         0.4  setosa <dbl [1]>
7           4.6         3.4          1.4         0.3  setosa <dbl [1]>
8           5.0         3.4          1.5         0.2  setosa <dbl [1]>
9           4.4         2.9          1.4         0.2  setosa <dbl [1]>
10          4.9         3.1          1.5         0.1  setosa <dbl [1]>
# ... with 140 more rows

If instead we return a data.frame, we get a list column of data.frames:

iris %>% 
  by_row( ..f = function(this_row) {
    data.frame(
      new_col_mean = this_row[1:4] %>% unlist %>% mean,
      new_col_median = this_row[1:4] %>% unlist %>% median
    )
  })

gives:

# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species                 .out
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>               <list>
1           5.1         3.5          1.4         0.2  setosa <data.frame [1 × 2]>
2           4.9         3.0          1.4         0.2  setosa <data.frame [1 × 2]>
3           4.7         3.2          1.3         0.2  setosa <data.frame [1 × 2]>
4           4.6         3.1          1.5         0.2  setosa <data.frame [1 × 2]>
5           5.0         3.6          1.4         0.2  setosa <data.frame [1 × 2]>
6           5.4         3.9          1.7         0.4  setosa <data.frame [1 × 2]>
7           4.6         3.4          1.4         0.3  setosa <data.frame [1 × 2]>
8           5.0         3.4          1.5         0.2  setosa <data.frame [1 × 2]>
9           4.4         2.9          1.4         0.2  setosa <data.frame [1 × 2]>
10          4.9         3.1          1.5         0.1  setosa <data.frame [1 × 2]>
# ... with 140 more rows

How the output of the function is added is controlled by the .collate param. There are three options: list, rows, cols. When our output has length 1, it doesn't matter whether we use rows or cols.

iris %>% 
  by_row(.collate = "cols", ..f = function(this_row) {
    this_row[1:4] %>% unlist %>% mean
  })

iris %>% 
  by_row(.collate = "rows", ..f = function(this_row) {
    this_row[1:4] %>% unlist %>% mean
  })

both produce:

# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  .out
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr> <dbl>
1           5.1         3.5          1.4         0.2  setosa 2.550
2           4.9         3.0          1.4         0.2  setosa 2.375
3           4.7         3.2          1.3         0.2  setosa 2.350
4           4.6         3.1          1.5         0.2  setosa 2.350
5           5.0         3.6          1.4         0.2  setosa 2.550
6           5.4         3.9          1.7         0.4  setosa 2.850
7           4.6         3.4          1.4         0.3  setosa 2.425
8           5.0         3.4          1.5         0.2  setosa 2.525
9           4.4         2.9          1.4         0.2  setosa 2.225
10          4.9         3.1          1.5         0.1  setosa 2.400
# ... with 140 more rows

If we output a data.frame with 1 row, it matters only slightly which we use:

iris %>% 
  by_row(.collate = "cols", ..f = function(this_row) {
    data.frame(
      new_col_mean = this_row[1:4] %>% unlist %>% mean,
      new_col_median = this_row[1:4] %>% unlist %>% median
      )
  })

iris %>% 
  by_row(.collate = "rows", ..f = function(this_row) {
    data.frame(
      new_col_mean = this_row[1:4] %>% unlist %>% mean,
      new_col_median = this_row[1:4] %>% unlist %>% median
    )
  })

both give:

# A tibble: 150 × 8
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  .row new_col_mean new_col_median
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr> <int>        <dbl>          <dbl>
1           5.1         3.5          1.4         0.2  setosa     1        2.550           2.45
2           4.9         3.0          1.4         0.2  setosa     2        2.375           2.20
3           4.7         3.2          1.3         0.2  setosa     3        2.350           2.25
4           4.6         3.1          1.5         0.2  setosa     4        2.350           2.30
5           5.0         3.6          1.4         0.2  setosa     5        2.550           2.50
6           5.4         3.9          1.7         0.4  setosa     6        2.850           2.80
7           4.6         3.4          1.4         0.3  setosa     7        2.425           2.40
8           5.0         3.4          1.5         0.2  setosa     8        2.525           2.45
9           4.4         2.9          1.4         0.2  setosa     9        2.225           2.15
10          4.9         3.1          1.5         0.1  setosa    10        2.400           2.30
# ... with 140 more rows

except that the second has the column called .row and the first does not.

Finally, if our output is longer than length 1, either as a vector or as a data.frame with rows, then it matters whether we use rows or cols for .collate:

mtcars[1:2] %>% by_row(function(x) 1:5)
mtcars[1:2] %>% by_row(function(x) 1:5, .collate = "rows")
mtcars[1:2] %>% by_row(function(x) 1:5, .collate = "cols")

produces, respectively:

# A tibble: 32 × 3
     mpg   cyl      .out
   <dbl> <dbl>    <list>
1   21.0     6 <int [5]>
2   21.0     6 <int [5]>
3   22.8     4 <int [5]>
4   21.4     6 <int [5]>
5   18.7     8 <int [5]>
6   18.1     6 <int [5]>
7   14.3     8 <int [5]>
8   24.4     4 <int [5]>
9   22.8     4 <int [5]>
10  19.2     6 <int [5]>
# ... with 22 more rows

# A tibble: 160 × 4
     mpg   cyl  .row  .out
   <dbl> <dbl> <int> <int>
1     21     6     1     1
2     21     6     1     2
3     21     6     1     3
4     21     6     1     4
5     21     6     1     5
6     21     6     2     1
7     21     6     2     2
8     21     6     2     3
9     21     6     2     4
10    21     6     2     5
# ... with 150 more rows

# A tibble: 32 × 7
     mpg   cyl .out1 .out2 .out3 .out4 .out5
   <dbl> <dbl> <int> <int> <int> <int> <int>
1   21.0     6     1     2     3     4     5
2   21.0     6     1     2     3     4     5
3   22.8     4     1     2     3     4     5
4   21.4     6     1     2     3     4     5
5   18.7     8     1     2     3     4     5
6   18.1     6     1     2     3     4     5
7   14.3     8     1     2     3     4     5
8   24.4     4     1     2     3     4     5
9   22.8     4     1     2     3     4     5
10  19.2     6     1     2     3     4     5
# ... with 22 more rows

So, bottom line: if you want the adply(.margins = 1, ...) functionality, you can use by_row.

Extending BrodieG's answer,

If the function returns more than one row, then do() must be used instead of mutate(). Then, to combine the pieces back together, use rbind_all() from the dplyr package.

In dplyr version dplyr_0.1.2, using 1:n() in the group_by() clause doesn't work for me. Hopefully Hadley will implement rowwise() soon.

iris %>%
    group_by(1:nrow(iris)) %>%
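    do(do_fn(.)) %>%   # do_fn takes a one-row data frame and returns a data frame
    rbind_all()        # bind_rows() in later dplyr versions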
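
A self-contained sketch of the same pattern, assuming a made-up helper expand_row() that turns each input row into two output rows. It uses do(), which still exists in dplyr although it is marked as superseded:

```r
library(dplyr)

# Hypothetical toy input: three rows
d <- data.frame(a = c(0.1, 0.2, 0.3))

# Hypothetical row function: receives a one-row data frame,
# returns a two-row data frame
expand_row <- function(row) {
  data.frame(a = row$a, b = seq_len(2))
}

out <- d %>%
  mutate(row_id = row_number()) %>%
  group_by(row_id) %>%
  do(expand_row(.)) %>%   # one call per group, i.e. per original row
  ungroup()

print(out)
```

do() binds the per-group results together itself, so no separate row-binding step is needed in this sketch.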

Testing the performance,

library(plyr)    # plyr_1.8.4.9000
library(dplyr)   # dplyr_0.8.0.9000
library(purrr)   # purrr_0.2.99.9000
library(microbenchmark)

d1_count <- 1000
d2_count <- 10

d1 <- data.frame(a=runif(d1_count))

do_fn <- function(row){data.frame(a=row$a, b=runif(d2_count))}
do_fn2 <- function(a){data.frame(a=a, b=runif(d2_count))}

op <- microbenchmark(
        plyr_version = plyr::adply(d1, 1, do_fn),
        dplyr_version = d1 %>%
            dplyr::group_by(1:nrow(d1)) %>%
            dplyr::do(do_fn(.)) %>%
            dplyr::bind_rows(),
        purrr_version = d1 %>% purrr::pmap_dfr(do_fn2),
        times=50)

It has the following results:

Unit: milliseconds
          expr       min        lq      mean    median        uq       max neval
  plyr_version 1227.2589 1275.1363 1317.3431 1293.5759 1314.4266 1616.5449    50
 dplyr_version  977.3025 1012.6340 1035.9436 1025.6267 1040.5882 1449.0978    50
 purrr_version  609.5790  629.7565  643.8498  644.2505  656.1959  686.8128    50

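In this benchmark the purrr version is the fastest (median about 644 ms, versus about 1026 ms for the dplyr version and about 1294 ms for the plyr version).

Something like this?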

iris$Max.Len <- pmax(iris$Sepal.Length, iris$Petal.Length)

In addition to the great answer provided by @alexwhan, please keep in mind that you need to use ungroup() to avoid side effects. This is because rowwise() is a grouping operation.

iris %>%
    rowwise() %>%
    mutate(Max.Len = max(Sepal.Length, Petal.Length))

will give you:

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>
 1          5.1         3.5          1.4         0.2 setosa      5.1
 2          4.9         3            1.4         0.2 setosa      4.9
 3          4.7         3.2          1.3         0.2 setosa      4.7
 4          4.6         3.1          1.5         0.2 setosa      4.6
 5          5           3.6          1.4         0.2 setosa      5  
 6          5.4         3.9          1.7         0.4 setosa      5.4
 7          4.6         3.4          1.4         0.3 setosa      4.6
 8          5           3.4          1.5         0.2 setosa      5  
 9          4.4         2.9          1.4         0.2 setosa      4.4
10          4.9         3.1          1.5         0.1 setosa      4.9

Now let's assume that you need to continue the dplyr pipe to add a lead to Max.Len:

iris %>%
    rowwise() %>%
    mutate(Max.Len = max(Sepal.Length, Petal.Length)) %>%
    mutate(Lead.Max.Len = lead(Max.Len))

This will produce:

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len Lead.Max.Len
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>        <dbl>
 1          5.1         3.5          1.4         0.2 setosa      5.1           NA
 2          4.9         3            1.4         0.2 setosa      4.9           NA
 3          4.7         3.2          1.3         0.2 setosa      4.7           NA
 4          4.6         3.1          1.5         0.2 setosa      4.6           NA
 5          5           3.6          1.4         0.2 setosa      5             NA
 6          5.4         3.9          1.7         0.4 setosa      5.4           NA
 7          4.6         3.4          1.4         0.3 setosa      4.6           NA
 8          5           3.4          1.5         0.2 setosa      5             NA
 9          4.4         2.9          1.4         0.2 setosa      4.4           NA
10          4.9         3.1          1.5         0.1 setosa      4.9           NA

NAs are produced as a side effect, because lead() is evaluated within each one-row group. This can be corrected with ungroup():

iris %>%
    rowwise() %>%
    mutate(Max.Len = max(Sepal.Length, Petal.Length)) %>%
    ungroup() %>%
    mutate(Lead.Max.Len = lead(Max.Len))

This will produce the desired output:

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len Lead.Max.Len
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>        <dbl>
 1          5.1         3.5          1.4         0.2 setosa      5.1          4.9
 2          4.9         3            1.4         0.2 setosa      4.9          4.7
 3          4.7         3.2          1.3         0.2 setosa      4.7          4.6
 4          4.6         3.1          1.5         0.2 setosa      4.6          5  
 5          5           3.6          1.4         0.2 setosa      5            5.4
 6          5.4         3.9          1.7         0.4 setosa      5.4          4.6
 7          4.6         3.4          1.4         0.3 setosa      4.6          5  
 8          5           3.4          1.5         0.2 setosa      5            4.4
 9          4.4         2.9          1.4         0.2 setosa      4.4          4.9
10          4.9         3.1          1.5         0.1 setosa      4.9          5.4
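
One way to check whether a data frame is still in row-wise mode is to inspect its class. The sketch below assumes a dplyr version in which rowwise() returns an object of class "rowwise_df"; the variable names rw and flat are made up:

```r
library(dplyr)

rw <- iris %>%
  rowwise() %>%
  mutate(Max.Len = max(Sepal.Length, Petal.Length))

# Still row-wise: later verbs are evaluated one row at a time
still_rowwise <- inherits(rw, "rowwise_df")

# After ungroup(), later verbs see whole columns again
flat <- ungroup(rw)
no_longer_rowwise <- !inherits(flat, "rowwise_df")

flat <- mutate(flat, Lead.Max.Len = lead(Max.Len))

print(still_rowwise)
print(no_longer_rowwise)
head(flat)
```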

Just for completeness, I am going to adapt the code from a forgotten answer (and maybe the best answer) to the question "Sum across multiple columns" and apply it to your problem:

iris %>%
  mutate(max = select(.,c('Sepal.Length','Petal.Length')) %>% 
  apply(1, max, na.rm=TRUE))

The result is as expected. The accepted answer says that rowwise is increasingly not recommended, and apply is base R, so you don't need to import an extra package like purrr.

You can use the apply() function with max, min, sum, median, mean, so it's very handy and simple.
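
A base R sketch of the same idea with several summary functions. The output column names (Row.Max, Row.Min, Row.Mean) are made up. One caveat worth keeping in mind: apply() converts its input to a matrix first, so it should only be given columns of a single type (here, two numeric columns); mixing in a factor or character column would turn everything into character.

```r
# Work on numeric columns only, because apply() coerces to a matrix
cols <- iris[, c("Sepal.Length", "Petal.Length")]

iris_out <- iris
iris_out$Row.Max  <- apply(cols, 1, max,  na.rm = TRUE)
iris_out$Row.Min  <- apply(cols, 1, min,  na.rm = TRUE)
iris_out$Row.Mean <- apply(cols, 1, mean, na.rm = TRUE)

head(iris_out)
```

For sums and means specifically, base R's rowSums() and rowMeans() do the same job without a per-row function call.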
