Applying a function to every row of a table using dplyr?
When working with plyr I often found it useful to use adply for scalar functions that I have to apply to each and every row, e.g.:
data(iris)
library(plyr)
head(
adply(iris, 1, transform , Max.Len= max(Sepal.Length,Petal.Length))
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3.0 1.4 0.2 setosa 4.9
3 4.7 3.2 1.3 0.2 setosa 4.7
4 4.6 3.1 1.5 0.2 setosa 4.6
5 5.0 3.6 1.4 0.2 setosa 5.0
6 5.4 3.9 1.7 0.4 setosa 5.4
Now that I'm using dplyr more, I'm wondering if there is a tidy/natural way to do this, as this is NOT what I want:
library(dplyr)
head(
mutate(iris, Max.Len= max(Sepal.Length,Petal.Length))
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
1 5.1 3.5 1.4 0.2 setosa 7.9
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.9
5 5.0 3.6 1.4 0.2 setosa 7.9
6 5.4 3.9 1.7 0.4 setosa 7.9
As of dplyr 0.2 (I think) rowwise() is implemented, so the answer to this problem becomes:
iris %>%
rowwise() %>%
mutate(Max.Len= max(Sepal.Length,Petal.Length))
Non-rowwise alternative
Five years (!) later this answer still gets a lot of traffic. Since it was given, rowwise is increasingly not recommended, although lots of people seem to find it intuitive. Do yourself a favour and go through Jenny Bryan's "Row-oriented workflows in R with the tidyverse" material to get a good handle on this topic.
The most straightforward way I have found is based on one of Hadley's examples using pmap:
iris %>%
mutate(Max.Len= purrr::pmap_dbl(list(Sepal.Length, Petal.Length), max))
Using this approach, you can give an arbitrary number of arguments to the function (.f) inside pmap.
pmap is a good conceptual approach because it reflects the fact that when you're doing row-wise operations you're actually working with tuples from a list of vectors (the columns in a data frame).
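To illustrate the "arbitrary number of arguments" point, here is a minimal sketch with three columns. The helper row_range and the new column name Range are hypothetical, introduced only for this example; the names of the list passed to pmap_dbl are matched against the helper's argument names.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: receives one value per column for a single row
# and returns one number (largest minus smallest measurement).
row_range <- function(Sepal.Length, Sepal.Width, Petal.Length) {
  vals <- c(Sepal.Length, Sepal.Width, Petal.Length)
  max(vals) - min(vals)
}

result <- iris %>%
  mutate(Range = pmap_dbl(
    # a named list of columns; each "row" of this list is one tuple
    list(Sepal.Length = Sepal.Length,
         Sepal.Width  = Sepal.Width,
         Petal.Length = Petal.Length),
    row_range
  ))

head(result)
```

Naming the list elements is a design choice here: it lets pmap_dbl match arguments by name rather than by position, so the column order in the list no longer matters.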
The idiomatic approach will be to create an appropriately vectorised function.
R provides pmax, which is suitable here; however, it also provides Vectorize as a wrapper for mapply to allow you to create a vectorised version of an arbitrary function.
library(dplyr)
# use base R pmax (vectorized in C)
iris %>% mutate(max.len = pmax(Sepal.Length, Petal.Length))
# use vectorize to create your own function
# for example, a horribly inefficient get first non-Na value function
# a version that is not vectorized
coalesce <- function(a,b) {r <- c(a[1],b[1]); r[!is.na(r)][1]}
# a vectorized version
Coalesce <- Vectorize(coalesce, vectorize.args = c('a','b'))
# some example data
df <- data.frame(a = c(1:5,NA,7:10), b = c(1:3,NA,NA,6,NA,10:8))
df %>% mutate(ab =Coalesce(a,b))
Note that implementing the vectorization in C/C++ will be faster, but there isn't a magicPony package that will write the function for you.
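To make the relationship between Vectorize and mapply concrete, here is a small base R sketch. The scalar helper first_non_na and the input vectors are made up for illustration; the point is only that the vectorised wrapper and a direct mapply call give the same result.

```r
# Hypothetical scalar helper: works on one pair of values at a time.
first_non_na <- function(a, b) if (!is.na(a)) a else b

a <- c(1, NA, 3)
b <- c(10, 20, NA)

# Vectorize() returns a new function that loops over a and b element-wise.
v1 <- Vectorize(first_non_na)(a, b)

# The same loop written directly with mapply().
v2 <- mapply(first_non_na, a, b)

print(v1)
print(identical(v1, v2))
```

Either form still calls the R function once per element, which is why a natively vectorised function such as pmax is preferable when one exists.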
You need to group by row:
iris %>% group_by(1:n()) %>% mutate(Max.Len= max(Sepal.Length,Petal.Length))
This is what the 1 did in adply.
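One thing to be aware of with the group-by-row approach: the result stays grouped, and the grouping expression is kept as an extra column. A minimal sketch of one way to tidy up afterwards, assuming a reasonably recent dplyr; row_id is an illustrative column name, not something from the original answer:

```r
library(dplyr)

result <- iris %>%
  mutate(row_id = row_number()) %>%   # explicit per-row key (illustrative name)
  group_by(row_id) %>%                # one group per row
  mutate(Max.Len = max(Sepal.Length, Petal.Length)) %>%
  ungroup() %>%                       # drop the grouping again
  select(-row_id)                     # and remove the helper column

head(result)
```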
After writing this, Hadley changed some stuff again. The functions that used to be in purrr are now in a new mixed package called purrrlyr, described as:
purrrlyr contains some functions that lie at the intersection of purrr and dplyr. They have been removed from purrr in order to make the package lighter and because they have been replaced by other solutions in the tidyverse.
So, you will need to install + load that package to make the code below work.
Hadley frequently changes his mind about what we should use, but I think we are supposed to switch to the functions in purrr to get the by-row functionality. At least, they offer the same functionality and have almost the same interface as adply from plyr.
There are two related functions, by_row and invoke_rows. My understanding is that you use by_row when you want to loop over rows and add the results to the data.frame. invoke_rows is used when you loop over the rows of a data.frame and pass each column as an argument to a function. We will only use the first.
library(tidyverse)
library(purrrlyr) # by_row() and invoke_rows() now live in purrrlyr
iris %>%
by_row(..f = function(this_row) {
browser()
})
This lets us see the internals (so we can see what we are doing), which is the same as doing it with adply.
Called from: ..f(.d[[i]], ...)
Browse[1]> this_row
# A tibble: 1 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
Browse[1]> Q
By default, by_row adds a list column based on the output:
iris %>%
by_row(..f = function(this_row) {
this_row[1:4] %>% unlist %>% mean
})
gives:
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species .out
<dbl> <dbl> <dbl> <dbl> <fctr> <list>
1 5.1 3.5 1.4 0.2 setosa <dbl [1]>
2 4.9 3.0 1.4 0.2 setosa <dbl [1]>
3 4.7 3.2 1.3 0.2 setosa <dbl [1]>
4 4.6 3.1 1.5 0.2 setosa <dbl [1]>
5 5.0 3.6 1.4 0.2 setosa <dbl [1]>
6 5.4 3.9 1.7 0.4 setosa <dbl [1]>
7 4.6 3.4 1.4 0.3 setosa <dbl [1]>
8 5.0 3.4 1.5 0.2 setosa <dbl [1]>
9 4.4 2.9 1.4 0.2 setosa <dbl [1]>
10 4.9 3.1 1.5 0.1 setosa <dbl [1]>
# ... with 140 more rows
If instead we return a data.frame, we get a list column of data.frames:
iris %>%
by_row( ..f = function(this_row) {
data.frame(
new_col_mean = this_row[1:4] %>% unlist %>% mean,
new_col_median = this_row[1:4] %>% unlist %>% median
)
})
gives:
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species .out
<dbl> <dbl> <dbl> <dbl> <fctr> <list>
1 5.1 3.5 1.4 0.2 setosa <data.frame [1 × 2]>
2 4.9 3.0 1.4 0.2 setosa <data.frame [1 × 2]>
3 4.7 3.2 1.3 0.2 setosa <data.frame [1 × 2]>
4 4.6 3.1 1.5 0.2 setosa <data.frame [1 × 2]>
5 5.0 3.6 1.4 0.2 setosa <data.frame [1 × 2]>
6 5.4 3.9 1.7 0.4 setosa <data.frame [1 × 2]>
7 4.6 3.4 1.4 0.3 setosa <data.frame [1 × 2]>
8 5.0 3.4 1.5 0.2 setosa <data.frame [1 × 2]>
9 4.4 2.9 1.4 0.2 setosa <data.frame [1 × 2]>
10 4.9 3.1 1.5 0.1 setosa <data.frame [1 × 2]>
# ... with 140 more rows
How we add the output of the function is controlled by the .collate parameter. There are three options: list, rows, cols. When our output has length 1, it doesn't matter whether we use rows or cols.
iris %>%
by_row(.collate = "cols", ..f = function(this_row) {
this_row[1:4] %>% unlist %>% mean
})
iris %>%
by_row(.collate = "rows", ..f = function(this_row) {
this_row[1:4] %>% unlist %>% mean
})
both produce:
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species .out
<dbl> <dbl> <dbl> <dbl> <fctr> <dbl>
1 5.1 3.5 1.4 0.2 setosa 2.550
2 4.9 3.0 1.4 0.2 setosa 2.375
3 4.7 3.2 1.3 0.2 setosa 2.350
4 4.6 3.1 1.5 0.2 setosa 2.350
5 5.0 3.6 1.4 0.2 setosa 2.550
6 5.4 3.9 1.7 0.4 setosa 2.850
7 4.6 3.4 1.4 0.3 setosa 2.425
8 5.0 3.4 1.5 0.2 setosa 2.525
9 4.4 2.9 1.4 0.2 setosa 2.225
10 4.9 3.1 1.5 0.1 setosa 2.400
# ... with 140 more rows
If we output a data.frame with 1 row, it matters only slightly which we use:
iris %>%
by_row(.collate = "cols", ..f = function(this_row) {
data.frame(
new_col_mean = this_row[1:4] %>% unlist %>% mean,
new_col_median = this_row[1:4] %>% unlist %>% median
)
})
iris %>%
by_row(.collate = "rows", ..f = function(this_row) {
data.frame(
new_col_mean = this_row[1:4] %>% unlist %>% mean,
new_col_median = this_row[1:4] %>% unlist %>% median
)
})
both give:
# A tibble: 150 × 8
Sepal.Length Sepal.Width Petal.Length Petal.Width Species .row new_col_mean new_col_median
<dbl> <dbl> <dbl> <dbl> <fctr> <int> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1 2.550 2.45
2 4.9 3.0 1.4 0.2 setosa 2 2.375 2.20
3 4.7 3.2 1.3 0.2 setosa 3 2.350 2.25
4 4.6 3.1 1.5 0.2 setosa 4 2.350 2.30
5 5.0 3.6 1.4 0.2 setosa 5 2.550 2.50
6 5.4 3.9 1.7 0.4 setosa 6 2.850 2.80
7 4.6 3.4 1.4 0.3 setosa 7 2.425 2.40
8 5.0 3.4 1.5 0.2 setosa 8 2.525 2.45
9 4.4 2.9 1.4 0.2 setosa 9 2.225 2.15
10 4.9 3.1 1.5 0.1 setosa 10 2.400 2.30
# ... with 140 more rows
except that the second has the column called .row and the first does not.
Finally, if our output is longer than length 1, either as a vector or as a data.frame with several rows, then it matters whether we use rows or cols for .collate:
mtcars[1:2] %>% by_row(function(x) 1:5)
mtcars[1:2] %>% by_row(function(x) 1:5, .collate = "rows")
mtcars[1:2] %>% by_row(function(x) 1:5, .collate = "cols")
produces, respectively:
# A tibble: 32 × 3
mpg cyl .out
<dbl> <dbl> <list>
1 21.0 6 <int [5]>
2 21.0 6 <int [5]>
3 22.8 4 <int [5]>
4 21.4 6 <int [5]>
5 18.7 8 <int [5]>
6 18.1 6 <int [5]>
7 14.3 8 <int [5]>
8 24.4 4 <int [5]>
9 22.8 4 <int [5]>
10 19.2 6 <int [5]>
# ... with 22 more rows
# A tibble: 160 × 4
mpg cyl .row .out
<dbl> <dbl> <int> <int>
1 21 6 1 1
2 21 6 1 2
3 21 6 1 3
4 21 6 1 4
5 21 6 1 5
6 21 6 2 1
7 21 6 2 2
8 21 6 2 3
9 21 6 2 4
10 21 6 2 5
# ... with 150 more rows
# A tibble: 32 × 7
mpg cyl .out1 .out2 .out3 .out4 .out5
<dbl> <dbl> <int> <int> <int> <int> <int>
1 21.0 6 1 2 3 4 5
2 21.0 6 1 2 3 4 5
3 22.8 4 1 2 3 4 5
4 21.4 6 1 2 3 4 5
5 18.7 8 1 2 3 4 5
6 18.1 6 1 2 3 4 5
7 14.3 8 1 2 3 4 5
8 24.4 4 1 2 3 4 5
9 22.8 4 1 2 3 4 5
10 19.2 6 1 2 3 4 5
# ... with 22 more rows
So, bottom line. If you want the adply(.margins = 1, ...) functionality, you can use by_row.
Extending BrodieG's answer,
If the function returns more than one row, then instead of mutate(), do() must be used. Then to combine it back together, use rbind_all() from the dplyr package.
In dplyr version dplyr_0.1.2, using 1:n() in the group_by() clause doesn't work for me. Hopefully Hadley will implement rowwise() soon.
iris %>%
group_by(1:nrow(iris)) %>%
do(do_fn) %>%
rbind_all()
Testing the performance,
library(plyr) # plyr_1.8.4.9000
library(dplyr) # dplyr_0.8.0.9000
library(purrr) # purrr_0.2.99.9000
library(microbenchmark)
d1_count <- 1000
d2_count <- 10
d1 <- data.frame(a=runif(d1_count))
do_fn <- function(row){data.frame(a=row$a, b=runif(d2_count))}
do_fn2 <- function(a){data.frame(a=a, b=runif(d2_count))}
op <- microbenchmark(
plyr_version = plyr::adply(d1, 1, do_fn),
dplyr_version = d1 %>%
dplyr::group_by(1:nrow(d1)) %>%
dplyr::do(do_fn(.)) %>%
dplyr::bind_rows(),
purrr_version = d1 %>% purrr::pmap_dfr(do_fn2),
times=50)
it has the following results:
Unit: milliseconds
expr min lq mean median uq max neval
plyr_version 1227.2589 1275.1363 1317.3431 1293.5759 1314.4266 1616.5449 50
dplyr_version 977.3025 1012.6340 1035.9436 1025.6267 1040.5882 1449.0978 50
purrr_version 609.5790 629.7565 643.8498 644.2505 656.1959 686.8128 50
This shows that the new purrr version is the fastest of the three in this benchmark.
Something like this?
iris$Max.Len <- pmax(iris$Sepal.Length, iris$Petal.Length)
In addition to the great answer provided by @alexwhan, please keep in mind that you need to use ungroup() to avoid side effects. This is because rowwise() is a grouping operation.
iris %>%
rowwise() %>%
mutate(Max.Len = max(Sepal.Length, Petal.Length))
will give you:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3 1.4 0.2 setosa 4.9
3 4.7 3.2 1.3 0.2 setosa 4.7
4 4.6 3.1 1.5 0.2 setosa 4.6
5 5 3.6 1.4 0.2 setosa 5
6 5.4 3.9 1.7 0.4 setosa 5.4
7 4.6 3.4 1.4 0.3 setosa 4.6
8 5 3.4 1.5 0.2 setosa 5
9 4.4 2.9 1.4 0.2 setosa 4.4
10 4.9 3.1 1.5 0.1 setosa 4.9
Now let's assume that you need to continue with the dplyr pipe to add a lead to Max.Len:
iris %>%
rowwise() %>%
mutate(Max.Len = max(Sepal.Length, Petal.Length)) %>%
mutate(Lead.Max.Len = lead(Max.Len))
This will produce:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len Lead.Max.Len
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 5.1 NA
2 4.9 3 1.4 0.2 setosa 4.9 NA
3 4.7 3.2 1.3 0.2 setosa 4.7 NA
4 4.6 3.1 1.5 0.2 setosa 4.6 NA
5 5 3.6 1.4 0.2 setosa 5 NA
6 5.4 3.9 1.7 0.4 setosa 5.4 NA
7 4.6 3.4 1.4 0.3 setosa 4.6 NA
8 5 3.4 1.5 0.2 setosa 5 NA
9 4.4 2.9 1.4 0.2 setosa 4.4 NA
10 4.9 3.1 1.5 0.1 setosa 4.9 NA
The NAs are produced as a side effect: each row is its own group, so lead() never has a following value to look at. This can be corrected with ungroup():
iris %>%
rowwise() %>%
mutate(Max.Len = max(Sepal.Length, Petal.Length)) %>%
ungroup() %>%
mutate(Lead.Max.Len = lead(Max.Len))
This will produce the desired output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len Lead.Max.Len
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 5.1 4.9
2 4.9 3 1.4 0.2 setosa 4.9 4.7
3 4.7 3.2 1.3 0.2 setosa 4.7 4.6
4 4.6 3.1 1.5 0.2 setosa 4.6 5
5 5 3.6 1.4 0.2 setosa 5 5.4
6 5.4 3.9 1.7 0.4 setosa 5.4 4.6
7 4.6 3.4 1.4 0.3 setosa 4.6 5
8 5 3.4 1.5 0.2 setosa 5 4.4
9 4.4 2.9 1.4 0.2 setosa 4.4 4.9
10 4.9 3.1 1.5 0.1 setosa 4.9 5.4
Just for completeness, I am going to adapt the code from a forgotten answer (and maybe the best answer) to the question "Sum across multiple columns" and apply it to your problem:
iris %>%
mutate(max = select(.,c('Sepal.Length','Petal.Length')) %>%
apply(1, max, na.rm=TRUE))
The result is as expected. The accepted answer said that rowwise is increasingly not recommended, and apply is base R, so you don't need to import an extra package like purrr.
You can use the apply() function with max, min, sum, median, or mean, so it's very handy and simple.
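As a sketch of the same idea with other summary functions, using base R only. The column names Row.Mean and Row.Median and the copy iris_stats are illustrative. Note that apply() converts its input to a matrix, so only the numeric columns are passed in.

```r
# Only the numeric measurement columns; apply() coerces its input to a matrix.
num_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

iris_stats <- iris
# MARGIN = 1 means "apply the function to each row".
iris_stats$Row.Mean   <- apply(iris[, num_cols], 1, mean)
iris_stats$Row.Median <- apply(iris[, num_cols], 1, median)

head(iris_stats)
```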