
Applying data.frame-consuming functions over groups of rows

For example, suppose I have some data.frame df :

df <- read.table(text = "
P    Q    R
c    1   10
a    1    0
a    2    0
b    2    0
b    1   10
c    2   10
b    1    0
a    2   10
",
stringsAsFactors = FALSE,
header = TRUE)

...and some function foo that takes a data.frame as argument.

One can imagine splitting df into smaller data.frames according to the value in one of its columns, say P, and applying foo to each of those smaller data.frames.

Below I show the best I can come up with to solve this problem, but I suspect that more streamlined solutions already exist to perform such a natural operation. If so, my question is: what are they?

NB: I show two use-cases below; the first one of the two is the one that I expect can be improved significantly. As for the second one, I think my solution for it may already be about as good as it'll get; I include this use-case just in case my guess is wrong.


My solution depends on whether foo is a function that I call for its return value, or one that I call only for its side effects.

For the former case ( foo called for its value), suppose that foo is this:

## returns a one-row data.frame corresponding to a random row of
## dataframe
## NB: this is *just an example* for the sake of this question
foo <- function (dataframe) {
    dataframe[sample(nrow(dataframe), 1), ]
}

...then my solution would be this:

set.seed(0)
sapply(unique(df$P), function (value) foo(df[df$P == value, ]),
       simplify = FALSE)
## $c
##   P Q  R
## 6 c 2 10
## 
## $a
##   P Q R
## 2 a 1 0
## 
## $b
##   P Q  R
## 5 b 1 10

For the latter case ( foo called for its side-effect), suppose that foo is this:

## prints to stdout a one-row data.frame corresponding to a random
## row of dataframe
## NB: this is *just an example* for the sake of this question
foo <- function (dataframe) {
    cat(str(dataframe[sample(nrow(dataframe), 1), ]))
}

...then my solution would be this:

set.seed(0)
for (value in unique(df$P)) foo(df[df$P == value, ])
## 'data.frame':    1 obs. of  3 variables:
##  $ P: chr "c"
##  $ Q: int 2
##  $ R: int 10
## 'data.frame':    1 obs. of  3 variables:
##  $ P: chr "a"
##  $ Q: int 1
##  $ R: int 0
## 'data.frame':    1 obs. of  3 variables:
##  $ P: chr "b"
##  $ Q: int 1
##  $ R: int 10

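You can handle both of your use cases with the function by. To replicate your results, however, I change your functions to return or print the last row of each group instead of a randomly selected row. This is necessary because by visits the groups in sorted order of P (a, b, c), whereas looping over unique(df$P) visits them in order of first appearance (c, a, b). The random number generator is therefore consumed in a different sequence, and a randomly selected row would generally not match between the two approaches. (Within each group, by keeps the rows in their original order.) In a real use case this should not matter; it matters here only because your example results depend on the random number generator.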
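The two approaches also visit the groups in a different order, which is visible in the listings that follow. A quick check, as a minimal sketch: df is rebuilt with data.frame() only so the snippet runs on its own, and loop_order and by_order are names introduced here for illustration.

```r
# Same data as in the question, rebuilt so this snippet is self-contained.
df <- data.frame(
  P = c("c", "a", "a", "b", "b", "c", "b", "a"),
  Q = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L),
  R = c(10L, 0L, 0L, 0L, 10L, 10L, 0L, 10L),
  stringsAsFactors = FALSE
)

# Order in which the sapply/for solutions visit the groups:
# order of first appearance in df$P.
loop_order <- unique(df$P)
print(loop_order)   # "c" "a" "b"

# Order in which by() visits the groups: sorted values of P.
by_order <- dimnames(by(df, df$P, nrow))[[1]]
print(by_order)     # "a" "b" "c"
```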

In your first use case:

foo <- function (dataframe) {
  dataframe[nrow(dataframe), ]
}

out1 <- sapply(unique(df$P), function (value) foo(df[df$P == value, ]),
               simplify = FALSE)

The result out1 is a list :

str(out1)  ## this displays the structure of the out1 object
##List of 3
## $ c:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "c"
##  ..$ Q: int 2
##  ..$ R: int 10
## $ a:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "a"
##  ..$ Q: int 2
##  ..$ R: int 10
## $ b:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "b"
##  ..$ Q: int 1
##  ..$ R: int 0

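We can achieve the same result using by, which returns an object of class by. This is a list with a few extra attributes, and its groups are ordered by the sorted values of P rather than by first appearance: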

by.out1 <- with(df, by(df, P, foo))
str(by.out1)
##List of 3
## $ a:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "a"
##  ..$ Q: int 2
##  ..$ R: int 10
## $ b:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "b"
##  ..$ Q: int 1
##  ..$ R: int 0
## $ c:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "c"
##  ..$ Q: int 2
##  ..$ R: int 10
## - attr(*, "dim")= int 3
## - attr(*, "dimnames")=List of 1
##  ..$ P: chr [1:3] "a" "b" "c"
## - attr(*, "call")= language by.data.frame(data = df, INDICES = P, FUN = foo)
## - attr(*, "class")= chr "by"

Here, by is wrapped in with so that the call is evaluated in an environment constructed from df. This lets us refer to the columns of df (such as P) by name, without the df$ prefix.
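The with wrapper is a convenience rather than a requirement. As a minimal sketch (the names by_direct, split_out, and combined are introduced here for illustration), the same grouping can be written by passing df$P directly, or with split and lapply, which returns a plain named list; do.call(rbind, ...) is one way to stack the per-group rows back into a single data.frame.

```r
# Same data as in the question, rebuilt so this snippet is self-contained.
df <- data.frame(
  P = c("c", "a", "a", "b", "b", "c", "b", "a"),
  Q = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L),
  R = c(10L, 0L, 0L, 0L, 10L, 10L, 0L, 10L),
  stringsAsFactors = FALSE
)

# The value-returning foo from this answer: last row of the group.
foo <- function(dataframe) {
  dataframe[nrow(dataframe), ]
}

# by() without with(): pass the grouping vector explicitly.
by_direct <- by(df, df$P, foo)

# split() cuts df into a named list of data.frames (names sorted: a, b, c);
# lapply() then applies foo to each piece and returns a plain list.
split_out <- lapply(split(df, df$P), foo)

# Optionally stack the one-row results into a single data.frame.
combined <- do.call(rbind, split_out)
print(combined)
```

Whether the plain list or the by object is preferable depends on what happens next; the plain list avoids the extra attributes that by attaches.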

For your second use case (which displays to console via cat ):

foo <- function (dataframe) {
  cat(str(dataframe[nrow(dataframe), ]))
}

for (value in unique(df$P)) foo(df[df$P == value, ])
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "c"
## $ Q: int 2
## $ R: int 10
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "a"
## $ Q: int 2
## $ R: int 10
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "b"
## $ Q: int 1
## $ R: int 0

Again, we can achieve the same result with by :

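## invisible() suppresses auto-printing of the returned by object,
## whose elements are all NULL here
invisible(with(df, by(df, P, foo)))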
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "a"
## $ Q: int 2
## $ R: int 10
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "b"
## $ Q: int 1
## $ R: int 0
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "c"
## $ Q: int 2
## $ R: int 10
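For side effects only, a plain loop over split(df, df$P) is another option, since it does not build a result object at all. A minimal sketch, with df rebuilt so the snippet runs on its own and foo simplified to call str directly (str already prints, so the cat wrapper is not needed):

```r
# Same data as in the question, rebuilt so this snippet is self-contained.
df <- data.frame(
  P = c("c", "a", "a", "b", "b", "c", "b", "a"),
  Q = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L),
  R = c(10L, 0L, 0L, 0L, 10L, 10L, 0L, 10L),
  stringsAsFactors = FALSE
)

# Side-effect-only foo: prints the structure of the group's last row.
foo <- function(dataframe) {
  str(dataframe[nrow(dataframe), ])
}

# Iterating over the list returned by split() yields one data.frame
# per group, visited in sorted order of P (a, b, c).
for (group in split(df, df$P)) {
  foo(group)
}
```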

The function by is part of base R. As mentioned by Dave2e, many other packages offer similar data manipulation capabilities. Some provide more syntactic sugar for ease of use, others are better optimized, and some do both. Examples are plyr, dplyr, and data.table. I leave it to you to explore these.
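As a starting point for that exploration, here is a rough sketch of the first use case with dplyr. It assumes dplyr is installed and uses its group_split function; the call is guarded so the snippet still runs without the package, and dplyr_out is a name introduced here for illustration.

```r
# Same data as in the question, rebuilt so this snippet is self-contained.
df <- data.frame(
  P = c("c", "a", "a", "b", "b", "c", "b", "a"),
  Q = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L),
  R = c(10L, 0L, 0L, 0L, 10L, 10L, 0L, 10L),
  stringsAsFactors = FALSE
)

# The value-returning foo from this answer: last row of the group.
foo <- function(dataframe) {
  dataframe[nrow(dataframe), ]
}

# Guarded so the script still runs when dplyr is not installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  # group_split() returns one data frame per value of P.
  groups <- dplyr::group_split(df, P)
  dplyr_out <- lapply(groups, foo)
  print(dplyr_out)
}
```

Unlike split, group_split returns an unnamed list, so the group label has to be read from the P column of each piece.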
