dplyr sample_n where n is the value of a grouped variable

I have the following grouped data frame, and I would like to use the function dplyr::sample_n to extract rows from this data frame for each group. I want to use the value of the grouped variable NDG in each group as the number of rows to extract from each group.

> dg.tmp <- structure(list(Gene = c("A4GNT", "ABTB1", "AHSG", "A4GNT", "ABTB1", 
"AHSG", "AADAC", "ABHD14B", "ACVR2B", "AADAC", "ABHD14B", "ACVR2B"
), GLB = c(3, 3, 3, 3, 3, 3, 10, 10, 10, 10, 10, 10), NDG = c(1, 
1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2)), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -12L), .Names = c("Gene", "GLB", 
"NDG"))

> dg <- dg.tmp %>% 
     dplyr::group_by(GLB,NDG)

> dg
Source: local data frame [12 x 3]
Groups: GLB, NDG

      Gene GLB NDG
1    A4GNT   3   1
2    ABTB1   3   1
3     AHSG   3   1
4    A4GNT   3   2
5    ABTB1   3   2
6     AHSG   3   2
7    AADAC  10   1
8  ABHD14B  10   1
9   ACVR2B  10   1
10   AADAC  10   2
11 ABHD14B  10   2
12  ACVR2B  10   2

For example, assuming the correct random selection, I want the code

> dg %>% dplyr::sample_n(NDG)

to output:

Source: local data frame [6 x 3]
Groups: GLB, NDG

      Gene GLB NDG
1    A4GNT   3   1
2    A4GNT   3   2
3    ABTB1   3   2
4    AADAC  10   1
5    AADAC  10   2
6  ABHD14B  10   2

However, it gives the following error:

Error in eval(expr, envir, enclos) : object 'NDG' not found

By way of comparison, dplyr::slice gives the correct output when I use the code

> dg %>% dplyr::slice(1:unique(NDG))

Using unique in this context is slightly hackish. However, the code

> dg %>% dplyr::slice(1:NDG)

returns the following warning messages

Warning messages:
1: In slice_impl(.data, dots) :
  numerical expression has 3 elements: only the first used
2: In slice_impl(.data, dots) :
  numerical expression has 3 elements: only the first used
3: In slice_impl(.data, dots) :
  numerical expression has 3 elements: only the first used
4: In slice_impl(.data, dots) :
  numerical expression has 3 elements: only the first used

clearly because NDG is being evaluated (in the appropriate environment) as c(1,1,1) or c(2,2,2), and hence 1:NDG returns the above warning.
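To illustrate the point outside dplyr, here is a minimal base R sketch. The vector `ndg` is a hypothetical stand-in for what NDG looks like inside a single group; it is not part of the original data.

```r
# Hypothetical stand-in for NDG as seen inside one group (three identical values)
ndg <- c(2, 2, 2)

# 1:ndg would use only ndg[1] and warn that the expression has 3 elements.
# Reducing the vector to a single value first avoids the warning:
idx_first  <- seq_len(ndg[1])        # take the first element explicitly
idx_unique <- seq_len(unique(ndg))   # same result, valid only if all values are equal

idx_first
idx_unique
```

Both forms give the same indices here. `unique(ndg)` only collapses to a single number because NDG is constant within a group, which is why taking the first element is the less fragile choice.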


Regarding why I obtain the error, I know that the code Hadley uses for the method sample_n.grouped_df is

sample_n.grouped_df <- function(tbl, size, replace = FALSE, weight = NULL,
  .env = parent.frame()) {

  assert_that(is.numeric(size), length(size) == 1, size >= 0)
  weight <- substitute(weight)

  index <- attr(tbl, "indices")
  sampled <- lapply(index, sample_group, frac = FALSE,
    tbl = tbl, size = size, replace = replace, weight = weight, .env = .env)
  idx <- unlist(sampled) + 1

  grouped_df(tbl[idx, , drop = FALSE], vars = groups(tbl))
}

which can be found on the relevant GitHub page. Thus I obtain the error because sample_n.grouped_df cannot find the variable NDG: size is evaluated as an ordinary argument in the calling environment, not within the data frame, so it's not looking in the correct environment.

Consequently, is there a neat way of using sample_n on dg to obtain

Source: local data frame [6 x 3]
Groups: GLB, NDG

      Gene GLB NDG
1    A4GNT   3   1
2    A4GNT   3   2
3    ABTB1   3   2
4    AADAC  10   1
5    AADAC  10   2
6  ABHD14B  10   2

by using random sampling on each group?

One possible answer, but I'm not convinced it's the optimal answer: permute the rows of the data frame with dplyr::sample_frac (and a fraction of 1), then slice the required number of rows:

> set.seed(1)
> dg %>% 
      dplyr::sample_frac(1) %>%
      dplyr::slice(1:unique(NDG))

This gives the correct output.

Source: local data frame [6 x 3]
Groups: GLB, NDG

    Gene GLB NDG
1  A4GNT   3   1
2   AHSG   3   2
3  A4GNT   3   2
4 ACVR2B  10   1
5  AADAC  10   2
6 ACVR2B  10   2

And I suppose I can just write a function to do this in one line if necessary.
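A minimal sketch of such a wrapper, assuming a dplyr version that supports the `{{ }}` operator. The name `sample_n_by` and the example data `dg_example` are hypothetical and not part of dplyr or the original question. It uses `filter(row_number() <= ...)` rather than `slice(1:unique(...))` so that it does not rely on `unique` returning a single value.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): shuffle the rows within each group,
# then keep as many rows per group as the value of `size_col` in that group.
sample_n_by <- function(grouped_tbl, size_col) {
  grouped_tbl %>%
    sample_frac(1) %>%                       # random permutation within each group
    filter(row_number() <= {{ size_col }})   # keep the first n rows of each group
}

# Made-up example data with the same shape as dg
dg_example <- tibble(
  Gene = rep(c("g1", "g2", "g3"), times = 4),
  GLB  = rep(c(3, 10), each = 6),
  NDG  = rep(rep(c(1, 2), each = 3), times = 2)
) %>%
  group_by(GLB, NDG)

set.seed(1)
sampled <- sample_n_by(dg_example, NDG)
sampled
```

The result should contain one row for each group with NDG equal to 1 and two rows for each group with NDG equal to 2, so six rows in total.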

Here's an alternative answer, although the one above seems fine:

dg %>% 
  sample_frac(1) %>%
  filter(row_number() <= NDG) %>%
  arrange(NDG)

Source: local data frame [6 x 3]
Groups: GLB, NDG

     Gene GLB NDG
1    AHSG   3   1
2   ABTB1   3   2
3    AHSG   3   2
4 ABHD14B  10   1
5   AADAC  10   2
6 ABHD14B  10   2

sample_frac(1) shuffles the rows within each group. row_number() then numbers the shuffled rows within each group, so the filter keeps the first NDG rows of each group. The arrange does nothing except reorder the result so that it looks like your desired output.

I ran into the same problem using grouped dfs, and remembered there's a very elegant way to do this in purrr, as outlined in this very helpful tutorial:

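library(dplyr)  # %>%, mutate, sample_n
library(tidyr)  # nest, unnest
library(purrr)  # map2

dg.tmp %>% 
  # one row per (GLB, NDG) pair, remaining columns nested in `data` (tidyr >= 1.0.0 syntax)
  nest(data = -c(GLB, NDG)) %>% 
  # sample NDG rows from each nested data frame
  mutate(data = map2(data, NDG, sample_n)) %>% 
  unnest(data)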

One advantage is that it doesn't require the permutation of ALL rows of data as with sample_frac, which could be quite costly with a large dataframe.
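The same idea of drawing only the rows that are needed can also be sketched without nesting, by sampling row indices inside slice. This is a sketch, assuming a dplyr version where n() and first() can be used inside slice() on a grouped data frame; `dg_example` is made-up data with the same shape as dg.

```r
library(dplyr)

# Made-up example data with the same shape as dg
dg_example <- tibble(
  Gene = rep(c("g1", "g2", "g3"), times = 4),
  GLB  = rep(c(3, 10), each = 6),
  NDG  = rep(rep(c(1, 2), each = 3), times = 2)
) %>%
  group_by(GLB, NDG)

set.seed(1)
sampled_direct <- dg_example %>%
  # within each group, draw first(NDG) distinct row positions out of n() rows
  slice(sample.int(n(), first(NDG)))

sampled_direct
```

Here sample.int draws without replacement, so this fails if NDG exceeds the number of rows in a group, just as sample_n would with replace = FALSE.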
