I have the following grouped data frame, and I would like to use the function dplyr::sample_n
to extract rows from this data frame for each group. I want to use the value of the grouped variable NDG
in each group as the number of rows to extract from each group.
> dg.tmp <- structure(list(Gene = c("CAMK1", "GHRL", "TIMP4", "CAMK1", "GHRL",
"TIMP4", "ARL8B", "ARPC4", "SEC13", "ARL8B", "ARPC4", "SEC13"
), GLB = c(3, 3, 3, 3, 3, 3, 10, 10, 10, 10, 10, 10), NDG = c(1,
1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -12L), .Names = c("Gene", "GLB",
"NDG"))
> dg <- dg.tmp %>%
    dplyr::group_by(GLB, NDG)
> dg
Source: local data frame [12 x 3]
Groups: GLB, NDG
      Gene GLB NDG
1    A4GNT   3   1
2    ABTB1   3   1
3     AHSG   3   1
4    A4GNT   3   2
5    ABTB1   3   2
6     AHSG   3   2
7    AADAC  10   1
8  ABHD14B  10   1
9   ACVR2B  10   1
10   AADAC  10   2
11 ABHD14B  10   2
12  ACVR2B  10   2
For example, assuming the correct random selection, I want the code
> dg %>% dplyr::sample_n(NDG)
to output:
Source: local data frame [6 x 3]
Groups: GLB, NDG
     Gene GLB NDG
1   A4GNT   3   1
2   A4GNT   3   2
3   ABTB1   3   2
4   AADAC  10   1
5   AADAC  10   2
6 ABHD14B  10   2
However, it gives the following error:
Error in eval(expr, envir, enclos) : object 'NDG' not found
By way of comparison, dplyr::slice gives the correct output when I use the code
> dg %>% dplyr::slice(1:unique(NDG))
Using unique in this context is slightly hackish; however, the code
> dg %>% dplyr::slice(1:NDG)
returns the following warning messages
Warning messages:
1: In slice_impl(.data, dots) :
numerical expression has 3 elements: only the first used
2: In slice_impl(.data, dots) :
numerical expression has 3 elements: only the first used
3: In slice_impl(.data, dots) :
numerical expression has 3 elements: only the first used
4: In slice_impl(.data, dots) :
numerical expression has 3 elements: only the first used
clearly because NDG is evaluated (in the appropriate environment) as c(1,1,1) or c(2,2,2), so 1:NDG triggers the warning above.
Regarding why I obtain the error, I know that the code Hadley uses for the method sample_n.grouped_df is
sample_n.grouped_df <- function(tbl, size, replace = FALSE, weight = NULL,
                                .env = parent.frame()) {
  assert_that(is.numeric(size), length(size) == 1, size >= 0)
  weight <- substitute(weight)
  index <- attr(tbl, "indices")
  sampled <- lapply(index, sample_group, frac = FALSE,
    tbl = tbl, size = size, replace = replace, weight = weight, .env = .env)
  idx <- unlist(sampled) + 1
  grouped_df(tbl[idx, , drop = FALSE], vars = groups(tbl))
}
which can be found on the relevant GitHub page. Thus I obtain the error because sample_n.grouped_df cannot find the variable NDG: size is evaluated as an ordinary argument (in the calling environment), not within the data frame.
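The effect can be reproduced in base R without dplyr. The two helpers below (eager_n and lazy_n) are hypothetical toy functions written for this illustration, not dplyr code; the sketch also assumes that no object called NDG exists in the calling environment.

```r
# Toy data frame standing in for one group of the example data
toy <- data.frame(Gene = c("CAMK1", "GHRL", "TIMP4"), NDG = c(2, 2, 2))

# Hypothetical helper (not dplyr code): `size` is an ordinary argument,
# so it is evaluated in the caller's environment as soon as it is used
eager_n <- function(tbl, size) {
  if (!is.numeric(size) || length(size) != 1) stop("size must be a single number")
  tbl[seq_len(size), , drop = FALSE]
}

# NDG is a column of `toy`, not a variable the caller can see, so the lookup fails
err_msg <- tryCatch(eager_n(toy, NDG), error = function(e) conditionMessage(e))

# Hypothetical counterpart that evaluates `size` inside the data frame first
lazy_n <- function(tbl, size) {
  size <- unique(eval(substitute(size), tbl, parent.frame()))
  tbl[seq_len(size), , drop = FALSE]
}
lazy_res <- lazy_n(toy, NDG)  # NDG resolves to the column, giving the first two rows
```

Here err_msg holds the "object not found" message, while lazy_n succeeds because substitute() captures the unevaluated expression and eval() looks it up in the data frame.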
Consequently, is there a neat way of using sample_n on dg to obtain
Source: local data frame [6 x 3]
Groups: GLB, NDG
     Gene GLB NDG
1   A4GNT   3   1
2   A4GNT   3   2
3   ABTB1   3   2
4   AADAC  10   1
5   AADAC  10   2
6 ABHD14B  10   2
by using random sampling on each group?
One possible answer, though I'm not convinced it's optimal: permute the rows of the data frame with dplyr::sample_frac (and a fraction of 1), then slice the required number of rows:
> set.seed(1)
> dg %>%
    dplyr::sample_frac(1) %>%
    dplyr::slice(1:unique(NDG))
This gives the correct output.
Source: local data frame [6 x 3]
Groups: GLB, NDG
    Gene GLB NDG
1  A4GNT   3   1
2   AHSG   3   2
3  A4GNT   3   2
4 ACVR2B  10   1
5  AADAC  10   2
6 ACVR2B  10   2
And I suppose I can just write a function to do this in one line if necessary.
Here's an alternative answer, although the one above seems fine:
dg %>%
  sample_frac(1) %>%
  filter(row_number() <= NDG) %>%
  arrange(NDG)
Source: local data frame [6 x 3]
Groups: GLB, NDG
     Gene GLB NDG
1    AHSG   3   1
2   ABTB1   3   2
3    AHSG   3   2
4 ABHD14B  10   1
5   AADAC  10   2
6 ABHD14B  10   2
The sample_frac call shuffles the rows within each group, so row_number() then counts rows in the shuffled order, and you just keep the first NDG rows of each group. The arrange does nothing but reorder the data to match your desired output.
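A quick check that each group really ends up with NDG rows might look like the sketch below. It assumes dplyr is installed; dg.demo, picked and check are names introduced here for illustration, with dg.demo a rebuilt copy of the example data.

```r
library(dplyr)
set.seed(1)

# Rebuild the example data; dg.demo is a hypothetical name used only in this sketch
dg.demo <- data.frame(
  Gene = c(rep(c("CAMK1", "GHRL", "TIMP4"), 2), rep(c("ARL8B", "ARPC4", "SEC13"), 2)),
  GLB  = rep(c(3, 10), each = 6),
  NDG  = rep(rep(c(1, 2), each = 3), 2)
) %>%
  group_by(GLB, NDG)

# Shuffle within groups, then keep the first NDG rows of each shuffled group
picked <- dg.demo %>%
  sample_frac(1) %>%
  filter(row_number() <= NDG)

# Count the rows kept per group; `rows` should match NDG in every group
check <- picked %>% summarise(rows = n())
```

Which genes are picked depends on the seed and the dplyr version, so only the per-group counts are compared here.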
I ran into the same problem using grouped data frames, and remembered there's a very elegant way to do this in purrr, as outlined in this very helpful tutorial:
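library(dplyr)   # %>%, mutate, sample_n
library(tidyr)   # nest, unnest
library(purrr)   # map2
dg.tmp %>%
  # one row per (GLB, NDG); newer tidyr spells this nest(data = -c(GLB, NDG))
  nest(-GLB, -NDG) %>%
  mutate(data = map2(data, NDG, sample_n)) %>%   # draw NDG rows from each nested data frame
  unnest(data)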
One advantage is that it doesn't require permuting ALL rows of the data, as sample_frac does, which could be quite costly with a large data frame.
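The same per-group idea can probably be written without nesting in more recent dplyr releases. The sketch below assumes a dplyr version that provides group_modify() and slice_sample() (1.0.0 or later, as far as I know); dg.demo and sampled are names introduced here for illustration.

```r
library(dplyr)
set.seed(1)

# Rebuild the example data; dg.demo is a hypothetical name used only in this sketch
dg.demo <- data.frame(
  Gene = c(rep(c("CAMK1", "GHRL", "TIMP4"), 2), rep(c("ARL8B", "ARPC4", "SEC13"), 2)),
  GLB  = rep(c(3, 10), each = 6),
  NDG  = rep(rep(c(1, 2), each = 3), 2)
) %>%
  group_by(GLB, NDG)

# Inside group_modify(), .x holds the rows of one group and .y is a one-row
# tibble of that group's keys, so .y$NDG is the number of rows to draw
sampled <- dg.demo %>%
  group_modify(~ slice_sample(.x, n = .y$NDG))
```

As with the purrr version, only the requested number of rows is drawn from each group rather than shuffling every row first.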