Sample n random rows per group in a dataframe
From these questions ("Random sample of rows from subset of an R dataframe" and "Sample random rows in dataframe") I can easily see how to randomly sample (select) 'n' rows from a df, or 'n' rows that originate from a specific level of a factor within a df.
Here are some sample data:
df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each=10)
df[sample(nrow(df), 3), ] #samples 3 random rows from df, without replacement.
To sample, e.g., just 3 random rows from the 'pink' color using library(kimisc):
library(kimisc)
sample.rows(subset(df, color == "pink"), 3)
or by writing a custom function:
sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]
sample.df(subset(df, color == "pink"), 3)
However, I want to sample 3 (or n) random rows from each level of the factor, i.e. the new df would have 12 rows (3 from blue, 3 from red, 3 from yellow, 3 from pink). It's obviously possible to run this several times, create a new df for each color, and then bind them together, but I am looking for a simpler solution.
In dplyr versions 0.3 and later, this works just fine:
df %>% group_by(color) %>% sample_n(size = 3)
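As a side note not taken from the original answer: in dplyr 1.0.0 and later, sample_n() is marked as superseded in favour of slice_sample(). A minimal sketch of the same grouped sample, assuming a reasonably recent dplyr is installed (the seed and the name `sampled` are only there for illustration):

```r
library(dplyr)

set.seed(42)  # illustrative only, makes the draw repeatable
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# slice_sample() respects the grouping created by group_by(),
# so n = 3 means "3 rows per color".
sampled <- df %>%
  group_by(color) %>%
  slice_sample(n = 3) %>%
  ungroup()
```

Unlike sample_n(), slice_sample() without replacement returns all rows of a group that has fewer than n rows instead of raising an error.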
Older versions of dplyr (version <= 0.2)
I set out to answer this using dplyr, assuming that this would work:
df %.% group_by(color) %.% sample_n(size = 3)
But it turns out that in 0.2 the sample_n.grouped_df S3 method exists but isn't registered in the NAMESPACE file, so it's never dispatched. Instead, I had to do this:
df %.% group_by(color) %.% dplyr:::sample_n.grouped_df(size = 3)
Source: local data frame [12 x 3]
Groups: color
X1 X2 color
8 0.66152710 -0.7767473 blue
1 -0.70293752 -0.2372700 blue
2 -0.46691793 -0.4382669 blue
32 -0.47547565 -1.0179842 pink
31 -0.15254540 -0.6149726 pink
39 0.08135292 -0.2141423 pink
15 0.47721644 -1.5033192 red
16 1.26160230 1.1202527 red
12 -2.18431919 0.2370912 red
24 0.10493757 1.4065835 yellow
21 -0.03950873 -1.1582658 yellow
28 -2.15872261 -1.5499822 yellow
Presumably this will be fixed in a future update.
I would consider my stratified function, which is presently hosted as a GitHub Gist.
Get it with:
library(devtools) ## To download "stratified"
source_gist("https://gist.github.com/mrdwab/6424112")
And use it with:
stratified(df, "color", 3)
There are several different features that are convenient for stratified sampling. For instance, you can also restrict the sample to certain groups "on the fly":
stratified(df, "color", 3, select = list(color = c("blue", "red")))
To give you a sense of what the function does, here are the arguments to stratified:
df: The input data.frame.
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
    If size is a value less than 1, a proportionate sample is taken from each stratum.
    If size is a single integer of 1 or more, that number of samples is taken from each stratum.
    If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
Here's a solution. We split the data.frame into color groups, then sample 3 rows from each group. This yields a list of data.frames.
df2 <- lapply(split(df, df$color),
function(subdf) subdf[sample(1:nrow(subdf), 3),]
)
To obtain the desired result, we merge the list of data.frames into one data.frame:
do.call('rbind', df2)
## X1 X2 color
## blue.3 -1.22677188 1.25648082 blue
## blue.4 -0.54516686 -1.94342967 blue
## blue.1 0.44647071 0.16283326 blue
## pink.40 0.23520296 -0.40411906 pink
## pink.34 0.02033939 -0.32321309 pink
## pink.33 -1.01790533 -1.22618575 pink
## red.16 1.86545895 1.11691250 red
## red.11 1.35748078 -0.36044728 red
## red.13 -0.02425645 0.85335279 red
## yellow.21 1.96728782 -1.81388110 yellow
## yellow.25 -0.48084967 0.07865186 yellow
## yellow.24 -0.07056236 -0.28514125 yellow
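The split, sample and rbind steps above can be wrapped in one function. This is only a sketch: the name sample_by_group and its arguments are hypothetical, and the cap on groups smaller than n is an added assumption rather than part of the original answer.

```r
# Hypothetical helper: sample up to n rows from each level of `group`.
sample_by_group <- function(df, group, n, replace = FALSE) {
  # drop = TRUE skips unused factor levels, which would otherwise be empty pieces
  pieces <- lapply(split(df, df[[group]], drop = TRUE), function(subdf) {
    # Without replacement we cannot draw more rows than the group has
    k <- if (replace) n else min(n, nrow(subdf))
    subdf[sample.int(nrow(subdf), k, replace = replace), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

set.seed(123)  # illustrative only
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)
res <- sample_by_group(df, "color", 3)
```

Using sample.int(nrow(subdf), k) rather than sample(1:nrow(subdf), k) states directly that row positions are being drawn.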
You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.
rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]
This has the advantage of preserving the original row order and row names, if that's something you are interested in. Plus, you can re-use the rndid vector to create subsets of different sizes fairly easily.
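To illustrate the re-use point, here is a small sketch that builds rndid once and takes two subsets of different sizes from it. The seed and the names three_per_group and five_per_group are illustrative assumptions.

```r
set.seed(1)  # illustrative only
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Within each color, every row gets a distinct random rank 1..(group size).
# X1 only serves as a template of the right length; its values are not used.
rndid <- with(df, ave(X1, color, FUN = function(x) sample.int(length(x))))

three_per_group <- df[rndid <= 3, ]  # 3 rows per color
five_per_group  <- df[rndid <= 5, ]  # 5 rows per color, includes the 3 above
```

Because both subsets come from the same ranks, the smaller sample is always nested inside the larger one.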
Here is a way, in base R, that allows for multiple groups and sampling with replacement:
n <- 3
resample <- TRUE
index <- 1:nrow(df)
fun <- function(x) sample(x, n, replace = resample)
a <- aggregate(index, by = list(group = df$color), FUN = fun )
df[c(a$x),]
To add another group, include it in the 'by' argument to aggregate.
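As a sketch of what adding a second group might look like, assuming a hypothetical extra column shape that is not part of the original data:

```r
set.seed(7)  # illustrative only
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)
df$shape <- rep(c("circle", "square"), times = 20)  # hypothetical second group

n <- 3
index <- seq_len(nrow(df))
# Caveat: sample(x, n) treats a length-one x as 1:x, so this assumes every
# color/shape combination has at least two rows (here each has five).
fun <- function(x) sample(x, n, replace = TRUE)

a <- aggregate(index, by = list(color = df$color, shape = df$shape), FUN = fun)
# a$x is a matrix with one row per group; c() flattens it to row indices.
res <- df[c(a$x), ]
```

With 4 colors and 2 shapes this yields 8 groups of 3 rows each. Because sampling is with replacement, the same row can appear more than once within a group.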