
Sample n random rows per group in a dataframe

From these questions (Random sample of rows from subset of an R dataframe, and Sample random rows in dataframe) I can easily see how to randomly sample (select) n rows from a data frame, or n rows that originate from a specific level of a factor within a data frame.

Here are some sample data:

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

df[sample(nrow(df), 3), ] #samples 3 random rows from df, without replacement.

For example, to sample just 3 random rows from the 'pink' color, using library(kimisc):

library(kimisc)
sample.rows(subset(df, color == "pink"), 3)

or by writing a custom function:

sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]
sample.df(subset(df, color == "pink"), 3)

However, I want to sample 3 (or n) random rows from each level of the factor, i.e. the new data frame would have 12 rows (3 from blue, 3 from red, 3 from yellow, 3 from pink). It's obviously possible to run this several times, create a new data frame for each color, and then bind them together, but I am looking for a simpler solution.

In dplyr 0.3 and later, this works just fine:

df %>% group_by(color) %>% sample_n(size = 3)
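In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). Here is a minimal sketch of the same grouped sample with the newer verb. The requireNamespace() guard and the name `picked` are additions for illustration only, so the snippet is simply skipped if a recent dplyr is not installed:

```r
# Same example data as in the question
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Guard (an addition for this sketch): only run with dplyr >= 1.0.0
if (requireNamespace("dplyr", quietly = TRUE) &&
    utils::packageVersion("dplyr") >= "1.0.0") {
  # Group by color, then draw 3 rows per group without replacement
  picked <- dplyr::slice_sample(dplyr::group_by(df, color), n = 3)
  print(table(picked$color))
}
```

slice_sample() also accepts prop = instead of n = if a fraction of each group is wanted.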

Old versions of dplyr (version <= 0.2)

I set out to answer this using dplyr, assuming that this would work:

df %.% group_by(color) %.% sample_n(size = 3)

But it turns out that in 0.2 the sample_n.grouped_df S3 method exists but isn't registered in the NAMESPACE file, so it's never dispatched. Instead, I had to call the method directly:

df %.% group_by(color) %.% dplyr:::sample_n.grouped_df(size = 3)
Source: local data frame [12 x 3]
Groups: color

            X1         X2  color
8   0.66152710 -0.7767473   blue
1  -0.70293752 -0.2372700   blue
2  -0.46691793 -0.4382669   blue
32 -0.47547565 -1.0179842   pink
31 -0.15254540 -0.6149726   pink
39  0.08135292 -0.2141423   pink
15  0.47721644 -1.5033192    red
16  1.26160230  1.1202527    red
12 -2.18431919  0.2370912    red
24  0.10493757  1.4065835 yellow
21 -0.03950873 -1.1582658 yellow
28 -2.15872261 -1.5499822 yellow

Presumably this will be fixed in a future update.

I would consider my stratified function, which is presently hosted as a GitHub Gist.

Get it with:

library(devtools)  ## To download "stratified"
source_gist("https://gist.github.com/mrdwab/6424112")

And use it with:

stratified(df, "color", 3)

The function has several features that are convenient for stratified sampling. For instance, you can restrict the sample to particular strata "on the fly":

stratified(df, "color", 3, select = list(color = c("blue", "red")))

To give you a sense of what the function does, here are the arguments to stratified:

  • df : The input data.frame.
  • group : A character vector of the column or columns that make up the "strata".
  • size : The desired sample size.
    • If size is a value less than 1, a proportionate sample is taken from each stratum.
    • If size is a single integer of 1 or more, that number of samples is taken from each stratum.
    • If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
  • select : This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
  • replace : For sampling with replacement.
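To make the named-vector form of size concrete, here is a rough base R sketch of the same idea. It is not the source of stratified(); the helper name sample_by_group and its arguments are hypothetical, and it only covers the "different count per stratum" case:

```r
# Hypothetical helper, NOT the stratified() source:
# draw sizes[[g]] rows from each group g named in `sizes`
sample_by_group <- function(data, group, sizes, replace = FALSE) {
  pieces <- split(data, data[[group]])
  # Keep only the strata named in `sizes` (similar in spirit to `select`)
  pieces <- pieces[names(sizes)]
  picked <- Map(function(piece, n) {
    piece[sample.int(nrow(piece), n, replace = replace), , drop = FALSE]
  }, pieces, sizes)
  do.call(rbind, picked)
}

# Same example data as in the question
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# 5 rows from "blue", 2 rows from "red", nothing from the other colors
out <- sample_by_group(df, "color", c(blue = 5, red = 2))
table(out$color)
```

Naming the sizes avoids relying on the order in which split() returns the groups, which is presumably why a named vector is recommended above.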

Here's a solution. We split the data.frame into color groups, then sample 3 rows from each group. This yields a list of data.frames.

df2 <- lapply(split(df, df$color),
   function(subdf) subdf[sample(1:nrow(subdf), 3),]
)

To obtain the desired result, we bind the list of data.frames into one data.frame:

do.call('rbind', df2)
##                    X1          X2  color
## blue.3    -1.22677188  1.25648082   blue
## blue.4    -0.54516686 -1.94342967   blue
## blue.1     0.44647071  0.16283326   blue
## pink.40    0.23520296 -0.40411906   pink
## pink.34    0.02033939 -0.32321309   pink
## pink.33   -1.01790533 -1.22618575   pink
## red.16     1.86545895  1.11691250    red
## red.11     1.35748078 -0.36044728    red
## red.13    -0.02425645  0.85335279    red
## yellow.21  1.96728782 -1.81388110 yellow
## yellow.25 -0.48084967  0.07865186 yellow
## yellow.24 -0.07056236 -0.28514125 yellow
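The split / sample / rbind steps above can be wrapped into a small reusable function. This is only a sketch: the name sample_per_group is hypothetical, and capping the sample at the group size is an added assumption about what one would want for small groups:

```r
# Hypothetical helper wrapping the split / sample / rbind steps
sample_per_group <- function(data, group, n) {
  pieces <- lapply(split(data, data[[group]]), function(subdf) {
    # Cap at the group size so small groups do not raise an error
    k <- min(n, nrow(subdf))
    subdf[sample.int(nrow(subdf), k), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL  # drop the "blue.3"-style row names
  out
}

# Same example data as in the question
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

res <- sample_per_group(df, "color", 3)
table(res$color)
```

Using sample.int(nrow(subdf), k) rather than sample(1:nrow(subdf), k) also avoids building the index vector by hand.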

You can use ave to assign a random ID to each row within its factor level. Then you can select all rows whose random ID falls in a certain range.

rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]

This has the advantage of preserving the original row order and row names, if that's something you are interested in. Plus, you can reuse the rndid vector to create subsets of different sizes fairly easily.

Here is a way, in base R, that allows for multiple groups and sampling with replacement:
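As a short illustration of reusing the same random IDs for samples of different sizes (the names three_each and five_each are made up for this sketch):

```r
# Same example data as in the question
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Random rank 1..10 within each color
rndid <- with(df, ave(X1, color, FUN = function(x) sample.int(length(x))))

three_each <- df[rndid <= 3, ]  # 3 rows per color
five_each  <- df[rndid <= 5, ]  # 5 rows per color

# Row names keep their original increasing order
rownames(three_each)

# The smaller sample is a subset of the larger one
all(rownames(three_each) %in% rownames(five_each))
```

Note that samples built this way are nested rather than independent: every row in the 3-per-group sample also appears in the 5-per-group sample.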

n <- 3
resample <- TRUE
index <- 1:nrow(df)
fun <- function(x) sample(x, n, replace = resample)
a <- aggregate(index, by = list(group = df$color), FUN = fun )

df[c(a$x),]

To add another grouping variable, include it in the 'by' argument to aggregate.
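A sketch of what that might look like with two grouping variables. The shape column is hypothetical and added purely for illustration. The sampling function here also indexes by position, a small change from the version above, so that a group containing a single row index k is not expanded to 1:k by sample():

```r
# Same example data as in the question
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)
# Hypothetical second grouping column, for illustration only
df$shape <- rep(c("circle", "square"), times = 20)

n <- 2
index <- seq_len(nrow(df))
# Sample positions within x, then map back to the row indices
fun <- function(x) x[sample.int(length(x), n, replace = TRUE)]
a <- aggregate(index, by = list(color = df$color, shape = df$shape), FUN = fun)

# a$x is a matrix with one row per group; c() flattens it to row indices
res <- df[c(a$x), ]
table(res$color, res$shape)
```

Because sampling is done with replacement, the same row can appear more than once within a group.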
