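Create a random subsample by ID and with a certain factor distribution in R
I am using R and have the following dataset. It contains sentences taken from books, together with the book ID, the cover colour (Colour), and a sentence ID matched to the corresponding book.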
My dataset
Book ID| sentence ID| Colour | Sentences
1 | 1 | Blue | Text goes here
1 | 2 | Blue | Text goes here
1 | 3 | Blue | Text goes here
2 | 4 | Red | Text goes here
2 | 5 | Red | Text goes here
3 | 6 | Green | Text goes here
4 | 7 | Orange | Text goes here
4 | 8 | Orange | Text goes here
4 | 9 | Orange | Text goes here
4 | 10 | Orange | Text goes here
4 | 11 | Orange | Text goes here
5 | 12 | Blue | Text goes here
5 | 13 | Blue | Text goes here
6 | 14 | Red | Text goes here
6 | 15 | Red | Text goes here
.
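I want to draw four random subsamples (each containing 25% of the original data) under the following conditions:
1) The distribution of book colours should be the same as in the original dataset. If 10% of the books are blue, the subsamples should reflect that too.
2) The subsamples should not be taken/split by number of rows (i.e. sentence ID) but by "Book ID". This means that if Book ID 4 is sampled, all of sentences 7, 8, 9, 10 and 11 should be in that sample dataset.
3) In addition, each Book ID should appear in only one of the 4 subsamples. This means that if I decide to merge all 4 subsamples, I want to end up with the original dataset again.
What is the best solution for splitting my dataset in the way described above?
This should work. Books are grouped by colour, and then sample numbers from 1:4 are drawn from a pool whose length is the next multiple of 4, to keep the split close to even. The data frame is then split by sample number.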
library(readr)
library(dplyr)
library(tidyr)
books <- read_delim(
'Book ID| sentence ID| Colour | Sentences
1 | 1 | Blue | Text goes here
1 | 2 | Blue | Text goes here
1 | 3 | Blue | Text goes here
2 | 4 | Red | Text goes here
2 | 5 | Red | Text goes here
3 | 6 | Green | Text goes here
4 | 7 | Orange | Text goes here
4 | 8 | Orange | Text goes here
4 | 9 | Orange | Text goes here
4 | 10 | Orange | Text goes here
4 | 11 | Orange | Text goes here
5 | 12 | Blue | Text goes here
5 | 13 | Blue | Text goes here
6 | 14 | Red | Text goes here
6 | 15 | Red | Text goes here',
'|', trim_ws = TRUE)
books %>%
  # Sampling is done at the book ID level. We group by book
  # and nest the sentences to get only one row per book.
  group_by(`Book ID`) %>%
  nest(book_data = c(`sentence ID`, Sentences)) %>%
  # We want to split colours evenly. We therefore draw a sample number from 1:4
  # for each book within a colour group. To keep the split even, we draw from a
  # vector that repeats 1:4 until its length is the first multiple of 4
  # that is >= the number of books in the colour group.
  group_by(Colour) %>%
  mutate(sample = sample(rep_len(1:4, (n() + 3) %/% 4 * 4), n(), replace = FALSE)) %>%
  # Unnest the sentences again.
  unnest(book_data) %>%
  # Split the data frame into a list of data frames by sample number.
  split(.$sample)
$`1`
# A tibble: 4 x 5
# Groups: Colour [2]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 5 Blue 12 Text goes here 1
2 5 Blue 13 Text goes here 1
3 6 Red 14 Text goes here 1
4 6 Red 15 Text goes here 1
$`2`
# A tibble: 2 x 5
# Groups: Colour [1]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 2 Red 4 Text goes here 2
2 2 Red 5 Text goes here 2
$`3`
# A tibble: 1 x 5
# Groups: Colour [1]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 3 Green 6 Text goes here 3
$`4`
# A tibble: 8 x 5
# Groups: Colour [2]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 1 Blue 1 Text goes here 4
2 1 Blue 2 Text goes here 4
3 1 Blue 3 Text goes here 4
4 4 Orange 7 Text goes here 4
5 4 Orange 8 Text goes here 4
6 4 Orange 9 Text goes here 4
7 4 Orange 10 Text goes here 4
8 4 Orange 11 Text goes here 4
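The padded-pool draw used in the answer above can be looked at in isolation. The following is only a minimal sketch: the helper name `balanced_labels` and its `k` argument are hypothetical and do not appear in the original answer, and the seed is arbitrary.

```r
# Hypothetical helper (not part of the original answer): draw n labels from
# 1..k out of a pool that repeats 1..k up to the next multiple of k >= n.
balanced_labels <- function(n, k = 4) {
  pool_size <- (n + k - 1) %/% k * k      # next multiple of k that is >= n
  pool <- rep_len(seq_len(k), pool_size)  # 1, 2, ..., k, 1, 2, ..., k, ...
  sample(pool, n, replace = FALSE)        # draw without replacement
}

set.seed(1)  # arbitrary seed, only for reproducibility
labels <- balanced_labels(10)

# How often each sub-sample number was drawn.
print(table(factor(labels, levels = 1:4)))
```

Because the pool holds each label exactly `ceiling(n / k)` times and is sampled without replacement, no label can be drawn more often than that. The split is therefore close to even, though not necessarily perfectly even when `n` is not a multiple of `k`.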
Here is a shorter version:
library(tidyverse)
df <- tribble(
~Book_ID, ~sentence_ID, ~Colour, ~Sentences
,1 , 1, "Blue", "Text goes here"
,1 , 2, "Blue", "Text goes here"
,1 , 3, "Blue", "Text goes here"
,2 , 4, "Red", "Text goes here"
,2 , 5, "Red", "Text goes here"
,3 , 6, "Green", "Text goes here"
,4 , 7, "Orange", "Text goes here"
,4 , 8, "Orange", "Text goes here"
,4 , 9, "Orange", "Text goes here"
,4 , 10, "Orange", "Text goes here"
,4 , 11, "Orange", "Text goes here"
,5 , 12, "Blue", "Text goes here"
,5 , 13, "Blue", "Text goes here"
,6 , 14, "Red", "Text goes here"
,6 , 15, "Red", "Text goes here"
)
df %>%
left_join(
df %>%
distinct(Book_ID, Colour) %>%
group_by(Colour) %>%
mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))
, by = c("Book_ID", "Colour"))
This gives you:
# A tibble: 15 x 5
Book_ID sentence_ID Colour Sentences sub_sample
<dbl> <dbl> <chr> <chr> <int>
1 1 1 Blue "Text goes here" 3
2 1 2 Blue "Text goes here" 3
3 1 3 Blue "Text goes here" 3
4 2 4 Red "Text goes here" 1
5 2 5 Red "Text goes here" 1
6 3 6 Green "Text goes here" 1
7 4 7 Orange "Text goes here" 2
8 4 8 Orange "Text goes here" 2
9 4 9 Orange "Text goes here" 2
10 4 10 Orange "Text goes here" 2
11 4 11 Orange "Text goes here" 2
12 5 12 Blue "Text goes here" 2
13 5 13 Blue "Text goes here" 2
14 6 14 Red "Text goes here" 3
15 6 15 Red "Text goes here" 3
And a short explanation of the code:
Let's start with the nested part:
# take the dataframe
df %>%
# ...and extract the distinct combinations of book and colour
distinct(Book_ID, Colour) %>%
# and now for each colour...
group_by(Colour) %>%
# ...provide random numbers from 1 to 4
mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))
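Grouping by colour means the sub-sample numbers are drawn separately within each colour, so each sub-sample gets roughly the same colour distribution. Note that with replace = TRUE this holds on average only; an exactly even split is not guaranteed.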
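To see how the colours actually end up distributed, the books can be tabulated per colour and sub-sample. This is a minimal sketch on made-up data: the `books` table, the seed, and the names `assigned` and `colour_counts` are assumptions for illustration only.

```r
library(dplyr)

# Toy book-level table (hypothetical data): 8 blue and 8 red books.
books <- data.frame(
  Book_ID = 1:16,
  Colour  = rep(c("Blue", "Red"), each = 8),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed, only for reproducibility
assigned <- books %>%
  group_by(Colour) %>%
  # same draw as in the answer: one number from 1 to 4 per book, per colour
  mutate(sub_sample = sample.int(4, size = n(), replace = TRUE)) %>%
  ungroup()

# Number of books for each combination of colour and sub-sample.
colour_counts <- assigned %>% count(Colour, sub_sample)
print(colour_counts)
```

If the counts per colour differ a lot between sub-samples, drawing without replacement from a repeated 1:4 vector, as in the first answer, is one way to tighten the split.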
This result is then left_joined to the original dataframe on the two columns we passed to distinct() earlier, which ensures that no rows are duplicated.
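The two properties the join is meant to preserve (no extra rows, and every book in exactly one sub-sample) can be checked directly. The sketch below rebuilds the example data as a plain data frame; the names `assignment`, `result` and `samples_per_book` are assumptions introduced here.

```r
library(dplyr)

# The example data from the question, rebuilt as a plain data frame.
df <- data.frame(
  Book_ID     = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6),
  sentence_ID = 1:15,
  Colour      = c(rep("Blue", 3), rep("Red", 2), "Green",
                  rep("Orange", 5), rep("Blue", 2), rep("Red", 2)),
  Sentences   = "Text goes here",
  stringsAsFactors = FALSE
)

# One row per book, with a sub-sample number drawn within each colour.
assignment <- df %>%
  distinct(Book_ID, Colour) %>%
  group_by(Colour) %>%
  mutate(sub_sample = sample.int(4, size = n(), replace = TRUE)) %>%
  ungroup()

result <- left_join(df, assignment, by = c("Book_ID", "Colour"))

# The join must not add or drop sentences.
stopifnot(nrow(result) == nrow(df))

# Every book must map to exactly one sub-sample.
samples_per_book <- result %>%
  distinct(Book_ID, sub_sample) %>%
  count(Book_ID)
stopifnot(all(samples_per_book$n == 1))
```

Because `assignment` has exactly one row per Book_ID and Colour pair, each sentence matches a single row, so merging the four sub-samples gives back the original rows.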
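One addition
To get the same colour distribution in your sub-samples, you of course need a sufficient number of books of each colour. For example, with only 20 different green books, their distribution across the sub-samples may well differ. In that case you might want to "group" colours for the sampling. However, this is a statistical question and clearly beyond the scope of a programming forum.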
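One possible reading of "grouping" colours is to pool rare colours into a single category before sampling. The sketch below is only an illustration: the data, the threshold `min_books`, the label "Other" and the column `colour_group` are all assumptions, and whether such pooling is statistically appropriate depends on the analysis.

```r
library(dplyr)

# Toy book-level table (hypothetical data): two common and two rare colours.
books <- data.frame(
  Book_ID = 1:12,
  Colour  = c(rep("Blue", 5), rep("Red", 5), "Green", "Orange"),
  stringsAsFactors = FALSE
)

min_books <- 4  # assumed threshold below which a colour counts as rare

books_grouped <- books %>%
  add_count(Colour, name = "n_colour") %>%
  # colours with fewer than min_books books are pooled into "Other"
  mutate(colour_group = if_else(n_colour < min_books, "Other", Colour)) %>%
  select(-n_colour)

set.seed(7)  # arbitrary seed, only for reproducibility
assigned <- books_grouped %>%
  group_by(colour_group) %>%
  # draw sub-sample numbers within the pooled colour groups
  mutate(sub_sample = sample.int(4, size = n(), replace = TRUE)) %>%
  ungroup()

print(assigned)
```

The original Colour column is kept, so the pooled groups only drive the sampling and the real colours remain available afterwards.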