在 R 中按 ID 和特定因子分布创建随机子样本

Question

我正在使用 R 并拥有以下数据集，其中包含从书中取出的句子，并包含有关书籍 ID、封面颜色（颜色）和与相应书籍匹配的句子 ID 的数据。

My dataset
    Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here
    .

我想在以下条件下抽取四个随机子样本（每个包含 25% 的原始数据）：

1）书籍颜色的分布应与原始数据集中的分布相同。 如果有 10% 的蓝皮书，这也应该反映在子样本中

2）子样本不应按行数（即句子ID）而是按“书ID”来获取/拆分。 这意味着如果对 Book ID 4 进行采样，则所有句子 7、8、9、10、11 都应该在样本数据集中。

3) 此外，每个 Book ID 应该只在 4 个子样本之一中 - 这意味着如果我决定合并所有 4 个子样本，我想再次使用原始数据集。

以上述方式拆分我的数据集的最佳解决方案是什么？

Answer 1

这应该有效。 书籍按颜色分组，然后从长度为 4 的下一个倍数的池中抽取 1:4 的数字，以确保均匀分布。 然后将数据框按样本编号拆分。

library(readr)
library(dplyr)
library(tidyr)

books <- read_delim(
'Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here', 
'|', trim_ws = TRUE)

books %>%
  # sampling is done on a book ID level. We group by book 
  # and nest the sentences, to get only one row per book.
  group_by(`Book ID`) %>% 
  nest(book_data = c(`sentence ID`, Sentences)) %>% 

  # We want to split colours evenly. We therefore draw a sample number from 1:4
  # for each group of colours. To ensure an even split, we draw from a 
  # vector that is a repeat of 1:4 until it has a lenght, that is the 
  # first multiple of 4, that is >= the number of colours in a group.
  group_by(Colour) %>%
  mutate(sample = sample(rep_len(1:4, (n() + 3) %/% 4 * 4 ), n(), replace = F)) %>% 

  # Unnest the sentences again.
  unnest(book_data) %>% 

  # Split the data frame into lists by the sample number.
  split(.$sample)

$`1`
# A tibble: 4 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         5 Blue              12 Text goes here      1
2         5 Blue              13 Text goes here      1
3         6 Red               14 Text goes here      1
4         6 Red               15 Text goes here      1

$`2`
# A tibble: 2 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         2 Red                4 Text goes here      2
2         2 Red                5 Text goes here      2

$`3`
# A tibble: 1 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         3 Green              6 Text goes here      3

$`4`
# A tibble: 8 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         1 Blue               1 Text goes here      4
2         1 Blue               2 Text goes here      4
3         1 Blue               3 Text goes here      4
4         4 Orange             7 Text goes here      4
5         4 Orange             8 Text goes here      4
6         4 Orange             9 Text goes here      4
7         4 Orange            10 Text goes here      4
8         4 Orange            11 Text goes here      4

Answer 2

这里是简短的版本：

library(tidyverse)

df <- tribble(
    ~Book_ID, ~sentence_ID, ~Colour, ~Sentences
    ,1      , 1, "Blue", "Text goes here"
    ,1      , 2, "Blue", "Text goes here"
    ,1      , 3, "Blue", "Text goes here"
    ,2      , 4, "Red", "Text goes here"
    ,2      , 5, "Red", "Text goes here"
    ,3      , 6, "Green", "Text goes here"
    ,4      , 7, "Orange", "Text goes here"
    ,4      , 8, "Orange", "Text goes here"
    ,4      , 9, "Orange", "Text goes here"
    ,4      , 10, "Orange", "Text goes here"
    ,4      , 11, "Orange", "Text goes here"
    ,5      , 12, "Blue", "Text goes here"
    ,5      , 13, "Blue", "Text goes here"
    ,6      , 14, "Red", "Text goes here"
    ,6      , 15, "Red", "Text goes here"
)

df %>%
    left_join(
        df %>%
            distinct(Book_ID, Colour) %>%
            group_by(Colour) %>%
            mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))
        , by = c("Book_ID", "Colour"))

这会给你：

# A tibble: 15 x 5
   Book_ID sentence_ID Colour Sentences        sub_sample
     <dbl>       <dbl> <chr>  <chr>                 <int>
 1       1           1 Blue   "Text goes here"          3
 2       1           2 Blue   "Text goes here"          3
 3       1           3 Blue   "Text goes here"          3
 4       2           4 Red    "Text goes here"          1
 5       2           5 Red    "Text goes here"          1
 6       3           6 Green  "Text goes here"          1
 7       4           7 Orange "Text goes here"          2
 8       4           8 Orange "Text goes here"          2
 9       4           9 Orange "Text goes here"          2
10       4          10 Orange "Text goes here"          2
11       4          11 Orange "Text goes here"          2
12       5          12 Blue   "Text goes here"          2
13       5          13 Blue   "Text goes here"          2
14       6          14 Red    "Text goes here"          3
15       6          15 Red    "Text goes here"          3

以及代码的简短说明：

让我们从嵌套部分开始

# take the dataframe
df %>%
    # ...and extract the distinct combinations of book and colour
    distinct(Book_ID, Colour) %>%
    # and now for each colour...
    group_by(Colour) %>%
    # ...provide random numbers from 1 to 4
    mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))

按颜色分组可确保您在每个样本中具有相同的颜色分布。

这个结果现在是我们之前“区分”的两列上的原始 dataframe 的left_join ed - 这确保不会有重复。

一项补充

要在子样本中获得相同的颜色分布，您当然需要为每种颜色提供足够数量的书籍。 因此，例如，只有 20 种不同的绿色书籍可以保证以不同的方式分发。 在这种情况下，您可能希望为采样“分组”颜色。 然而，这是一个统计问题，显然超出了编程论坛的 scope 范围。

在 R 中按 ID 和特定因子分布创建随机子样本

问题描述

2 个解决方案

解决方案1
1 2020-06-24 07:48:26

解决方案2
1 已采纳 2020-06-25 08:00:14

在 R 中按 ID 和特定因子分布创建随机子样本

问题描述

2 个解决方案

解决方案1 1 2020-06-24 07:48:26

解决方案2 1 已采纳 2020-06-25 08:00:14

解决方案1
1 2020-06-24 07:48:26

解决方案2
1 已采纳 2020-06-25 08:00:14