Display a random sample without subsetting from the main dataframe

I have a final dataset of roughly 150,000 rows by 40 columns that covers all my potential samples from 1932 to 2016, and I need to make a random selection of 53 samples per year, for a total of ~5000.

The selection itself is straightforward: the sample() function gives me a subset. However, I need to display the selection in the original dataframe to be able to check various things. My issue is the following:

If I edit one of the fields in my random subset and merge it back with the main one, it creates duplicates that I can't remove, because one field changed and R therefore no longer considers the two rows duplicates. If I don't edit anything, I can't find which rows were selected.

My workaround so far has been to merge everything in Excel instead of R, apply color codes to highlight the selected rows, and delete the duplicates manually. However, this is time-consuming, error-prone, and impractical: the dataset seems to be too big, and my PC quickly runs out of memory when I try.

UPDATE:

Here's a reproducible example:

dat <- data.frame(
  X = sample(2000:2016, 50, replace=TRUE),
  Y = sample(c("yes", "no"), 50, replace = TRUE),
  Z = sample(c("french","german","english"), 50, replace=TRUE)
)

dat2 <- subset(dat, X == 2000)          # rows from year 2000
sc <- dat2[sample(nrow(dat2), 1), ]     # random selection of 1 row

What I would like to do is select directly in the dataset (dat), for example by randomly assigning the value "1" in a column called "selection". Or, if that is not possible, how can I merge the sampled rows (here called "sc") back into the main dataset with something indicating that they have been sampled?

Note:

I've been using R sporadically for the last 2 years and I'm a fairly inexperienced user, so I apologize if this is a silly question. I've been roaming Google and SO for the last 3 days and haven't found a relevant answer yet.

I recently got into a PhD program in biology that requires me to handle a lot of data from an archive.

EDIT: updated based on comments.

You could add a column that indicates whether a row is part of your sample. So maybe try the following:

library(dplyr)

df <- data.frame(
  year = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  id   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  age  = c(7, 7, 7, 12, 12, 12, 7, 7, 7, 12, 12, 12)
)

n_per_year_low_age  <- 2   # rows to draw per year where age < 8
n_per_year_high_age <- 1   # rows to draw per year where age > 8

# Within each year, flag the sampled ids with 1 and all other rows with 0
df <- df %>%
  group_by(year) %>%
  mutate(in_sample1 = as.numeric(id %in% sample(id[age < 8], n_per_year_low_age))) %>%
  mutate(in_sample2 = as.numeric(id %in% sample(id[age > 8], n_per_year_high_age))) %>%
  mutate(in_sample = in_sample1 + in_sample2) %>%
  select(-in_sample1, -in_sample2)

Output:

# A tibble: 12 x 4
# Groups: year [2]
    year    id   age in_sample
   <dbl> <dbl> <dbl>     <dbl>
 1  1.00  1.00  7.00      1.00
 2  1.00  2.00  7.00      1.00
 3  1.00  3.00  7.00      0   
 4  1.00  4.00 12.0       1.00
 5  1.00  5.00 12.0       0   
 6  1.00  6.00 12.0       0   
 7  2.00  7.00  7.00      1.00
 8  2.00  8.00  7.00      0   
 9  2.00  9.00  7.00      1.00
10  2.00 10.0  12.0       0   
11  2.00 11.0  12.0       0   
12  2.00 12.0  12.0       1.00
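
Because the flags come from sample(), the rows marked 1 will generally differ from the output above on each run. If a repeatable selection is needed, one option is to call set.seed() before sampling. A minimal sketch follows; the seed value and the variable names are arbitrary choices for illustration, not part of the original answer:

```r
set.seed(2016)                  # arbitrary seed, chosen only for illustration
first_draw <- sample(1:12, 3)   # draw 3 ids out of 12

set.seed(2016)                  # resetting the same seed replays the same draw
second_draw <- sample(1:12, 3)

identical(first_draw, second_draw)  # TRUE
```

Setting the seed once at the top of the script, before the group_by() pipeline, should be enough to keep the in_sample column stable between runs.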

Further operations are then trivial:

# extracting your sample
df %>% filter(in_sample==1)
# comparing statistics of your sample against the rest of the population
df %>% group_by(year,in_sample) %>% summarize(mean(id))
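
The same idea can also be sketched in base R against the `dat` example from the question, without dplyr. This is only one possible approach: the column name `selection` follows the question's wording, while `n_per_year`, `rows_by_year` and `picked` are names made up for this sketch, and the seed is there only to make it repeatable.

```r
set.seed(42)  # illustrative seed so the sketch is repeatable

# Example data in the shape given in the question
dat <- data.frame(
  X = sample(2000:2016, 50, replace = TRUE),
  Y = sample(c("yes", "no"), 50, replace = TRUE),
  Z = sample(c("french", "german", "english"), 50, replace = TRUE)
)

n_per_year <- 1  # hypothetical sample size per year (53 in the real dataset)

# Group the row numbers by year, then draw up to n_per_year of them per group
rows_by_year <- split(seq_len(nrow(dat)), dat$X)
picked <- unlist(lapply(rows_by_year, function(idx) {
  # Indexing via sample.int() avoids the sample(x) pitfall when idx has length 1
  idx[sample.int(length(idx), min(n_per_year, length(idx)))]
}))

# Flag the picked rows in place; nothing is subset or merged back
dat$selection <- 0
dat$selection[picked] <- 1

# Count of flagged (1) and unflagged (0) rows per year
table(dat$X, dat$selection)
```

Since the flag lives in the original dataframe, editing other fields afterwards cannot create duplicates, and `dat[dat$selection == 1, ]` recovers the sample whenever it is needed. Years with fewer rows than `n_per_year` simply contribute all of their rows.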
