Random sample from a data frame in R

I have the following data frame:

id<-c(1,1,2,3,3)
date<-c("23-01-08","01-11-07","30-11-07","17-12-07","12-12-08")
df<-data.frame(id,date)
df$date2<-as.Date(as.character(df$date), format = "%d-%m-%y")

id     date      date2
1   23-01-08 2008-01-23
1   01-11-07 2007-11-01
2   30-11-07 2007-11-30
3   17-12-07 2007-12-17
3   12-12-08 2008-12-12

Now I want to extract a random sample of IDs rather than of rows. Specifically, I am looking for a way to randomly pick two of the IDs and extract all records related to them. For instance, if IDs 2 and 3 are picked, the output data frame should look like:

id     date      date2
2   30-11-07 2007-11-30
3   17-12-07 2007-12-17
3   12-12-08 2008-12-12

Any help would be appreciated.

You can randomly pick two IDs using sample():

chosen <- sample(unique(df$id), 2)

and then extract those records:

subset(df, id %in% chosen)
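
The two steps above can be wrapped into a small helper. This is only a sketch: the name sample_by_id and its arguments are hypothetical (not from any package), and set.seed() is there just to make the example reproducible.

```r
# Rebuild the example data from the question.
id <- c(1, 1, 2, 3, 3)
date <- c("23-01-08", "01-11-07", "30-11-07", "17-12-07", "12-12-08")
df <- data.frame(id, date)
df$date2 <- as.Date(as.character(df$date), format = "%d-%m-%y")

# Hypothetical helper: draw n distinct IDs, then keep every row for those IDs.
sample_by_id <- function(data, n) {
  ids <- unique(data$id)
  # Index into ids instead of calling sample(ids, n) directly, because
  # sample(x, n) treats a single number x as the range 1:x.
  chosen <- ids[sample.int(length(ids), n)]
  data[data$id %in% chosen, ]
}

set.seed(1)  # only for reproducibility of this example
result <- sample_by_id(df, 2)
result
```

Sampling positions with sample.int() is a defensive choice: it behaves the same even if the data happens to contain a single distinct ID.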

First you have to generate the sample of IDs:

s_ids=sample(unique(df$id),2)

Now that you have them, select the matching records in your df:

new_df=df[df$id %in% s_ids,]

You can use the sample() function.

set.seed(2)
df[df$id %in% sample(unique(df$id),2),]

sample() picks the random IDs, and %in% matches them back to the rows of df so that you get every record for each chosen ID. Use %in% rather than match() here, because match() returns only the first matching row for each ID. For more information, see ?sample.
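
To see how match() and %in% differ when selecting rows by ID, here is a small sketch. It uses fixed IDs instead of a random draw so the difference is easy to see; the variable names first_only and all_rows are illustrative.

```r
# Example data from the question (date2 column omitted for brevity).
id <- c(1, 1, 2, 3, 3)
date <- c("23-01-08", "01-11-07", "30-11-07", "17-12-07", "12-12-08")
df <- data.frame(id, date)

chosen <- c(1, 3)  # fixed IDs, for illustration only

# match() gives the position of the FIRST occurrence of each chosen ID.
first_only <- df[match(chosen, df$id), ]

# %in% flags every row whose ID is among the chosen ones.
all_rows <- df[df$id %in% chosen, ]

nrow(first_only)  # 2: one row per ID
nrow(all_rows)    # 4: all records for IDs 1 and 3
```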

Or using dplyr:

library(dplyr)
df %>% 
    filter(id %in% sample(unique(id),2))
#  id     date      date2
#1  2 30-11-07 2007-11-30
#2  3 17-12-07 2007-12-17
#3  3 12-12-08 2008-12-12

Or:

df %>%
     select(id) %>%
     unique() %>%
     sample_n(2) %>%
     semi_join(df, .)
#  id     date      date2
#1  1 23-01-08 2008-01-23
#2  1 01-11-07 2007-11-01
#3  2 30-11-07 2007-11-30

Using sqldf:

library(sqldf)
a <- sqldf("SELECT DISTINCT id FROM df ORDER BY RANDOM() LIMIT 2")
sqldf("SELECT * FROM df WHERE id IN a")

Output:

  id     date      date2
1  1 23-01-08 2008-01-23
2  1 01-11-07 2007-11-01
3  3 17-12-07 2007-12-17
4  3 12-12-08 2008-12-12
