
Select random sample of one column based on groupby of 3 columns of an R data frame [on hold]

I need to select a random sample of one column of an R data frame, grouping by three other columns. This is somewhat similar to what is discussed here:

Groupby and Sample pandas

and I do not know how to replicate that Python code in R.

My bad, I hadn't posted what I have tried so far. I used the data.table package.

library(data.table)
sample_df <- df[, .SD[sample(x = .N, size = 50)], by = id]

However, I am not sure how to sample one column while using 3 other columns as the groupby.

Added masked sample data

df:

col1    col2    col3    col4
A1       ABC    1234     H
A1       ABC    1234    O2
A1       ABC    1234    N
B1       DEF    7787J   C
B1       DEF    7787J   CA
C1       HIJ    8989    CL

target df:

 col1   col2    col3    col4
 A1     ABC     1234    H or O2 or N
 A1     ABC     1234    H or O2 or N
 B1     DEF     7787J   C
 B1     DEF     7787J   CA
 C1     HIJ     8989    CL
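
For reference, here is a minimal data.table sketch of the grouping asked about above, not a definitive answer. It assumes the masked sample is stored as a data.table and that "sample" means keeping at most a fixed number of col4 values per (col1, col2, col3) group; the cap n_per_group = 2 is an assumption read off the target table, and the names n_per_group and sample_df are only illustrative.

```r
library(data.table)

# Masked sample data from the question
df <- data.table(
  col1 = c("A1", "A1", "A1", "B1", "B1", "C1"),
  col2 = c("ABC", "ABC", "ABC", "DEF", "DEF", "HIJ"),
  col3 = c("1234", "1234", "1234", "7787J", "7787J", "8989"),
  col4 = c("H", "O2", "N", "C", "CA", "CL")
)

# Assumed cap on the number of sampled values per group
n_per_group <- 2L

set.seed(42)  # only to make the draw reproducible

# Within each (col1, col2, col3) group, draw row positions without
# replacement and keep the matching col4 values. min() guards against
# groups that have fewer rows than the cap.
sample_df <- df[, .(col4 = col4[sample.int(.N, min(.N, n_per_group))]),
                by = .(col1, col2, col3)]

print(sample_df)
```

Indexing with sample.int(.N, ...) rather than calling sample(col4, ...) directly avoids the surprise that sample() treats a single number as the range 1:x when a group has one numeric value.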

Base R solution:

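sample_df <- do.call("rbind", lapply(split(df, df$Position), function(x) {
  # Shuffle the rows of each group; sample(x) on a data frame would permute its columns instead
  if (nrow(x) > 1) x[sample(nrow(x)), , drop = FALSE] else x
}))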

Data:

df <- structure(list(
  Name = structure(c(4L, 1L, 2L, 6L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 4L, 1L, 2L, 6L, 3L, 5L, 2L, 6L, 3L, 5L),
                   .Label = c("Bob", "Dave", "Fred", "Jim", "Ray", "Steve"),
                   class = "factor"),
  Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L),
                   .Label = c("2019-10-19", "2019-10-20", "2019-10-21", "2019-10-22"),
                   class = "factor"),
  Load = c(900L, 900L, 900L, 850L, 850L, 850L, 789L, 789L, 789L, 960L,
           960L, 909L, 909L, 909L, 991L, 991L, 991L, 720L, 717L, 717L, 717L),
  Position = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L),
                       .Label = c("Defense", "Forward"), class = "factor")),
  row.names = c(NA, -21L), class = "data.frame")
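
The split-and-rbind pattern also extends to several grouping columns by passing a list of columns to split(). Below is a rough base R sketch on the question's masked sample rather than on the data above; the cap of 2 rows per group and the helper name sample_rows are assumptions for illustration only.

```r
# Masked sample data from the question
df <- data.frame(
  col1 = c("A1", "A1", "A1", "B1", "B1", "C1"),
  col2 = c("ABC", "ABC", "ABC", "DEF", "DEF", "HIJ"),
  col3 = c("1234", "1234", "1234", "7787J", "7787J", "8989"),
  col4 = c("H", "O2", "N", "C", "CA", "CL"),
  stringsAsFactors = FALSE
)

# Assumed cap on the number of rows kept per group
n_per_group <- 2L

# One piece per observed combination of the three grouping columns;
# drop = TRUE discards combinations that never occur in the data
groups <- split(df, list(df$col1, df$col2, df$col3), drop = TRUE)

# Hypothetical helper: keep up to n randomly chosen rows of a group
sample_rows <- function(x, n) {
  x[sample.int(nrow(x), min(nrow(x), n)), , drop = FALSE]
}

set.seed(42)  # only to make the draw reproducible
sample_df <- do.call("rbind", lapply(groups, sample_rows, n = n_per_group))
rownames(sample_df) <- NULL

print(sample_df)
```

Because whole rows are sampled, the three grouping columns come along with the sampled col4 values, which matches the shape of the target table.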
