Create random groups of 3 rows in R
I'm trying to create as many random groups from a dataset as possible. My data is kind of complicated to explain, so I'll use iris for my example.
In iris, the Species variable contains 3 unique values: setosa, versicolor, and virginica.
I want to randomize and group the dataset into groups of 3 rows, with each group containing unique Species only (e.g. 1 of each Species).
Each group must have a cumsum(Sepal.Width >= 10).
Create a new ID that identifies each group.
So far I've tried using the dplyr functions group_by() and sample_n(), and also split() and sample(), but I can't seem to get the desired result.
I think using split() might be the wrong way to do it. I was trying to make it work along these lines, with no luck:
split(unique(iris), sample(1:nrow(iris) %/% 3))
Try something like this:
# Draw a random sample of rows
N <- nrow(iris)
n <- 50  # sample size
set.seed(123)
si <- iris[sample(N, n), c("Species", "Sepal.Width")]
# The "cumsum": per species, count the sampled rows with Sepal.Width >= lim
lim <- 2.8  # threshold for the conditional sum
Sepal.Width <- sapply(split(si, si$Species), function(x)
  sum(x$Sepal.Width >= lim))
sol <- data.frame(Species = names(Sepal.Width), Sepal.Width)
sol$ID <- seq_len(nrow(sol))
sol
# Species Sepal.Width ID
# setosa setosa 18 1
# versicolor versicolor 8 2
# virginica virginica 14 3
I think I understood the problem. Here's how you could do it using dplyr.
First, load some packages and add a unique ID for each row in the iris data.frame.
library(dplyr)
library(tidyr)
iris = iris %>% mutate(Row.ID=1:n())
Then, let's split the Row.IDs according to species, and get a data.frame with all possible combinations of one row from each species:
iris_split = split(iris$Row.ID, iris$Species)
combinations = do.call(expand.grid, iris_split)
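As a quick sanity check on this step, the sketch below (an illustration, not part of the original answer) rebuilds the same grid in base R and looks at its size. Since iris has 50 rows per species, the grid should hold 50 * 50 * 50 = 125000 candidate groups. The name row_id is made up for this example.

```r
# Rebuild the row IDs without dplyr (row_id is a hypothetical helper name)
row_id <- seq_len(nrow(iris))

# One vector of row IDs per species, then every one-row-per-species combination
iris_split <- split(row_id, iris$Species)
combinations <- do.call(expand.grid, iris_split)

dim(combinations)       # 125000 rows, 3 columns (one column per species)
head(combinations, 3)   # each row is one candidate group of three row IDs
```

The grid grows with the product of the group sizes, so this exhaustive approach is only practical for small datasets.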
Now, it's dplyr and tidyr time. Let's gather those combinations in a variable called tmp, join tmp with the rest of the iris data.frame, and then filter according to the criteria.
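# Reshape to long form: one row per (Group.ID, Row.ID) pair
tmp = combinations %>%
  mutate(Group.ID = 1:n()) %>%
  gather(Var, Row.ID, -Group.ID) %>%
  select(-Var)
# Attach the iris measurements and keep the groups that meet the criterion,
# read here as a total Sepal.Width of at least 10 per group
result = iris %>%
  inner_join(tmp, by = "Row.ID") %>%
  group_by(Group.ID) %>%
  filter(sum(Sepal.Width) >= 10) %>%
  arrange(Group.ID)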
The result data.frame should be what you're looking for.
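Note that result is built from every possible combination, so a single row can appear in many groups. If the goal is instead a random partition in which each row is used at most once, one possible sketch in base R follows. It is an alternative reading of the question, not the answer's method, and it assumes the requirement means a total Sepal.Width of at least 10 per group. The names groups, Width.Sum, and valid are made up for this illustration, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Shuffle the rows within each species and number them 1..50;
# rows that share a number form one group with one row per species
groups <- iris
groups$Group.ID <- ave(seq_len(nrow(groups)), groups$Species,
                       FUN = function(i) sample(length(i)))

# Total Sepal.Width per group, repeated on every row of that group
groups$Width.Sum <- ave(groups$Sepal.Width, groups$Group.ID, FUN = sum)

# Keep only the groups that meet the assumed criterion (sum >= 10)
valid <- groups[groups$Width.Sum >= 10, ]
valid <- valid[order(valid$Group.ID, valid$Species), ]
head(valid)
```

Groups that fail the criterion are simply dropped here; re-drawing with a different seed gives a different random partition.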