简体   繁体   English

视情况拆分dataframe

[英]Splitting dataframe depending on conditions

Let's say you have the following data frame假设您有以下数据框

set.seed(12345)
people <- data.frame(Name = paste("Name", 1:51),
                     Var1 = sample(c("A", "B"), 51, prob = c(0.3, 0.7), replace = TRUE),
                     Var2 = sample(1:2, 51, replace = TRUE))
table(people$Var1, people$Var2)

     1  2
  A 12  5
  B 21 13

I would like to split the dataset into groups depending on certain criteria.我想根据某些标准将数据集分成几组。

For example, I might want to divide the dataset into 9 groups, so that each one has at least 1 person with Var1 == 'A' and a roughly equal balance between 1 and 2 for Var2 .例如,我可能想将数据集分成 9 组,这样每个组中至少有 1 个人的Var1 == 'A'并且Var2的 1 和 2 之间的平衡大致相等。

Obviously, an exact split is not possible so, in this example, I would allocate 5 people to each group and then allocate the rest randomly, in order to have either 5 or 6 people in each group.显然,不可能进行精确的拆分,因此,在此示例中,我将为每个组分配 5 人,然后随机分配 rest,以便每组有 5 人或 6 人。

Is there an efficient way of doing this?有没有一种有效的方法来做到这一点?

PS: I am asking how to do this in R, as I already have these data in R, but a generic solution would be appreciated as well PS:我在问如何在 R 中执行此操作,因为我已经在 R 中有这些数据,但也可以使用通用解决方案

Update更新

Here is an update which may fit your goal, where an algorithm like water-filling is applied to sample rows dynamically according to updated groups.这是一个可能符合您目标的更新,其中根据更新的组动态地将诸如注水之类的算法应用于样本行。

ngrp <- 9
dfa <- subset(people, Var1 == "A")
dfb <- subset(people, Var1 == "B")

dfa_gr <- transform(
  dfa,
  grp = ave(Var2, Var2, FUN = function(x) {
    sample(
      rep(seq(ngrp),
        length.out = length(x)
      ), length(x)
    )
  })
)

lst <- split(subset(dfa_gr, select = -grp), dfa_gr$grp)

while (nrow(dfb) > 0) {
  k <- which.min(sapply(lst, nrow))
  tofill <- c(1:2)[which.min(table(factor(lst[[k]]$Var2, levels = 1:2)))]
  vb <- subset(dfb, Var2 == tofill)
  if (nrow(vb) > 0) {
    rm <- sample(row.names(vb), 1)
  } else {
    rm <- sample(row.names(dfb), 1)
  }
  lst[[k]] <- rbind(lst[[k]], dfb[rm, ])
  dfb <- dfb[row.names(dfb) != rm, ]
}

which gives这使

> lst
$`1`
      Name Var1 Var2
2   Name 2    A    2
9   Name 9    A    1
51 Name 51    A    1
49 Name 49    B    2
21 Name 21    B    1
29 Name 29    B    1

$`2`
      Name Var1 Var2
1   Name 1    A    1
13 Name 13    A    1
20 Name 20    A    2
19 Name 19    B    2
14 Name 14    B    1
32 Name 32    B    1

$`3`
      Name Var1 Var2
10 Name 10    A    1
43 Name 43    A    1
44 Name 44    A    2
30 Name 30    B    2
36 Name 36    B    1
34 Name 34    B    1

$`4`
      Name Var1 Var2
3   Name 3    A    1
23 Name 23    A    2
7   Name 7    B    1
45 Name 45    B    2
6   Name 6    B    1
17 Name 17    B    1

$`5`
      Name Var1 Var2
37 Name 37    A    2
41 Name 41    A    1
40 Name 40    B    1
25 Name 25    B    2
16 Name 16    B    1
22 Name 22    B    1

$`6`
      Name Var1 Var2
31 Name 31    A    1
42 Name 42    B    2
15 Name 15    B    1
48 Name 48    B    2
35 Name 35    B    1
8   Name 8    B    1

$`7`
      Name Var1 Var2
24 Name 24    A    1
27 Name 27    B    2
28 Name 28    B    1
46 Name 46    B    2
50 Name 50    B    1

$`8`
      Name Var1 Var2
38 Name 38    A    1
33 Name 33    B    2
18 Name 18    B    1
26 Name 26    B    2
39 Name 39    B    1

$`9`
      Name Var1 Var2
4   Name 4    A    1
11 Name 11    B    2
47 Name 47    B    1
12 Name 12    B    2
5   Name 5    B    1

Here is an attempt to group rows randomly, which has at least one Var1==A in each group and tries to have close size among groups.这是随机分组行的尝试,每组至少有一个Var1==A并尝试在组之间具有接近的大小。 However, I didn't get the meaning of this objective:但是,我没有理解这个目标的含义:

roughly equal balance between 1 and 2 for Var2 Var2在 1 和 2 之间大致相等的平衡

You have uneven numbers of 1 and 2 so it seems difficulty to have even distribution of them.您的12的数量是奇数,因此它们似乎很难均匀分布。 Or, could you explain a bit on that?或者,你能解释一下吗?


Below is one option, maybe close to your goal:以下是一种选择,可能接近您的目标:

ngrp <- 9
z <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      with(people, split(people, Var1)),
      function(v) {
        v <- v[order(v$Var2), ]
        transform(
          v,
          grp = sample(
            rep(seq(ngrp),
              length.out = nrow(v)
            ), nrow(v)
          )
        )
      }
    )
  )
)
res <- with(z, split(z, grp))

which gives这使

> res
$`1`
      Name Var1 Var2 grp
5  Name 10    A    1   1
7  Name 24    A    1   1
18  Name 5    B    1   1
19  Name 6    B    1   1
28 Name 22    B    1   1
48 Name 45    B    2   1

$`2`
      Name Var1 Var2 grp
1   Name 1    A    1   2
12 Name 51    A    1   2
32 Name 34    B    1   2
34 Name 36    B    1   2
39 Name 11    B    2   2
41 Name 19    B    2   2

$`3`
      Name Var1 Var2 grp
6  Name 13    A    1   3
17 Name 44    A    2   3
25 Name 17    B    1   3
29 Name 28    B    1   3
33 Name 35    B    1   3
49 Name 46    B    2   3

$`4`
      Name Var1 Var2 grp
2   Name 3    A    1   4
14 Name 20    A    2   4
22 Name 14    B    1   4
35 Name 39    B    1   4
50 Name 48    B    2   4
51 Name 49    B    2   4

$`5`
      Name Var1 Var2 grp
9  Name 38    A    1   5
15 Name 23    A    2   5
23 Name 15    B    1   5
27 Name 21    B    1   5
43 Name 26    B    2   5
44 Name 27    B    2   5

$`6`
      Name Var1 Var2 grp
13  Name 2    A    2   6
16 Name 37    A    2   6
31 Name 32    B    1   6
37 Name 47    B    1   6
40 Name 12    B    2   6
47 Name 42    B    2   6

$`7`
      Name Var1 Var2 grp
8  Name 31    A    1   7
10 Name 41    A    1   7
20  Name 7    B    1   7
30 Name 29    B    1   7
45 Name 30    B    2   7
46 Name 33    B    2   7

$`8`
      Name Var1 Var2 grp
4   Name 9    A    1   8
11 Name 43    A    1   8
24 Name 16    B    1   8
38 Name 50    B    1   8
42 Name 25    B    2   8

$`9`
      Name Var1 Var2 grp
3   Name 4    A    1   9
21  Name 8    B    1   9
26 Name 18    B    1   9
36 Name 40    B    1   9

A simple approach with dplyr : dplyr的简单方法:

  • first order the dataset randomly首先随机排序数据集
  • then order by Var1,Var2然后按Var1,Var2排序
  • select n=9 first people (with Var1 == 'A' because of order) select n=9第一人( Var1 == 'A'因为顺序)
  • sort the remaining people by Var2 and dispatch them in the groups余下的人按Var2排序,分组分派
library(dplyr)

n <- 9

data <- people[sample(nrow(people),replace=F),] %>% arrange(Var1,Var2)

rbind(head(data, n) %>% mutate(grp = 1:n),
      tail(data,-n) %>% arrange(Var2) %>%
                        mutate(grp = rep(1:n,length.out=nrow(people)-n))
     ) %>% split(.$grp)

$`1`
      Name Var1 Var2 grp
1  Name 43    A    1   1
10 Name 31    A    1   1
19 Name 22    B    1   1
28 Name 21    B    1   1
37 Name 23    A    2   1
46 Name 46    B    2   1

$`2`
      Name Var1 Var2 grp
2   Name 4    A    1   2
11 Name 51    A    1   2
20 Name 17    B    1   2
29 Name 14    B    1   2
38 Name 37    A    2   2
47 Name 49    B    2   2

$`3`
      Name Var1 Var2 grp
3   Name 3    A    1   3
12 Name 13    A    1   3
21  Name 5    B    1   3
30 Name 36    B    1   3
39 Name 11    B    2   3
48 Name 33    B    2   3

$`4`
      Name Var1 Var2 grp
4  Name 10    A    1   4
13 Name 15    B    1   4
22 Name 47    B    1   4
31  Name 8    B    1   4
40 Name 42    B    2   4
49 Name 19    B    2   4

$`5`
      Name Var1 Var2 grp
5   Name 1    A    1   5
14  Name 7    B    1   5
23 Name 34    B    1   5
32 Name 35    B    1   5
41 Name 26    B    2   5
50 Name 30    B    2   5

$`6`
      Name Var1 Var2 grp
6  Name 41    A    1   6
15 Name 50    B    1   6
24 Name 29    B    1   6
33 Name 16    B    1   6
42 Name 48    B    2   6
51 Name 27    B    2   6

$`7`
      Name Var1 Var2 grp
7   Name 9    A    1   7
16 Name 28    B    1   7
25 Name 40    B    1   7
34 Name 44    A    2   7
43 Name 45    B    2   7

$`8`
      Name Var1 Var2 grp
8  Name 38    A    1   8
17 Name 32    B    1   8
26 Name 39    B    1   8
35 Name 20    A    2   8
44 Name 25    B    2   8

$`9`
      Name Var1 Var2 grp
9  Name 24    A    1   9
18 Name 18    B    1   9
27  Name 6    B    1   9
36  Name 2    A    2   9
45 Name 12    B    2   9

Not sure if this will suit your application, but you might be able to use an existing 'stratified sampling' function and then evaluate the outcome to see whether it satisfies your additional requirements, eg不确定这是否适合您的应用,但您可以使用现有的“分层抽样”function,然后评估结果以查看它是否满足您的其他要求,例如

# Load packages
library(tidyverse)
#install.packages("splitTools")
library(splitTools)

# set seed
set.seed(12345)

# create data
people <- data.frame(Name = paste("Name", 1:51),
                     Var1 = sample(c("A", "B"), 51, prob = c(0.3, 0.7), replace = TRUE),
                     Var2 = sample(1:2, 51, replace = TRUE))
table(people$Var1, people$Var2)

# proportion of "people" in each split
prop <- 1/9
inds <- partition(people$Var1, p = c(a = prop, b = prop, c = prop,
                                     d = prop, e = prop, f = prop,
                                     g = prop, h = prop, i = prop))
# split the patients (load dfs into a list)
dfs <- list()
for (i in 1:9){
  dfs[[i]] <- people[inds[[i]],]
}
# name the dfs
names(dfs) <- c("df_01", "df_02", "df_03", "df_04", "df_05",
                "df_06", "df_07", "df_08", "df_09")

# check requirements (at least 1 "A" in Var1)
for (i in seq_along(dfs)){
  if(!nrow(filter(dfs[[i]], Var1 == "A")) >= 1){
    print("error")
  }
}

# If no error, load dataframes into global environment
list2env(dfs, envir=.GlobalEnv)

df_01
#      Name Var1 Var2
#5   Name 5    B    1
#9   Name 9    A    1
#14 Name 14    B    1
#26 Name 26    B    2
#27 Name 27    B    2
#38 Name 38    A    1

df_02
#      Name Var1 Var2
#2   Name 2    A    2
#10 Name 10    A    1
#16 Name 16    B    1
#19 Name 19    B    2
#29 Name 29    B    1
#39 Name 39    B    1

df_03
#      Name Var1 Var2
#7   Name 7    B    1
#17 Name 17    B    1
#25 Name 25    B    2
#33 Name 33    B    2
#44 Name 44    A    2
#51 Name 51    A    1

df_04
#      Name Var1 Var2
#20 Name 20    A    2
#30 Name 30    B    2
#34 Name 34    B    1
#37 Name 37    A    2
#45 Name 45    B    2
#50 Name 50    B    1

df_05
#      Name Var1 Var2
#3   Name 3    A    1
#12 Name 12    B    2
#13 Name 13    A    1
#15 Name 15    B    1
#22 Name 22    B    1
#36 Name 36    B    1

df_06
#      Name Var1 Var2
#1   Name 1    A    1
#8   Name 8    B    1
#11 Name 11    B    2
#23 Name 23    A    2
#35 Name 35    B    1
#42 Name 42    B    2

df_07
#     Name Var1 Var2
#6   Name 6    B    1
#21 Name 21    B    1
#43 Name 43    A    1
#47 Name 47    B    1

df_08
#     Name Var1 Var2
#24 Name 24    A    1
#28 Name 28    B    1
#32 Name 32    B    1
#41 Name 41    A    1
#46 Name 46    B    2
#48 Name 48    B    2

df_09
#      Name Var1 Var2
#4   Name 4    A    1
#18 Name 18    B    1
#31 Name 31    A    1
#40 Name 40    B    1
#49 Name 49    B    2

This has the issue that "df_07" only has 4 rows, but if you change the seed - eg set.seed(123) - and run it again you get groups with at least one "A" and 5 or 6 rows in each.这存在“df_07”只有 4 行的问题,但是如果您更改种子 - 例如set.seed(123) - 并再次运行它,您将获得至少有一个“A”且每个组有 5 或 6 行的组。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM