
Merge groups based on sample size in R

I have a table in the following format. I simplified it to illustrate the problem (the sample counts here are arbitrary; in my data they add up to 10000, but the structure is the same).

# 0-5    5-10    10-15    15-20    20-25    25-30    30-35    35-40    40-45    45-50
# 700    1000    1400     1700     1900     1500     1000      300       50      1   

The groups are created dynamically based on the min and max values of my input; y refers to my input random sample. I created this table using the following code.

groups <- seq(0, 50, (50-0) / 10)
assoc <- cut(sr$y, groups, include.lowest = TRUE)
tab <- tabulate(assoc, nbins = length(groups) -1 )
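The code above depends on sr$y, which is not shown. For a reproducible version, here is a minimal sketch that assumes a simulated y (uniform on 0 to 50, purely a stand-in for the real input) and derives the breaks from the observed range. Naming the counts with the factor levels is an addition for readability, not part of the original code.

```r
# Simulated stand-in for sr$y (an assumption; the real data is not shown)
set.seed(42)
y <- runif(10000, min = 0, max = 50)

# Ten equal-width bins spanning the observed range of the input
lo <- floor(min(y))
hi <- ceiling(max(y))
groups <- seq(lo, hi, length.out = 11)

# Assign each value to a bin, then count the values per bin
assoc <- cut(y, groups, include.lowest = TRUE)
tab <- tabulate(assoc, nbins = length(groups) - 1)
names(tab) <- levels(assoc)
tab
```

The counts differ from the table in the question because the data is simulated; only the structure is the same.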

Now my goal is to merge a column (and its samples) with the next one if it does not fulfill a minimum-size condition, e.g. 100 samples. I got as far as checking with which():

sn <- which(tab < 60) + 1

And now I am stuck on merging the columns and their sample data. I would really appreciate some help.

One solution uses gather, separate, unite and spread from the tidyr package.

The approach is:

  • Use gather and separate to get the data in row-wise format with from and to columns.
  • Assign a group by merging a row that has fewer than 100 samples with the next row.
  • Within each group, calculate the min of from, the max of to and the sum of samples.
  • Finally, unite and spread to get the data.frame back in its original format.
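To make the grouping rule in the second step concrete, here is a base R sketch on the counts from the question. It is an illustration under assumptions rather than the answer's own code: the names samples, prev, keep and grp are chosen for this example, and the previous value for the first bin is set to Inf so that the first bin never counts as following a small bin.

```r
# Counts from the question's example table
samples <- c(700, 1000, 1400, 1700, 1900, 1500, 1000, 300, 50, 1)

idx  <- seq_along(samples)
# Count of the preceding bin; Inf for the first bin (an assumption of this sketch)
prev <- c(Inf, head(samples, -1))

# A bin keeps its own index if it is large enough, or if the bin before it
# was small (that small bin was already pushed forward onto this index).
# Otherwise it takes the index of the next bin.
keep <- samples >= 100 | prev < 100
grp  <- ifelse(keep, idx, idx + 1L)
grp
# 1 2 3 4 5 6 7 8 10 10

# Sum the counts within each group
merged <- tapply(samples, grp, sum)
merged
```

Bins 9 and 10 (50 and 1 samples) end up in the same group, with a combined count of 51. The rule only looks one bin ahead, so a run of three or more small bins is not fully merged.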

Solution#1

library(dplyr)
library(tidyr)

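# Reshape to one row per bin, with numeric from/to bounds
# (convert = TRUE makes min/max compare numbers rather than strings)
gather(df, key, samples) %>%
separate(key, c("from", "to"), sep = "-", convert = TRUE) %>%
# A bin below 100 that follows a bin of at least 100 joins the next row's group
group_by(grp = ifelse(samples >= 100 | lag(samples)<100,row_number(), row_number()+1)) %>%
summarise(from = min(from), to = max(to), samples = sum(samples)) %>%
select(-grp) %>%
# Pad from to width 2 so the keys sort in numeric order when spread
mutate(from = sprintf("%2s",from)) %>%
unite("key", from, to, sep="-") %>%
spread(key, samples) %>% as.data.frame()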
#    0-5  5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-50
# 1  700  1000  1400  1700  1900  1500  1000   300    51

Solution#2:

If the OP's intention is to keep grouping columns until a target number of samples (e.g. 100) is reached, then we need a custom function to create the groups. The function is:

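findGroup <- function(x, targetVal = 100){
  # Start with every bin in its own group
  grp <- seq_along(x)
  for(i in seq_along(x[-length(x)])){
    # x[i] holds the running total of the group that ends at bin i
    if(x[i] < targetVal){
      # Still below the target: carry the total and the group id into the next bin
      x[i+1] <- x[i+1] + x[i]
      grp[i+1] <- grp[i]
    }
  }
  grp
}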

# Use the findGroup function to organize the data. Only the `group_by` line has changed.
# convert = TRUE keeps from/to numeric, so min/max compare numbers rather than strings.
gather(df, key, samples) %>%
  separate(key, c("from", "to"), sep = "-", convert = TRUE) %>%
  group_by(grp = findGroup(samples)) %>%
  summarise(from = min(from), to = max(to), samples = sum(samples)) %>%
  select(-grp) %>%
  mutate(from = sprintf("%2s",from)) %>%
  unite("key", from, to, sep="-") %>%
  spread(key, samples) %>% as.data.frame()

#    0-5  5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-50
# 1  700  1000  1400  1700  1900  1500  1000   300    51
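The same running-total idea can also be applied directly to the count vector and break points from the question, without reshaping column names. The sketch below is a base R variant under assumptions: merge_small_bins is a hypothetical helper name, and it expects one more break than there are counts (as produced by seq and tabulate in the question).

```r
# Hypothetical helper: fold each bin whose running total is below `target`
# into the bin to its right, and label the result from the break points.
merge_small_bins <- function(counts, breaks, target = 100) {
  stopifnot(length(counts) >= 1, length(breaks) == length(counts) + 1)
  grp <- seq_along(counts)
  running <- counts
  for (i in seq_len(length(counts) - 1)) {
    if (running[i] < target) {
      # Carry the running total and the group id into the next bin
      running[i + 1] <- running[i + 1] + running[i]
      grp[i + 1] <- grp[i]
    }
  }
  # Per group: total count, lowest lower bound, highest upper bound
  merged <- as.vector(tapply(counts, grp, sum))
  from   <- as.vector(tapply(head(breaks, -1), grp, min))
  to     <- as.vector(tapply(breaks[-1], grp, max))
  setNames(merged, paste(from, to, sep = "-"))
}

counts <- c(700, 1000, 1400, 1700, 1900, 1500, 1000, 300, 50, 1)
breaks <- seq(0, 50, by = 5)
res <- merge_small_bins(counts, breaks)
res
```

For the example data the last two bins are combined into 40-50 with 51 samples. Because merging only goes to the right, the final bin can still fall short of the target, as it does here; merging it into its left neighbour would need an extra step.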

Data

df <- structure(list(`0-5` = 700L, `5-10` = 1000L,  `10-15` = 1400L, `15-20` = 1700L, 
                     `20-25` = 1900L, `25-30` = 1500L, `30-35` = 1000L, `35-40` = 300L, 
                     `40-45` = 50L, `45-50` = 1L), .Names = c("0-5", "5-10",
                     "10-15", "15-20", "20-25", "25-30", "30-35", "35-40", "40-45", 
                     "45-50"), class = "data.frame", row.names = 1L)
df
  #   0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-45 45-50
  # 1 700 1000  1400  1700  1900  1500  1000   300    50     1
