Merge groups based on sample size in R
I have a table in the following format. I simplified it to illustrate the problem (the sample counts here are made up; in my data they add up to 10000, but the structure is the same).
# 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-45 45-50
# 700 1000 1400 1700 1900 1500 1000 300 50 1
The groups are created dynamically based on the min and max value of my input. y refers to my input random sample. I created this table using the following code.
groups <- seq(0, 50, (50-0) / 10)
assoc <- cut(sr$y, groups, include.lowest = TRUE)
tab <- tabulate(assoc, nbins = length(groups) -1 )
Now my goal is to merge a column (and its samples) with the next one if it does not fulfill a minimum-size condition of, e.g., 100 samples. I got as far as checking with which:
sn <- which(tab < 60) + 1
And now I am stuck on merging the columns and their sample data. I would really appreciate some help.
One solution can be achieved using gather, separate, unite and spread from the tidyr package.
The approach is:
1. gather and separate to get the data in row-wise format with from & to
2. Assign a group by merging a row with samples less than 100 with the next row
3. Summarise each group: min of from, max of to and sum of samples
4. unite and spread to get the data.frame back in the original format

Solution#1
library(dplyr)
library(tidyr)
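# Reshape to one row per interval; convert = TRUE makes from/to numeric,
# so min() and max() compare numbers rather than strings ("5" > "10" as text).
gather(df, key, samples) %>%
  separate(key, c("from", "to"), sep = "-", convert = TRUE) %>%
  # A row below 100 takes the next row's group id, unless the previous row was also below 100.
  group_by(grp = ifelse(samples >= 100 | lag(samples) < 100, row_number(), row_number() + 1)) %>%
  summarise(from = min(from), to = max(to), samples = sum(samples)) %>%
  select(-grp) %>%
  # Pad from to width 2 so spread() keeps the columns in interval order.
  mutate(from = sprintf("%2s", from)) %>%
  unite("key", from, to, sep = "-") %>%
  spread(key, samples) %>% as.data.frame()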
# 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-50
# 1 700 1000 1400 1700 1900 1500 1000 300 51
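To see which columns the group_by expression pairs up, it can be evaluated on the sample counts alone. The following is a minimal base R sketch, not part of the original answer; prev and idx are hypothetical stand-ins for lag(samples) and row_number().

```r
# Sample counts from the Data section, one value per interval.
samples <- c(700, 1000, 1400, 1700, 1900, 1500, 1000, 300, 50, 1)

idx  <- seq_along(samples)            # stand-in for row_number()
prev <- c(NA, head(samples, -1))      # stand-in for lag(samples)

# Same condition as in the pipeline: a small row borrows the next row's id
# unless the row before it was small as well.
grp <- ifelse(samples >= 100 | prev < 100, idx, idx + 1)
grp
# grp is 1 2 3 4 5 6 7 8 10 10
```

Columns 9 and 10 share a label, so they are summed into 40-50. Because each row can only take its own id or the next one, this expression merges at most two adjacent columns, and if the very first column were below 100 the comparison with the missing previous value would give NA instead of a group label.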
Solution#2:
If the OP's intention is to keep merging columns until a target number of samples (e.g. 100) is reached, then we need a custom function to create the groups. The function is:
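findGroup <- function(x, targetVal = 100){
  grp <- seq_along(x)
  for(i in seq_along(x[-length(x)])){
    # If the running total at i is still below the target,
    # carry it into the next element and give that element the same group.
    if(x[i] < targetVal){
      x[i+1] <- x[i+1] + x[i]
      grp[i+1] <- grp[i]
    }
  }
  grp
}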
# Use findGroup to assign the groups; only the group_by line differs from Solution#1.
# convert = TRUE makes from/to numeric so min() and max() compare numbers, not strings.
gather(df, key, samples) %>%
  separate(key, c("from", "to"), sep = "-", convert = TRUE) %>%
  group_by(grp = findGroup(samples)) %>%
  summarise(from = min(from), to = max(to), samples = sum(samples)) %>%
  select(-grp) %>%
  mutate(from = sprintf("%2s", from)) %>%
  unite("key", from, to, sep = "-") %>%
  spread(key, samples) %>% as.data.frame()
# 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-50
# 1 700 1000 1400 1700 1900 1500 1000 300 51
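The grouping function can also be checked on plain vectors, and applied directly to the tab vector from the question without any reshaping. This is a minimal base R sketch, assuming the breaks in groups line up with the counts in tab; findGroup is repeated from above so the block runs on its own, and lower, upper and merged are illustrative names that do not appear in the original answer.

```r
# Copy of findGroup() from above, repeated so this block is self-contained.
findGroup <- function(x, targetVal = 100) {
  grp <- seq_along(x)
  for (i in seq_len(length(x) - 1)) {
    if (x[i] < targetVal) {
      x[i + 1] <- x[i + 1] + x[i]
      grp[i + 1] <- grp[i]
    }
  }
  grp
}

# A run of small counts keeps accumulating until the target is reached.
findGroup(c(30, 40, 50, 500))
# 1 1 1 4

# Breaks and counts as in the question (counts typed in by hand here).
groups <- seq(0, 50, by = 5)
tab    <- c(700, 1000, 1400, 1700, 1900, 1500, 1000, 300, 50, 1)

grp   <- findGroup(tab)
lower <- tapply(head(groups, -1), grp, min)   # lower break of each merged group
upper <- tapply(groups[-1], grp, max)         # upper break of each merged group

# Sum the counts per group and label each with its merged interval.
merged <- setNames(as.vector(tapply(tab, grp, sum)),
                   paste(lower, upper, sep = "-"))
merged
```

As with the tidyr pipeline, the last group can still end up below the target (51 here), because there is no later column to merge it into.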
Data
df <- structure(list(`0-5` = 700L, `5-10` = 1000L, `10-15` = 1400L, `15-20` = 1700L,
`20-25` = 1900L, `25-30` = 1500L, `30-35` = 1000L, `35-40` = 300L,
`40-45` = 50L, `45-50` = 1L), .Names = c("0-5", "5-10",
"10-15", "15-20", "20-25", "25-30", "30-35", "35-40", "40-45",
"45-50"), class = "data.frame", row.names = 1L)
df
# 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-45 45-50
# 1 700 1000 1400 1700 1900 1500 1000 300 50 1