How to split a data frame into subgroups based on intervals of a numeric variable in R
I have a data frame (df) that looks like:
mi chr gen.pos
m4774 Ch01 0
m4775 Ch01 1.701
m4663 Ch01 5.519
m4777 Ch01 6.5
m4779 Ch01 11.067
m4780 Ch01 11.234
m3933 Ch01 11.449
m4782 Ch01 13.986
m5534 Ch01 119.277
m5536 Ch02 0.036
m5550 Ch02 4.26
Treating the chr column as the group, I first compute the width of 20 equal bins of the gen.pos column for each group with this code:
len <- as.data.frame(cbind(chr = unique(df$chr),
do.call(rbind, tapply(df$gen.pos, df$chr, function(x) {c(min = min(x), max = max(x))}))))
len$interval <- format(round((as.numeric(as.character(len$max))-as.numeric(as.character(len$min)))/20,3),nsmall=3)
So the len data frame is:
chr min max interval
Ch01 0 119.277 5.964
Ch02 0.036 134.249 6.711
Ch03 0.07 93.596 4.676
Ch04 0.392 134.342 6.698
Ch05 0.581 96.842 4.813
Ch06 0.008 131.802 6.59
My task is to create a column called bin in df that assigns an index number to each interval of gen.pos within each group. For example, bin 1 is assigned to the 0~5.964 range of gen.pos, bin 2 is assigned to 5.965~11.928 (5.964*2 = 11.928), and so on. The final result looks like:
mi chr gen.pos bin
m4774 Ch01 0 1
m4775 Ch01 1.701 1
m4663 Ch01 5.519 1
m4777 Ch01 6.5 2
m4779 Ch01 11.067 2
m4780 Ch01 11.234 2
m3933 Ch01 11.449 2
m4782 Ch01 13.986 3
m5534 Ch01 119.277 20
m5536 Ch02 0.036 1
m5550 Ch02 4.26 1
The len data frame is not a required output. It is only there to describe my question more clearly. Thanks for any help.
len is a useful reference, so I reproduce it here for clarity, as you did:
library(dplyr)
len <- df %>%
group_by(chr) %>%
summarize(min=min(gen.pos), max=max(gen.pos), interval= (max-min)/20)
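For readers without dplyr, the same per-group summary can be sketched in base R. This is only an illustration under stated assumptions: the toy df below reuses a few rows from the question, and the last Ch02 row is a hypothetical placeholder (its marker name is invented) carrying the Ch02 maximum shown in the len table. Taking the group labels from the row names of the summary keeps labels and values aligned even when df is not sorted by chr, which pairing unique(df$chr) with tapply() output does not guarantee, since tapply() orders groups by sorted factor level.

```r
# Toy data: rows from the question, deliberately listed with Ch02 first.
# "m_last" is a hypothetical placeholder row, not a real marker.
df <- data.frame(
  mi      = c("m5536", "m5550", "m_last", "m4774", "m4775", "m5534"),
  chr     = c("Ch02", "Ch02", "Ch02", "Ch01", "Ch01", "Ch01"),
  gen.pos = c(0.036, 4.26, 134.249, 0, 1.701, 119.277),
  stringsAsFactors = FALSE
)

# One row per group; row names are the group labels, columns are min and max.
stats <- t(sapply(split(df$gen.pos, df$chr), range))

len <- data.frame(
  chr = rownames(stats),
  min = unname(stats[, 1]),
  max = unname(stats[, 2]),
  stringsAsFactors = FALSE
)

# Width of each of the 20 bins, kept numeric rather than formatted as text.
len$interval <- round((len$max - len$min) / 20, 3)
print(len)
```

Keeping interval numeric (instead of passing it through format()) avoids the as.numeric(as.character(...)) round trip later on.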
Let the bin width be b = interval. If x = gen.pos does not coincide with an endpoint of an interval, it falls into the ceiling((x - min)/b)-th interval. The group minimum itself gives ceiling(0) = 0, so it has to be moved into bin 1 explicitly. So:
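df %>%
  group_by(chr) %>%
  mutate(max = max(gen.pos),                           # per-group maximum
         min = min(gen.pos),                           # per-group minimum
         width = (max - min) / 20,                     # width of each of the 20 bins
         bin1 = ceiling((gen.pos - min) / width),      # 0 for the group minimum
         bin = ifelse(gen.pos == min, bin1 + 1, bin1)  # move the minimum into bin 1
  )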
This produces the desired bin column using dplyr. (You can drop the helper columns max, min, width, and bin1 afterwards with select.)
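As a cross-check, here is a minimal base-R sketch of the same ceiling rule applied to the sample rows from the question. Several things in it are assumptions rather than part of the original answer: the helper name bin_index is made up, the last Ch02 row is a hypothetical placeholder carrying the Ch02 maximum from the len table, and the result is clamped to 1..20 instead of using ifelse. Clamping covers the group minimum and also guards against floating-point rounding pushing the group maximum into a 21st bin.

```r
# Sample rows from the question; "m_last" is a hypothetical placeholder row
# whose gen.pos is the Ch02 maximum listed in the len table.
df <- data.frame(
  mi = c("m4774", "m4775", "m4663", "m4777", "m4779", "m4780",
         "m3933", "m4782", "m5534", "m5536", "m5550", "m_last"),
  chr = c(rep("Ch01", 9), rep("Ch02", 3)),
  gen.pos = c(0, 1.701, 5.519, 6.5, 11.067, 11.234,
              11.449, 13.986, 119.277, 0.036, 4.26, 134.249),
  stringsAsFactors = FALSE
)

# Hypothetical helper: bin index of each value within one group.
bin_index <- function(x, n_bins = 20) {
  width <- (max(x) - min(x)) / n_bins
  # A group whose values are all equal has zero width; put it in bin 1.
  if (width == 0) return(rep(1, length(x)))
  bin <- ceiling((x - min(x)) / width)
  # Clamp to 1..n_bins: the minimum would otherwise get 0, and rounding
  # error could otherwise push the maximum past n_bins.
  pmin(pmax(bin, 1), n_bins)
}

# ave() applies the helper within each chr group and keeps the row order.
df$bin <- ave(df$gen.pos, df$chr, FUN = bin_index)
print(df)
```

On these rows the Ch01 values land in bins 1, 1, 1, 2, 2, 2, 2, 3 and 20, matching the table in the question.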