How to split a data frame into subgroups based on intervals of a numeric variable in R

I have a data frame (df) that looks like:

mi       chr    gen.pos
m4774   Ch01    0
m4775   Ch01    1.701
m4663   Ch01    5.519
m4777   Ch01    6.5
m4779   Ch01    11.067
m4780   Ch01    11.234
m3933   Ch01    11.449
m4782   Ch01    13.986
m5534   Ch01    119.277
m5536   Ch02    0.036
m5550   Ch02    4.26

Treating the chr column as the group, I first compute the width of 20 equal bins of gen.pos for each group with this code:

len <- as.data.frame(cbind(chr = unique(df$chr), 
  do.call(rbind, tapply(df$gen.pos, df$chr, function(x) {c(min = min(x), max = max(x))}))))
len$interval <- format(round((as.numeric(as.character(len$max))-as.numeric(as.character(len$min)))/20,3),nsmall=3)

So the len data frame is:

chr     min     max     interval
Ch01    0       119.277 5.964
Ch02    0.036   134.249 6.711
Ch03    0.07    93.596  4.676
Ch04    0.392   134.342 6.698
Ch05    0.581   96.842  4.813
Ch06    0.008   131.802 6.59

My task is to create a column called bin in df that assigns an index number to each interval of gen.pos within each group. For example, bin 1 is assigned to the 0~5.964 range of gen.pos, bin 2 to 5.965~11.928 (5.964*2 = 11.928), and so on. The final result looks like:

mi      chr   gen.pos   bin
m4774   Ch01    0       1
m4775   Ch01    1.701   1
m4663   Ch01    5.519   1
m4777   Ch01    6.5     2
m4779   Ch01    11.067  2
m4780   Ch01    11.234  2
m3933   Ch01    11.449  2
m4782   Ch01    13.986  3
m5534   Ch01    119.277 20
m5536   Ch02    0.036   1
m5550   Ch02    4.26    1

The len data frame output is not necessary; it is only there to describe my question more clearly. Thanks for any help.

len is an important reference, so I reproduce it here for clarity, as you did:

library(dplyr)
len <- df %>% 
         group_by(chr) %>%
         summarize(min=min(gen.pos), max=max(gen.pos), interval= (max-min)/20) 

Let the bin width be b = interval. If x = gen.pos does not coincide with an endpoint of the intervals, it falls into the ceiling((x-min)/b)-th interval. At x = min the formula gives 0, so that case is shifted to bin 1. So

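df %>% 
  group_by(chr) %>% 
  mutate(max   = max(gen.pos),                  # per-group maximum
         min   = min(gen.pos),                  # per-group minimum
         width = (max-min)/20,                  # bin width for 20 bins
         bin1  = ceiling((gen.pos-min)/width),  # 0 at the group minimum
         bin   = ifelse(gen.pos==min, bin1 + 1, bin1)  # move the minimum into bin 1
         ) 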

will produce the desired column with dplyr. (You can drop the helper columns with select().)
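
As a dependency-free cross-check of the ceiling rule, here is a minimal base R sketch on the sample rows from the question. It is not the method from the answer above: it uses ave() for the per-group minimum and maximum, so no helper columns are added to df. The names n_bins, grp_min, grp_max, width and raw are illustrative, and the pmin() cap is an added safeguard in case floating-point rounding pushes the group maximum slightly past bin 20. A group whose positions are all identical has width 0 and is not handled here.

```r
# Sample rows from the question. Ch02 is truncated to two rows here, so its
# bins differ from the question's expected output, which used the full data.
df <- data.frame(
  mi  = c("m4774", "m4775", "m4663", "m4777", "m4779", "m4780",
          "m3933", "m4782", "m5534", "m5536", "m5550"),
  chr = c(rep("Ch01", 9), rep("Ch02", 2)),
  gen.pos = c(0, 1.701, 5.519, 6.5, 11.067, 11.234,
              11.449, 13.986, 119.277, 0.036, 4.26),
  stringsAsFactors = FALSE
)

n_bins <- 20  # number of equal-width bins per chr group

# ave() applies a function within each chr group and returns a vector
# aligned with the rows of df.
grp_min <- ave(df$gen.pos, df$chr, FUN = min)
grp_max <- ave(df$gen.pos, df$chr, FUN = max)
width   <- (grp_max - grp_min) / n_bins

# ceiling() gives 0 at the group minimum; pmax() lifts that to bin 1.
# pmin() caps the result at n_bins (added safeguard, not in the answer above).
raw    <- ceiling((df$gen.pos - grp_min) / width)
df$bin <- pmin(pmax(raw, 1), n_bins)

print(df)
```

For the Ch01 rows this gives bins 1, 1, 1, 2, 2, 2, 2, 3, 20, matching the expected output in the question.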
