R - Create cut-like intervals with non-empty intersection

I have a data frame df with a column named x1 whose values lie between -5 and +5. I am trying to assign each row of df to an interval based on its value of x1. The function cut allows me to do so:

cut(df$x1,c(-5,-4,-3,-2,-1,0,1,2,3,4,5))

and I can then split df into 10 data frames using by. Unfortunately, what I would like is to assign overlapping intervals such as -5 to -3.95, -4.05 to -2.95, -3.05 to -1.95 and so on, meaning that:

  • 4.06 will be in the interval 3.95 to 5.05
  • 4.05 will be in the intervals 3.95 to 5.05 and 2.95 to 4.05
  • 4.04 will be in the intervals 3.95 to 5.05 and 2.95 to 4.05
  • 3.94 will be in the interval 2.95 to 4.05

which means that after using by I will have 10 data frames, with a few rows appearing in 2 of those data frames.
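For reference, here is a minimal base R sketch of why plain cut cannot express this: every value receives exactly one factor level, so no value can land in two groups. The vector x1 below is hypothetical and simply reuses the four values from the list above.

```r
# Hypothetical sample values taken from the bullet points above.
x1 <- c(4.06, 4.05, 4.04, 3.94)

# cut() uses right-closed bins (a, b] by default; each value gets one level.
bins <- cut(x1, breaks = -5:5)
as.character(bins)

# split() groups the values by level; empty levels are dropped here.
groups <- split(x1, bins, drop = TRUE)

# With the defaults the lowest break itself is not covered
# (include.lowest = TRUE would change that).
lowest <- cut(-5, breaks = -5:5)
```

Here 4.05 and 4.04 end up only in the bin (4,5], whereas the question wants them in two neighbouring intervals.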

The next part of my question concerns the values near 0: no interval should contain both negative and positive values, so the intervals would be:

  • -5 to -3.95
  • -4.05 to -2.95
  • ...
  • -2.05 to -0.95
  • -1.05 to 0 AND NOT -1.05 to 0.05
  • 0 to 1.05 AND NOT -0.05 to 1.05
  • 0.95 to 2.05
  • ...

Is there a way to achieve that in R?

EDIT: df

df looks like this:

other_var  ...   x1  ... another_var ...
    100    ... 4     ...   18     ...
    12.3   ... 3.84  ...   -6.2   ...
    1.4    ... 4.78  ...    4.78  ...
    -2     ... -2.51 ...    7.1   ...
    -3.2   ... 0.57  ...   -1     ...


dput(df1)

structure(list(x0 = c(0.702166747375488, 0.205532096598193,     0.0704982518296982, 
-0.159150628995597, -0.162625494967927, -0.331660025490033, -0.099135847436449, 
-0.137985446193678, -0.179304942878067, 0.0554309512268647), 
x1 = c(-0.561621170364712, -0.762747775318984, 1.63791710226613, 
-0.861210697757564, -1.05393723031543, 0.809872536189693, 
2.85973319518198, 0.211750306033687, 1.18360826959114, -0.358159130198865
), x2 = c(-0.304711385106637, 0.365667729645747, -0.406328268107825, 
-0.315315872233279, -0.477546612710489, 0.251158976293131, 
-1.1263800774781, 0.229002212764429, -0.00413111289214729, 
-0.252467704090853)), .Names = c("x0", "x1", "x2"), row.names = c(NA, 
10L), class = "data.frame")

I could not see a way of creating these intervals with cut that did not lead to multiple columns, so I approached it from another angle: iterate over all intervals and return the subset of rows that falls between each interval's min and max.

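# One row per interval: lower bounds first, then the mirrored upper bounds.
intervals <- data.frame(min=c(-5,-4.05+0:3,0,0.95+0:3))
intervals$max <- rev(intervals$min)*-1
intervals$name <- with(intervals, sprintf("[%.2f;%.2f)",min,max))
# Subset df1 once per interval. Note that the test is min < x1 <= max,
# so a value exactly on an upper bound is kept and one on a lower bound is not.
res <- lapply(split(intervals,intervals$name), function(x){
  return(df1[df1$x1> x$min & df1$x1 <= x$max,])
})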

> head(res)
$`[-1.05;-0.00)`
            x0         x1         x2
1   0.70216675 -0.5616212 -0.3047114
2   0.20553210 -0.7627478  0.3656677
4  -0.15915063 -0.8612107 -0.3153159
10  0.05543095 -0.3581591 -0.2524677

$`[-2.05;-0.95)`
          x0        x1         x2
5 -0.1626255 -1.053937 -0.4775466

$`[-3.05;-1.95)`
[1] x0 x1 x2
<0 rows> (or 0-length row.names)

$`[-4.05;-2.95)`
[1] x0 x1 x2
<0 rows> (or 0-length row.names)

$`[-5.00;-3.95)`
[1] x0 x1 x2
<0 rows> (or 0-length row.names)

$`[0.00;1.05)`
          x0        x1        x2
6 -0.3316600 0.8098725 0.2511590
8 -0.1379854 0.2117503 0.2290022
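The labels above print as [min;max), while the filter keeps rows with min < x1 <= max. A minimal variant, assuming the question wants both ends closed (so that 4.05 lands in two intervals), could look like the sketch below. The data frame df, the vectors lo and hi, and the names res_closed and n_hits are illustrative and not part of the original answer; the bounds are written out literally to avoid floating-point drift at the edges.

```r
# Hypothetical example data; the values mirror the bullet points in the question.
df <- data.frame(id = 1:4, x1 = c(4.06, 4.05, 4.04, 3.94))

# Interval bounds written out literally (assumption: closed on both ends).
lo <- c(-5, -4.05, -3.05, -2.05, -1.05, 0, 0.95, 1.95, 2.95, 3.95)
hi <- c(-3.95, -2.95, -1.95, -0.95, 0, 1.05, 2.05, 3.05, 4.05, 5)
labels <- sprintf("[%.2f;%.2f]", lo, hi)

# One data frame per interval; a row may appear in two neighbouring intervals.
res_closed <- lapply(seq_along(lo), function(j) {
  df[df$x1 >= lo[j] & df$x1 <= hi[j], , drop = FALSE]
})
names(res_closed) <- labels

# Number of intervals each row falls into.
hits <- sapply(seq_along(lo), function(j) df$x1 >= lo[j] & df$x1 <= hi[j])
n_hits <- rowSums(hits)
```

With closed bounds on both ends, a value of exactly 0 would fall into both [-1.05;0.00] and [0.00;1.05]; if that is not wanted, one of the two comparisons around 0 needs to be made strict.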

Here's a solution that uses foverlaps(...) in the data.table package. Unfortunately, you need the most recent development version for this to work. It uses the intervals data.frame from the other answer.

##install.packages("devtools")
# library(devtools)
# install_github("Rdatatable/data.table", build_vignettes = FALSE)

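library(data.table)
# Treat each x1 value as a zero-width range [lo, hi] and key the table on it.
y    <- with(df1,data.table(row=1:nrow(df1),lo=x1, hi=x1, key=c("lo","hi")))
# Overlap join: pair every interval with the rows whose x1 falls inside it.
cuts <- foverlaps(setDT(intervals),y, by.x=c("min","max"))[,list(row,name)]
# Rebuild one data.frame per interval name from the matched row numbers.
lapply(split(cuts, cuts$name),function(s)df1[sort(s$row),])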
# $`[-1.05;-0.00)`
#            x0         x1         x2
# 1   0.70216675 -0.5616212 -0.3047114
# 2   0.20553210 -0.7627478  0.3656677
# 4  -0.15915063 -0.8612107 -0.3153159
# 10  0.05543095 -0.3581591 -0.2524677
#
# $`[-2.05;-0.95)`
#           x0        x1         x2
# 5 -0.1626255 -1.053937 -0.4775466
#
# $`[-3.05;-1.95)`
# [1] x0 x1 x2
# <0 rows> (or 0-length row.names)
#...

foverlaps(x,y,...) does an "overlap join", that is, it finds all the records in y which overlap records in x. An overlap is defined on ranges: the range between two columns in y (say, a and b) overlaps the corresponding range between two columns in x (say, c and d). In this case we use, for x, the intervals data.frame (converted to a data.table), and for y, a data.table whose lo and hi columns are both equal to df1$x1.
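The overlap condition itself can be sketched in base R, assuming closed ranges. The helper ranges_overlap is a hypothetical name used only for illustration; it is not part of data.table and does not reproduce how foverlaps is implemented.

```r
# Two closed ranges [a, b] and [c, d] overlap when each starts
# no later than the other one ends.
ranges_overlap <- function(a, b, c, d) a <= d & c <= b

# A single point is a degenerate range with lo == hi,
# which is how x1 is encoded in the data.table y above.
x1 <- -1.05393723031543   # row 5 of df1
lo <- x1
hi <- x1

in_second <- ranges_overlap(lo, hi, -2.05, -0.95)  # TRUE
in_first  <- ranges_overlap(lo, hi, -1.05, 0)      # FALSE
```

This agrees with the output shown earlier, where row 5 appears only in the interval from -2.05 to -0.95.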
