R - Create cut-like intervals with non-empty intersection
I have a data frame df with a column named x1 whose values lie between -5 and +5. I am trying to assign each row of df to an interval based on its value of x1. The function cut allows me to do so:
cut(df$x1,c(-5,-4,-3,-2,-1,0,1,2,3,4,5))
and I can then split df into 10 data frames using by. Unfortunately, what I would like is to assign intervals like -5 to -3.95, -4.05 to -2.95, -3.05 to -1.95 and so on, so that neighbouring intervals overlap. This means that after using by I will have 10 data frames, with a few rows appearing in two of them.
The next part of my question concerns the values near 0: an interval should not contain both negative and positive values, so the two intervals around 0 would end at 0 instead of overlapping.
Is there a way to achieve that in R?
EDIT: df looks like this:
other_var ... x1 ... another_var ...
100 ... 4 ... 18 ...
12.3 ... 3.84 ... -6.2 ...
1.4 ... 4.78 ... 4.78 ...
-2 ... -2.51 ... 7.1 ...
-3.2 ... 0.57 ... -1 ...
dput(df1)
structure(list(x0 = c(0.702166747375488, 0.205532096598193, 0.0704982518296982,
-0.159150628995597, -0.162625494967927, -0.331660025490033, -0.099135847436449,
-0.137985446193678, -0.179304942878067, 0.0554309512268647),
x1 = c(-0.561621170364712, -0.762747775318984, 1.63791710226613,
-0.861210697757564, -1.05393723031543, 0.809872536189693,
2.85973319518198, 0.211750306033687, 1.18360826959114, -0.358159130198865
), x2 = c(-0.304711385106637, 0.365667729645747, -0.406328268107825,
-0.315315872233279, -0.477546612710489, 0.251158976293131,
-1.1263800774781, 0.229002212764429, -0.00413111289214729,
-0.252467704090853)), .Names = c("x0", "x1", "x2"), row.names = c(NA,
10L), class = "data.frame")
I could not see a way to create such intervals with cut that did not lead to multiple columns, so I approached it from another angle: iterate over all cut points and return the subset of rows between each min and max.
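# Lower bounds of the overlapping intervals; the two middle ones meet at 0
intervals <- data.frame(min=c(-5,-4.05+0:3,0,0.95+0:3))
# The bounds are symmetric around 0, so the upper bounds mirror the lower ones
intervals$max <- rev(intervals$min)*-1
# Labels are printed as "[min;max)", but the filter below keeps min < x1 <= max
intervals$name <- with(intervals, sprintf("[%.2f;%.2f)",min,max))
# One subset of df1 per interval; a row may appear in two subsets
res <- lapply(split(intervals,intervals$name), function(x){
  return(df1[df1$x1> x$min & df1$x1 <= x$max,])
})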
> head(res)
$`[-1.05;-0.00)`
x0 x1 x2
1 0.70216675 -0.5616212 -0.3047114
2 0.20553210 -0.7627478 0.3656677
4 -0.15915063 -0.8612107 -0.3153159
10 0.05543095 -0.3581591 -0.2524677
$`[-2.05;-0.95)`
x0 x1 x2
5 -0.1626255 -1.053937 -0.4775466
$`[-3.05;-1.95)`
[1] x0 x1 x2
<0 rows> (or 0-length row.names)
$`[-4.05;-2.95)`
[1] x0 x1 x2
<0 rows> (or 0-length row.names)
$`[-5.00;-3.95)`
[1] x0 x1 x2
<0 rows> (or 0-length row.names)
$`[0.00;1.05)`
x0 x1 x2
6 -0.3316600 0.8098725 0.2511590
8 -0.1379854 0.2117503 0.2290022
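To check which values land in two intervals at once, a membership matrix can help. The following is a minimal base R sketch, assuming a toy vector x (hypothetical values, not df1) and the same bounds as the intervals data.frame above; the names lo, hi, member and n_hits are introduced here purely for illustration.

```r
# Toy values standing in for df1$x1 (hypothetical, for illustration only).
x <- c(-4.5, -4.0, -0.5, 0.3, 1.0, 4.7)

# Same overlapping bounds as the `intervals` data.frame above.
lo <- c(-5, -4.05 + 0:3, 0, 0.95 + 0:3)
hi <- -rev(lo)

# member[i, j] is TRUE when lo[j] < x[i] <= hi[j].
member <- outer(x, lo, ">") & outer(x, hi, "<=")

# Number of intervals each value belongs to; values in an overlap zone get 2.
n_hits <- rowSums(member)
print(n_hits)
# 1 2 1 1 2 1

# One vector of values per interval, analogous to `res` above.
groups <- lapply(seq_along(lo), function(j) x[member[, j]])
```

Here -4.0 and 1.0 sit in the overlap zones, so each is counted in two intervals, while a value of exactly 0 would belong only to the interval ending at 0.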
Here's a solution that uses foverlaps(...) in the data.table package. Unfortunately, you need the most recent development version for this to work. It uses the intervals data.frame from the other answer.
##install.packages("devtools")
# library(devtools)
# install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table)
y <- with(df1,data.table(row=1:nrow(df1),lo=x1, hi=x1, key=c("lo","hi")))
cuts <- foverlaps(setDT(intervals),y, by.x=c("min","max"))[,list(row,name)]
lapply(split(cuts, cuts$name),function(s)df1[sort(s$row),])
# $`[-1.05;-0.00)`
# x0 x1 x2
# 1 0.70216675 -0.5616212 -0.3047114
# 2 0.20553210 -0.7627478 0.3656677
# 4 -0.15915063 -0.8612107 -0.3153159
# 10 0.05543095 -0.3581591 -0.2524677
#
# $`[-2.05;-0.95)`
# x0 x1 x2
# 5 -0.1626255 -1.053937 -0.4775466
#
# $`[-3.05;-1.95)`
# [1] x0 x1 x2
# <0 rows> (or 0-length row.names)
#...
foverlaps(x,y,...) does an "overlap join", that is, it finds all the records in y which overlap records in x. An overlap is defined as a range between two columns in y (say, a and b) that overlaps the corresponding range between two columns in x (say, c and d). In this case we use, for x, the intervals data.frame (converted to a data.table), and for y, a data.table whose lo and hi columns are both equal to df1$x1.
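The overlap join can be tried on a smaller case. This is a rough sketch, assuming the data.table package is installed and using toy tables pts and ivl (hypothetical names and values, not the df1 and intervals objects above): each point is stored as a zero-width range, so it matches every interval that contains it.

```r
library(data.table)

# Each point becomes a zero-width range [lo, hi]; y must be keyed on it.
pts <- data.table(row = 1:3, lo = c(-4, 0.3, 1), hi = c(-4, 0.3, 1))
setkey(pts, lo, hi)

# Four overlapping intervals (toy values for illustration).
ivl <- data.table(min = c(-5, -4.05, 0, 0.95),
                  max = c(-3.95, -2.95, 1.05, 2.05))

# Overlap join; nomatch = 0L drops intervals that contain no point.
ov <- foverlaps(ivl, pts, by.x = c("min", "max"), nomatch = 0L)

# How many intervals each point fell into.
hits <- ov[, .N, by = row]
print(hits)
```

Points -4 and 1 lie where two intervals overlap, so each should appear twice in ov, while 0.3 should appear once.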