R: Find out which observations are located in each "bar" of the histogram
I am working with the R programming language. Suppose I have the following data:
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
index <- 1:1400
my_data = data.frame(index,d)
I can make the following histograms of the same data by adjusting the "bin" length (via the "breaks" option):
hist(my_data$d, breaks = 10, main = "Histogram #1, Breaks = 10")
hist(my_data$d, breaks = 100, main = "Histogram #2, Breaks = 100")
hist(my_data$d, breaks = 5, main = "Histogram #3, Breaks = 5")
My Question: Each of these histograms has a different number of "bars" (i.e. bins). For example, the first histogram has 8 bars and the third has 4 bars. For each of these histograms, is there a way to find out which observations (from the original vector "d") are located in each bar?
Right now, I am trying to do this manually, e.g. (for histogram #3):
histogram3_bar1 <- my_data[which(my_data$d < 5 & my_data$d > 0), ]
histogram3_bar2 <- my_data[which(my_data$d < 10 & my_data$d > 5), ]
histogram3_bar3 <- my_data[which(my_data$d < 15 & my_data$d > 10), ]
histogram3_bar4 <- my_data[which(my_data$d < 20 & my_data$d > 15), ]
head(histogram3_bar1)
index d
1001 1001 4.156393
1002 1002 3.358958
1003 1003 1.605904
1004 1004 3.603535
1006 1006 2.943456
1007 1007 1.586542
But is there a more "efficient" way to do this?
Thanks!
hist itself can solve the question's problem of finding out which data points are in which intervals: hist returns a list whose first member is breaks.
First, make the problem reproducible by setting the RNG seed.
set.seed(2021)
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
Now, save the return value of hist and have findInterval tell which bin each data point is in.
h1 <- hist(d, breaks = 10)
f1 <- findInterval(d, h1$breaks)
h1$breaks
# [1] -2 0 2 4 6 8 10 12 14 16
head(f1)
#[1] 6 7 7 7 7 6
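The first six observations fall in intervals 6 and 7, whose end points are 8, 10 and 12 (that is, 8 to 10 and 10 to 12). The call below indexes d by the bin numbers in f1, so it shows d[6] and d[7] (which themselves lie in bins 6 and 7), not the first six observations; head(d) would show those.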
head(d[f1])
#[1] 8.07743 10.26174 10.26174 10.26174 10.26174 8.07743
As for whether the intervals given by end points 8, 10 and 12 are left- or right-closed, see help("findInterval"). By default findInterval treats them as closed on the left.
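One point worth checking here: by default hist uses right-closed bins (right = TRUE, with include.lowest = TRUE closing the first bin on the left), while findInterval defaults to left-closed intervals. The two can only disagree for values that fall exactly on a break, which is unlikely with continuous random draws but common with rounded or integer data. A minimal sketch with made-up values (x, brk, f_default and f_hist are illustrative names, not from the question):

```r
x   <- c(0, 1, 2, 2, 3, 4)   # toy data with values sitting exactly on the breaks
brk <- c(0, 2, 4)

h <- hist(x, breaks = brk, plot = FALSE)
h$counts

# Default findInterval: left-closed intervals, so 2 lands in bin 2
# and 4 lands past the last break (index 3)
f_default <- findInterval(x, brk)

# left.open = TRUE gives (a, b] intervals; rightmost.closed = TRUE then
# closes the first interval on the left, mirroring include.lowest = TRUE
f_hist <- findInterval(x, brk, left.open = TRUE, rightmost.closed = TRUE)

tabulate(f_default, nbins = 3)
tabulate(f_hist, nbins = length(h$counts))
```

With the second call the per-bin tallies line up with h$counts even for boundary values.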
As a final check, table the values returned by findInterval and see if they match the histogram's counts.
table(f1)
#f1
# 1 2 3 4 5 6 7 8 9
# 2 34 130 34 17 478 512 169 24
h1$counts
#[1] 2 34 130 34 17 478 512 169 24
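The same comparison can be made in code instead of reading the two printouts by eye. A small sketch, regenerating d so the block runs on its own (n_bins is just a helper name used here):

```r
set.seed(2021)
d <- c(rnorm(1000, 10, 1), rnorm(200, 3, 1), rnorm(200, 13, 1))

h1 <- hist(d, breaks = 10, plot = FALSE)   # plot = FALSE: compute the bins only
f1 <- findInterval(d, h1$breaks)

# Count observations per bin and compare with what hist reports
n_bins <- length(h1$counts)
all(tabulate(f1, nbins = n_bins) == h1$counts)
```

tabulate is used rather than table so that empty bins are counted as zero and the two vectors have the same length.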
To get the interval for each data point, use the following:
bins <- data.frame(bin = f1, min = h1$breaks[f1], max = h1$breaks[f1 + 1L])
head(bins)
# bin min max
#1 6 8 10
#2 7 10 12
#3 7 10 12
#4 7 10 12
#5 7 10 12
#6 6 8 10
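To get the observations themselves grouped by bar, as the question asks, one option is to split the data frame by bin. This is a sketch rather than the only way to do it: it uses cut with the histogram's breaks, with right = TRUE and include.lowest = TRUE chosen to mirror the defaults of hist. The names h3, bar and by_bar are just for illustration.

```r
set.seed(2021)
d <- c(rnorm(1000, 10, 1), rnorm(200, 3, 1), rnorm(200, 13, 1))
my_data <- data.frame(index = seq_along(d), d = d)

h3 <- hist(my_data$d, breaks = 5, plot = FALSE)

# Label each row with the bar it falls into
my_data$bar <- cut(my_data$d, breaks = h3$breaks,
                   right = TRUE, include.lowest = TRUE)

# One data frame per bar (empty bars are kept as zero-row data frames)
by_bar <- split(my_data, my_data$bar)

sapply(by_bar, nrow)   # should agree with h3$counts
head(by_bar[[1]])      # observations in the first bar
```

Because the breaks come from the hist object, the same code works unchanged for breaks = 10 or breaks = 100.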