
R: Find out which observations are located in each "bar" of the histogram

I am working with the R programming language. Suppose I have the following data:

a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)

d = c(a,b,c)
index <- 1:1400

my_data = data.frame(index,d)

I can make the following histograms of the same data by adjusting the "bin" length (via the "breaks" option):

hist(my_data$d, breaks = 10, main = "Histogram #1, Breaks = 10")
hist(my_data$d, breaks = 100, main = "Histogram #2, Breaks = 100")
hist(my_data$d, breaks = 5, main = "Histogram #3, Breaks = 5")


My Question: Each of these histograms has a different number of "bars" (i.e. bins). For example, the first histogram has 8 bars and the third histogram has 4 bars. For each of these histograms, is there a way to find out which observations (from the original vector "d") are located in each bar?

Right now, I am trying to do this manually, e.g. for histogram #3:

histogram3_bar1 <- my_data[which(my_data$d < 5 & my_data$d > 0), ]
histogram3_bar2 <- my_data[which(my_data$d < 10 & my_data$d > 5), ]
histogram3_bar3 <- my_data[which(my_data$d < 15 & my_data$d > 10), ]
histogram3_bar4 <- my_data[which(my_data$d < 20 & my_data$d > 15), ]


head(histogram3_bar1)

    index        d
1001  1001 4.156393
1002  1002 3.358958
1003  1003 1.605904
1004  1004 3.603535
1006  1006 2.943456
1007  1007 1.586542

But is there a more "efficient" way to do this?

Thanks!

hist itself can provide the solution to the question's problem of finding out which data points are in which intervals. hist returns a list whose first member, breaks, holds the bin boundaries.

First, make the problem reproducible by setting the RNG seed.

set.seed(2021)
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)

Now, save the return value of hist and have findInterval report which bin each data point falls in.

h1 <- hist(d, breaks = 10)
f1 <- findInterval(d, h1$breaks)

h1$breaks
# [1] -2  0  2  4  6  8 10 12 14 16

head(f1)
#[1] 6 7 7 7 7 6

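The first six observations fall in intervals 6 and 7, whose end points are 8, 10 and 12. Note that indexing d by f1 uses the bin numbers as positions, so the call below returns d[6] and d[7] (which themselves lie in those two intervals) rather than the first six observations; use head(d) to see those: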

head(d[f1])
#[1]  8.07743 10.26174 10.26174 10.26174 10.26174  8.07743

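As for whether the intervals given by end points 8, 10 and 12 are left- or right-closed, see help("findInterval"). By default findInterval treats the intervals as left-closed, [a, b), whereas hist by default counts right-closed intervals, (a, b], so a value lying exactly on a break can be assigned to different bins by the two functions.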
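The difference between the two conventions only shows up for values that sit exactly on a break. The following is a minimal sketch on made-up toy data (the vectors x and brks are illustrative, not from the question), assuming hist is called with its defaults right = TRUE and include.lowest = TRUE:

```r
# Toy data with several values lying exactly on the break points
x    <- c(0, 1, 2, 2, 3, 4)
brks <- c(0, 2, 4)

# hist defaults: bins are [0, 2] and (2, 4]
h <- hist(x, breaks = brks, plot = FALSE)
h$counts

# findInterval defaults: bins are [0, 2) and [2, 4); x == 4 gets index 3
f_default <- findInterval(x, brks)
f_default

# left.open = TRUE makes the intervals (a, b]; combined with it,
# rightmost.closed = TRUE closes the leftmost interval, giving [0, 2], (2, 4]
f_match <- findInterval(x, brks, left.open = TRUE, rightmost.closed = TRUE)
f_match

# The adjusted call reproduces the histogram counts
tabulate(f_match, nbins = length(h$counts))
```

For continuous data such as the simulated d, ties with a break are very unlikely, so the default call usually agrees with the histogram anyway.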

As a final check, table the values returned by findInterval and see if they match the histogram's counts.

table(f1)
#f1
#  1   2   3   4   5   6   7   8   9 
#  2  34 130  34  17 478 512 169  24 
h1$counts
#[1]   2  34 130  34  17 478 512 169  24
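Once the bin numbers are known, the observations in each bar can be pulled out in one step instead of one subset per bar. This is a sketch of one possible way to do it, using split on the question's my_data layout (the name by_bar is introduced here for illustration):

```r
set.seed(2021)
d <- c(rnorm(1000, 10, 1), rnorm(200, 3, 1), rnorm(200, 13, 1))
my_data <- data.frame(index = seq_along(d), d = d)

# plot = FALSE computes the bins without drawing the histogram
h1 <- hist(my_data$d, breaks = 10, plot = FALSE)
f1 <- findInterval(my_data$d, h1$breaks)

# One data frame per bar, named by bin number ("1", "2", ...)
by_bar <- split(my_data, f1)

# Rows per bar; these should agree with h1$counts
sapply(by_bar, nrow)

# Observations located in the third bar
head(by_bar[["3"]])
```

Each element of by_bar plays the role of the hand-made histogram3_bar1, histogram3_bar2, ... objects from the question.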

To get the interval end points for each data point, build a data frame from the bin numbers and the breaks:

bins <- data.frame(bin = f1, min = h1$breaks[f1], max = h1$breaks[f1 + 1L])
head(bins)
#  bin min max
#1   6   8  10
#2   7  10  12
#3   7  10  12
#4   7  10  12
#5   7  10  12
#6   6   8  10
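An alternative that labels each observation with its interval directly is cut, fed with the breaks that hist computed. The sketch below wraps this in a small helper; obs_by_bar is a hypothetical name, not a base R function, and the right = TRUE, include.lowest = TRUE arguments are chosen on the assumption that hist was called with its defaults:

```r
# Hypothetical helper: label every value of x with the histogram bar it falls in
obs_by_bar <- function(x, breaks) {
  h   <- hist(x, breaks = breaks, plot = FALSE)   # compute bins only
  bin <- cut(x, breaks = h$breaks, right = TRUE, include.lowest = TRUE)
  data.frame(index = seq_along(x), value = x, bin = bin)
}

set.seed(2021)
d <- c(rnorm(1000, 10, 1), rnorm(200, 3, 1), rnorm(200, 13, 1))

# Same call works for any of the three histograms in the question
res5 <- obs_by_bar(d, breaks = 5)

# Number of observations per bar, labelled by interval
table(res5$bin)

# Indices of the observations in each bar
idx_by_bar <- split(res5$index, res5$bin)
str(idx_by_bar)
```

Because the bin column is a factor with one level per bar, empty bars still appear (with a count of zero) in the table and in the split result.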
