In R it is possible to plot a histogram and save its properties to a variable:
> h1=hist(c(1,1,2,3,4,5,5), breaks=0.5:5.5)
Those properties can then be read:
> h1
$breaks
[1] 0.5 1.5 2.5 3.5 4.5 5.5
$counts
[1] 2 1 1 1 2
$density
[1] 0.2857143 0.1428571 0.1428571 0.1428571 0.2857143
$mids
[1] 1 2 3 4 5
$xname
[1] "c(1, 1, 2, 3, 4, 5, 5)"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
How do those properties affect the histogram? So far I've figured out the following:
The relationship between $breaks and $counts. $breaks represents the intervals into which the plotted data may fall, and $counts represents the number of data points that have fallen into each interval, for example:
[] denotes closed interval (endpoints are included)
() denotes open interval (endpoints are not included)
BREAKS : COUNTS
[0.5-1.5] : 2 # There are two 1s that fall into this interval
(1.5-2.5] : 1 # There is one 2 that falls into this interval
(2.5-3.5] : 1 # There is one 3 that falls into this interval
(3.5-4.5] : 1 # There is one 4 that falls into this interval
(4.5-5.5] : 2 # There are two 5s that fall into this interval
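One way to check this reading of $breaks and $counts is to bin the same data by hand. This is only a minimal sketch, assuming that cut() with include.lowest=TRUE follows the same interval rule as the default hist() call; the names x, b and by_hand are illustrative and not part of the original example.

```r
# Same data and breaks as in the example above
x <- c(1, 1, 2, 3, 4, 5, 5)
b <- 0.5:5.5

# plot = FALSE computes the histogram object without drawing it
h <- hist(x, breaks = b, plot = FALSE)

# Bin by hand: right-closed intervals, leftmost interval also includes its left end
by_hand <- table(cut(x, breaks = b, include.lowest = TRUE))

print(by_hand)   # interval labels with their counts
print(h$counts)  # counts as reported by hist()
```

If both outputs show the same counts per interval, the interpretation above holds for this data.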
The relationship between $breaks and $density is basically the same as above, but written as fractions of the total, for example:
BREAKS : DENSITY
[0.5-1.5] : 0.2857143 # This interval covers about 28% of the data
(1.5-2.5] : 0.1428571 # This interval covers about 14% of the data
(2.5-3.5] : 0.1428571 # This interval covers about 14% of the data
(3.5-4.5] : 0.1428571 # This interval covers about 14% of the data
(4.5-5.5] : 0.2857143 # This interval covers about 28% of the data
Of course when you sum all those values you will get 1:
> sum(h1$density)
[1] 1
The following stands for the x-axis name:
$xname
[1] "c(1, 1, 2, 3, 4, 5, 5)"
But what do the remaining ones do, especially $mids?
$mids
[1] 1 2 3 4 5
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Also, help(hist) lists many other values. Shouldn't they also appear in the output above, and if not, why? As explained in the following article:
By default, bin counts include values less than or equal to the bin's right break point and strictly greater than the bin's left break point, except for the leftmost bin, which includes its left break point.
So the following:
h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5)
will return a histogram where 1.5 falls into the 0.5-1.5 interval. One "workaround" is to make the intervals smaller, e.g.
h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=seq(0.5,5.5,0.1))
but this seems impractical to me, and it also adds a bunch of 0s to $counts and $density. Is there a better, automatic way?
Apart from this, it also has one side effect that I cannot explain: why do the densities in the last example sum to 10 and not 1?
> sum(h1$density)
[1] 10
> h1$density[h1$density>0]
[1] 2.50 1.25 1.25 1.25 1.25 2.50
Q1: What do $mids and $equidist mean? From the help file:
mids: the n cell midpoints.
equidist: logical, indicating if the distances between breaks are all the same.
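To make the two definitions concrete, both values can be recomputed from $breaks. This is a rough sketch rather than the actual internals of hist(); the helper names mids_by_hand and equidist_by_hand are made up for illustration.

```r
h1 <- hist(c(1, 1, 2, 3, 4, 5, 5), breaks = 0.5:5.5, plot = FALSE)
b <- h1$breaks

# Midpoint of each cell: average of its left and right break
mids_by_hand <- (b[-length(b)] + b[-1]) / 2

# TRUE when all cells have (numerically) the same width
equidist_by_hand <- isTRUE(all.equal(diff(b), rep(diff(b)[1], length(b) - 1)))

print(mids_by_hand)
print(h1$mids)
print(equidist_by_hand)
print(h1$equidist)
```

So $mids is mainly useful as the x position of each bar, for example when adding labels or points on top of the histogram.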
Q2: Yes, with h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5), 1.5 will fall into the 0.5-1.5 category. If you want it to fall into the 1.5-2.5 category, you should use:
h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.49:5.49)
or much neater:
h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5, right=FALSE)
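The effect of right=FALSE can be seen by comparing $counts for both settings. A small sketch using plot=FALSE so that nothing is drawn; the variable names are only for illustration.

```r
x <- c(1, 1, 2, 3, 4, 5, 5, 1.5)
b <- 0.5:5.5

# Default: intervals are closed on the right, so 1.5 lands in the first bin
h_right <- hist(x, breaks = b, plot = FALSE)

# right = FALSE: intervals are closed on the left, so 1.5 lands in the second bin
h_left <- hist(x, breaks = b, right = FALSE, plot = FALSE)

print(h_right$counts)  # 3 1 1 1 2
print(h_left$counts)   # 2 2 1 1 2
```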
I'm not sure what you want to automate here, but hopefully the above answers your question. If not, please be more clear about your question.
Q3: About the densities summing to 10 and not 1: that is because densities are not frequencies. From the help file:
density: values f^(x[i]), as estimated density values. If all(diff(breaks) == 1), they are the relative frequencies counts/n and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = breaks[i].
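So if the distances between your breaks are not equal to 1, the densities will not sum to 1. What sums to 1 is density multiplied by bin width; with bins of width 0.1, the densities alone sum to 10.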
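The relationship from the help file can be checked numerically on the 0.1-wide bins from the question. This sketch assumes the density is counts divided by (n times bin width), which is what the quoted formula implies; the names widths and dens_by_hand are illustrative.

```r
x <- c(1, 1, 2, 3, 4, 5, 5, 1.5)
h <- hist(x, breaks = seq(0.5, 5.5, 0.1), plot = FALSE)

widths <- diff(h$breaks)   # all approximately 0.1
n <- sum(h$counts)         # number of observations

# Density by hand: relative frequency divided by bin width
dens_by_hand <- h$counts / (n * widths)

print(sum(h$density))           # sum of the bar heights
print(sum(h$density * widths))  # total area of the bars
print(all.equal(dens_by_hand, h$density))
```

The first sum is the total bar height, the second is the total bar area, and it is the area that the help file says equals 1.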