
Variables that affect histogram plotted with hist() function in R

In R it is possible to plot a histogram and save its properties to a variable:

> h1=hist(c(1,1,2,3,4,5,5), breaks=0.5:5.5)

Those properties can then be read:

> h1
$breaks
[1] 0.5 1.5 2.5 3.5 4.5 5.5

$counts
[1] 2 1 1 1 2

$density
[1] 0.2857143 0.1428571 0.1428571 0.1428571 0.2857143

$mids
[1] 1 2 3 4 5

$xname
[1] "c(1, 1, 2, 3, 4, 5, 5)"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

How do those properties affect the histogram? So far I've figured out the following:

The relationship between $breaks and $counts. $breaks gives the boundaries of the intervals into which the plotted data may fall, and $counts gives the number of data points that fall into each interval, for example:

[] denotes a closed interval (endpoints are included)

() denotes an open interval (endpoints are not included)

BREAKS  : COUNTS
[0.5-1.5] : 2 # Two values of 1 fall into this interval
(1.5-2.5] : 1 # One value of 2 falls into this interval
(2.5-3.5] : 1 # One value of 3 falls into this interval
(3.5-4.5] : 1 # One value of 4 falls into this interval
(4.5-5.5] : 2 # Two values of 5 fall into this interval
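
The interval table above can be reproduced without plotting. This is a minimal sketch, assuming that `cut()` with `right = TRUE` and `include.lowest = TRUE` mirrors the default binning rule of `hist()`; the variable names `x`, `h`, `bins` and `tab` are only for illustration.

```r
# Same data and breaks as in the question
x <- c(1, 1, 2, 3, 4, 5, 5)

# plot = FALSE returns the "histogram" object without drawing anything
h <- hist(x, breaks = 0.5:5.5, plot = FALSE)

# Assign each value to an interval: right-closed bins, with the
# leftmost bin also including its left endpoint
bins <- cut(x, breaks = 0.5:5.5, right = TRUE, include.lowest = TRUE)

# Tabulate how many values fall into each interval
tab <- table(bins)
print(tab)

# The tabulated counts should agree with h$counts
print(all(as.vector(tab) == h$counts))
```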

The relationship between $breaks and $density seems to be basically the same as above, but written as proportions, for example:

BREAKS  : DENSITY
[0.5-1.5] : 0.2857143 # This interval covers about 28% of the plot
(1.5-2.5] : 0.1428571 # This interval covers about 14% of the plot
(2.5-3.5] : 0.1428571 # This interval covers about 14% of the plot
(3.5-4.5] : 0.1428571 # This interval covers about 14% of the plot
(4.5-5.5] : 0.2857143 # This interval covers about 28% of the plot

Of course, when you sum all those values you get 1:

> sum(h1$density)
[1] 1

The following is the x-axis name:

$xname
[1] "c(1, 1, 2, 3, 4, 5, 5)"

But what do the remaining ones do, especially $mids?

$mids
[1] 1 2 3 4 5

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

Also, help(hist) lists many other items. Shouldn't they also appear in the above output, and if not, why?

As explained in the following article:

By default, bin counts include values less than or equal to the bin's right break point and strictly greater than the bin's left break point, except for the leftmost bin, which includes its left break point.

So the following:

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5)

will return a histogram where 1.5 falls into the 0.5-1.5 interval. One "workaround" is to make the intervals smaller, e.g.

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=seq(0.5,5.5,0.1))

but this seems impractical to me, and it also adds a bunch of zeros to $counts and $density. Is there a better, automatic way?

Besides this, it also has one side effect that I cannot explain: why do the densities in the last example sum to 10 and not 1?

> sum(h1$density)
[1] 10
> h1$density[h1$density>0]
[1] 2.50 1.25 1.25 1.25 1.25 2.50

Q1: What do $mids and $equidist mean? From the help file:

mids: the n cell midpoints.

equidist: logical, indicating if the distances between breaks are all the same.
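
Both components can be recomputed from $breaks alone. The sketch below is illustrative: the midpoint formula follows directly from the help text, while the relative tolerance used for the equal-spacing check (`1e-7`) is an assumption about how such a comparison might be done, not a statement of what `hist()` does internally. The names `b`, `w`, `my_mids` and `my_equidist` are hypothetical.

```r
x <- c(1, 1, 2, 3, 4, 5, 5)
h <- hist(x, breaks = 0.5:5.5, plot = FALSE)

b <- h$breaks

# Midpoint of each cell: average of its left and right break
my_mids <- (head(b, -1) + tail(b, -1)) / 2
print(my_mids)

# Cell widths; equidist is TRUE when all widths are (numerically) the same
w <- diff(b)
my_equidist <- diff(range(w)) < 1e-7 * mean(w)  # tolerance is an assumption
print(my_equidist)

# Unequal breaks for comparison: the widths are 1, 2 and 2
h2 <- hist(x, breaks = c(0.5, 1.5, 3.5, 5.5), plot = FALSE)
print(h2$equidist)
```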


Q2: Yes, with h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5) the value 1.5 will fall into the 0.5-1.5 category. If you want it to fall into the 1.5-2.5 category, you should use:

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.49:5.49)

or much neater:

h1=hist(c(1,1,2,3,4,5,5,1.5), breaks=0.5:5.5, right=FALSE)
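
To see the effect of `right` on the boundary value, the two calls can be compared side by side. This is a small sketch using `plot = FALSE` so that only the counts are computed; the variable names are illustrative.

```r
x <- c(1, 1, 2, 3, 4, 5, 5, 1.5)

# Default: bins are right-closed, so 1.5 lands in the first bin
h_right <- hist(x, breaks = 0.5:5.5, plot = FALSE)

# right = FALSE: bins are left-closed, so 1.5 lands in the second bin
h_left <- hist(x, breaks = 0.5:5.5, right = FALSE, plot = FALSE)

print(h_right$counts)
print(h_left$counts)
```

Only the first two counts differ between the two results, because 1.5 is the only value that sits exactly on a break.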

I'm not sure what you want to automate here, but hopefully the above answers your question. If not, please be more clear about your question.


Q3: The densities sum to 10 and not 1 because densities are not frequencies. From the help file:

density: values f^(x[i]), as estimated density values. If all(diff(breaks) == 1), they are the relative frequencies counts/n and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = breaks[i].

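So if the distances between your breaks are not equal to 1, the densities will not sum to 1. In your last example the bins are 0.1 wide, so the densities sum to 10, and 10 * 0.1 = 1.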
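
The relation quoted from the help file can be checked numerically. The following is a minimal sketch that recomputes the densities as counts divided by (n times bin width) and confirms that the bar areas, not the bar heights, sum to 1; the names `w`, `dens` and `area` are illustrative.

```r
x <- c(1, 1, 2, 3, 4, 5, 5, 1.5)
h <- hist(x, breaks = seq(0.5, 5.5, 0.1), plot = FALSE)

# Bin widths (all approximately 0.1 here)
w <- diff(h$breaks)

# Density of each bin: relative frequency divided by bin width
dens <- h$counts / (sum(h$counts) * w)
print(isTRUE(all.equal(dens, h$density)))

# Heights alone sum to about 10 because each bin is only 0.1 wide ...
print(sum(h$density))

# ... but the total area (height times width) is 1
area <- sum(h$density * w)
print(area)
```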

