
Frequency count histogram displaying only integer values on the y-axis?

I'd much appreciate anyone's help to resolve this question please. It seems like it should be so simple, but after many hours experimenting, I've had to stop in and ask for help. Thank you very much in advance!

Summary of question:

How can one ensure in ggplot2 the y-axis of a histogram is labelled using only integers (frequency count values) and not decimals?

The functions, arguments and datatype changes tried so far include:

  • geom_histogram(), geom_bar() and geom_col() - in each case with and without the argument stat = "identity" where relevant.
  • adding + scale_y_discrete(), with or without + scale_x_discrete()
  • converting the underlying count data to a factor and/or the bin data to a factor

Ideally, the solution would use only base R or ggplot2, rather than additional external dependencies such as pretty_breaks() from the scales package, or similar.

Sample data:

sample <- data.frame(binMidPts = c(4500,5500,6500,7500), counts = c(8,0,9,3))

The x-axis consists of bins of a continuous variable, and the y-axis is intended to show the count of observations in those bins. For example, Bin 1 covers the x-axis range [4000 <= x < 5000], has a mid-point 4500, with 8 data points observed in that bin / range.

Code that almost works:

The following code generates a graph similar to the one I'm seeking, however the y-axis is labelled with decimal values on the breaks (which aren't valid as the data are integer count values).

ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col()

Graph produced by this code: [image: simple geom_col plot with an "incorrect" continuous y-axis]

I realise I could hard-code the breaks / labels onto a scale_y_continuous() axis but (a) I'd prefer a flexible solution that applies to many differently sized datasets where the scale isn't known in advance, and (b) I expect there must be a simpler way to generate a basic histogram.

References

I've consulted many Stack Overflow questions, the ggplot2 manual (https://ggplot2.tidyverse.org/reference/scale_discrete.html), the sthda.com examples and various blogs. These tend to address related problems, eg using scale_y_continuous(), or cases where count data is not available in the underlying dataset and which thus rely on stat_bin() for a transformation.

Any help would be much appreciated. Thank you.

// Update 1 - Extending scale to zero

Future readers of this thread may find it helpful to know that the breaks produced by base::pretty() do not necessarily extend to zero. The axis may therefore show no breaks between zero and the lowest break, as shown here: [image: y-axis breaks omitted below the lower limit of pretty()]

To resolve this, I included '0' in the range() parameter, ie:

ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col() +
    scale_y_continuous(breaks=round(pretty(range(0,sample$counts))))

which gives the desired full scale on the y-axis, thus:

[image: y-axis scale extending to zero]
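For anyone wanting to see the difference without plotting, here is a minimal base R sketch. The counts vector is hypothetical (not the sample data from the question, whose minimum is already 0) and was chosen so that its minimum sits well above zero.

```r
# Hypothetical counts whose smallest value is well above zero
counts <- c(8, 5, 9, 6)

# Breaks computed from the data range alone start at the smallest count
without_zero <- pretty(range(counts))
without_zero   # 5 6 7 8 9

# Adding 0 to range() forces the breaks to start at zero
with_zero <- pretty(range(0, counts))
with_zero      # 0 2 4 6 8 10
```

range() combines all of its arguments before taking the minimum and maximum, which is why passing 0 alongside the counts is enough.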

How about:


ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col() +
    scale_y_continuous( breaks=round(pretty( range(sample$counts) )) )


This answer suggests pretty_breaks() from the scales package. The manual page of pretty_breaks() mentions pretty() from base R, and from there you just have to round the result to the nearest integer.
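One caveat, offered as a sketch rather than a definitive fix: when the counts are very small, pretty() can return half-steps, and rounding them produces duplicate breaks. An alternative is to keep only the values that are already whole numbers. The helper name int_breaks below is made up for illustration and is not part of ggplot2 or base R.

```r
# With a small range, pretty() returns half-steps ...
pretty(c(0, 2))          # 0.0 0.5 1.0 1.5 2.0
# ... and rounding them yields repeated values (R rounds halves to even)
round(pretty(c(0, 2)))   # 0 0 1 2 2

# Hypothetical helper: take the axis limits, always include zero,
# and keep only the breaks that are whole numbers
int_breaks <- function(limits) {
  b <- pretty(c(0, limits))
  b[b == round(b)]
}

small <- int_breaks(c(0, 2))   # 0 1 2
large <- int_breaks(c(0, 9))   # 0 2 4 6 8 10
```

Because the breaks argument of scale_y_continuous() also accepts a function of the scale limits, this helper could be passed as scale_y_continuous(breaks = int_breaks), so nothing has to be hard-coded per dataset.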

Or you can calculate the breaks with rules customized to the dataset you are working with, like this:

library(ggplot2)

breaks_min <- 0
breaks_max <- max(sample[["counts"]])
# Assume about 5 intervals is preferable; keep the step at 1 or more
# so that seq() never receives a zero step when the counts are small
breaks_bin <- max(1, round((breaks_max - breaks_min) / 5))
custom_breaks <- seq(breaks_min, breaks_max, by = breaks_bin)

ggplot(data = sample, aes(x = binMidPts, y = counts)) +
  geom_col() +
  scale_y_continuous(breaks = custom_breaks, expand = c(0, 0))

Created on 2021-04-28 by the reprex package (v2.0.0)

The default y-axis breaks are calculated with scales::extended_breaks(). This function factory has a ... argument that passes arguments on to labeling::extended(), which has a Q argument for what it considers 'nice numbers'. If you omit the 2.5 from the default, you should get integer breaks when the range is 3 or larger.

library(ggplot2)
library(scales)

sample <- data.frame(binMidPts = c(4500,5500,6500,7500), counts = c(8,0,9,3))

ggplot(data = sample, aes (x = binMidPts, y = counts)) + 
  geom_col() +
  scale_y_continuous(
    breaks = extended_breaks(Q = c(1, 5, 2, 4, 3))
  )

Created on 2021-04-28 by the reprex package (v1.0.0)
