Error when generating histogram in R

I have a text file containing:

Tue Feb 11 12:19:39 +0000 2014
Tue Feb 11 12:19:56 +0000 2014
Tue Feb 11 12:20:04 +0000 2014

and I read it into R:

dataset <- read.csv("Time.txt")

For R to recognise the timestamps in the file, I write:

time <- strptime(dataset[,1], format = "%a %b %d %H:%M:%S %z %Y")

Whenever I try to plot a histogram with:

hist(time, breaks = 100)

it produces an error alongside the generated histogram:

In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow

What could be the issue that is prompting this error?

Since you asked what could be causing the error, here it is:

The message is raised when the hist.default function calculates the midpoints of the histogram bins. The line mids <- 0.5 * (breaks[-1L] + breaks[-nB]) computes the halfway point between each pair of adjacent breaks. The issue arises because the breaks are generated as integers:

If the argument breaks is numeric and has length 1, then hist.default (which is called by hist.POSIXt) creates a vector of breaks from the range of x and the requested number of breaks, using the pretty function. For reasons I have not looked into too closely, if breaks is small enough that pretty(range(x), n = breaks, min.n = 1) returns only whole-number values with no duplicates, e.g.:

pretty(range(x), n = 35, min.n = 1)
#[1] 1392121179 1392121180 1392121181 1392121182 1392121183 1392121184
#[7] 1392121185 1392121186 1392121187 1392121188 1392121189 1392121190
#[13] 1392121191 1392121192 1392121193 1392121194 1392121195 1392121196
#[19] 1392121197 1392121198 1392121199 1392121200 1392121201 1392121202
#[25] 1392121203 1392121204

then the output has type integer. If, however, the number of breaks is larger and some of the printed values appear duplicated (the breaks are then fractional and only look identical after rounding for display):

pretty(range(x), n = 36, min.n = 1)
# [1] 1392121179 1392121180 1392121180 1392121181 1392121181 1392121182
# [7] 1392121182 1392121183 1392121183 1392121184 1392121184 1392121185
#[13] 1392121185 1392121186 1392121186 1392121187 1392121187 1392121188
#[19] 1392121188 1392121189 1392121189 1392121190 1392121190 1392121191
#[25] 1392121191 1392121192 1392121192 1392121193 1392121193 1392121194
#[31] 1392121194 1392121195 1392121195 1392121196 1392121196 1392121197
#[37] 1392121197 1392121198 1392121198 1392121199 1392121199 1392121200
#[43] 1392121200 1392121201 1392121201 1392121202 1392121202 1392121203
#[49] 1392121203 1392121204 1392121204

then the output has type numeric (double).

Because R uses 32-bit integers (the largest representable value is 2147483647) and POSIXt timestamps stored as integers are large numbers (about 1.39 billion here), adding two of them overflows, and R returns NA. When pretty returns a numeric (double) vector, this is not a problem.
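The overflow can be reproduced without hist at all. The following is a minimal sketch, assuming two neighbouring break values taken from the output above; the variable names a, b, int_sum and dbl_mid are illustrative only and do not come from hist.default.

```r
# Two neighbouring breaks, stored as 32-bit integers (note the L suffix)
a <- 1392121179L
b <- 1392121180L

# Largest value an R integer can hold
.Machine$integer.max

# Integer addition: the true sum (2784242359) does not fit, so R yields NA.
# suppressWarnings() hides the "NAs produced by integer overflow" warning.
int_sum <- suppressWarnings(a + b)
is.na(int_sum)

# Converting to double first avoids the overflow, so the midpoint is computed
dbl_mid <- 0.5 * (as.numeric(a) + as.numeric(b))
format(dbl_mid, nsmall = 1)
```

The same arithmetic on doubles is safe because a double can represent whole numbers far beyond the 32-bit integer range.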

See also: What is integer overflow in R and how can it happen?

In practice, all this means is that if you print the structure returned by hist, all of your mids values will be NA, but I don't think it actually affects the plotting of the histogram. That is why it is only a warning rather than an error.
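To check this on your own data, the returned structure can be inspected without drawing anything. This is a rough sketch, assuming three synthetic timestamps that mirror the ones in the question (built directly in UTC rather than read from Time.txt); whether mids contains NA depends on the data and the R version, so no output is claimed here.

```r
# Synthetic stand-in for the question's data: epoch seconds as doubles
x <- as.numeric(as.POSIXct("2014-02-11 12:19:39", tz = "UTC")) + c(0, 17, 25)

# plot = FALSE returns the histogram object without opening a graphics device
h <- hist(x, breaks = 100, plot = FALSE)

# Inspect the pieces discussed above
typeof(h$breaks)   # storage type of the generated breaks
anyNA(h$mids)      # TRUE would indicate the overflow described above
str(h$mids)
```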

EDIT: pretty internally uses seq.int

In my environment, it does not generate any errors.

dataset <- read.csv("Time.txt", header = FALSE)
time <- strptime(dataset[,1], format = "%a %b %d %H:%M:%S %z %Y")
hist(as.numeric(time), breaks = 100)

Perhaps if you just convert time to numeric as above, the error will disappear. It is then straightforward to relabel the x-axis of the histogram with times.
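One possible way to do that relabelling is sketched below. It assumes synthetic timestamps in UTC in place of Time.txt, and the names secs, ticks and labs are made up for the illustration; the tick positions and the "%H:%M:%S" label format are arbitrary choices.

```r
# Synthetic stand-in for the parsed timestamps (POSIXct, UTC)
time <- as.POSIXct("2014-02-11 12:19:39", tz = "UTC") + c(0, 17, 25)

# Work on plain doubles so the breaks and midpoints are computed in double precision
secs <- as.numeric(time)

# Draw the histogram without the default numeric x-axis
h <- hist(secs, breaks = 100, xaxt = "n",
          xlab = "Time (UTC)", main = "Histogram of timestamps")

# Choose tick positions in epoch seconds and turn them back into clock times
ticks <- pretty(secs)
labs <- format(as.POSIXct(ticks, origin = "1970-01-01", tz = "UTC"), "%H:%M:%S")
axis(1, at = ticks, labels = labs)
```

Keeping the data numeric and converting only the axis labels means the binning never touches integer arithmetic.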

EDIT: ggplot2 should not face this issue, and is simpler and more modern:

ggplot(dataset) + geom_histogram(aes(x = V1), stat = "count", bins = 100)

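Here V1 is the default name of the single column of dataset created by read.csv() when it is called with header = FALSE. Note that with stat = "count" each distinct timestamp is counted as its own category, so the bins argument is not used.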
