Error when generating histogram in R
I have a text file containing:
Tue Feb 11 12:19:39 +0000 2014
Tue Feb 11 12:19:56 +0000 2014
Tue Feb 11 12:20:04 +0000 2014
and I read it into R:
dataset <- read.csv("Time.txt")
In order for R to recognise the timestamps in the file, I write:
time <- strptime(dataset[,1], format = "%a %b %d %H:%M:%S %z %Y")
Whenever I try to plot a histogram with:
hist(time, breaks = 100)
it produces an error together with a generated histogram:
In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow
What could be the issue that is prompting this error?
Since you asked what could be causing the error, here it is:
The error is created when the hist.default function calculates the midpoints of the histogram. The vector mids <- 0.5 * (breaks[-1L] + breaks[-nB]) holds the halfway point between each pair of adjacent breaks. The issue arises because the breaks are generated as integers:
If the argument breaks is numeric and has length == 1, then the hist.default function (which is called by hist.POSIXt) creates a vector of breaks based on the range of x and the number of breaks. This is done using the pretty command. For reasons I have not looked into too closely, if breaks is small enough that pretty(range(x), n = breaks, min.n = 1) returns only one of each value, e.g.:
pretty(range(x), n = 35, min.n = 1)
#[1] 1392121179 1392121180 1392121181 1392121182 1392121183 1392121184
#[7] 1392121185 1392121186 1392121187 1392121188 1392121189 1392121190
#[13] 1392121191 1392121192 1392121193 1392121194 1392121195 1392121196
#[19] 1392121197 1392121198 1392121199 1392121200 1392121201 1392121202
#[25] 1392121203 1392121204
then the output is of integer type. If, however, the number of breaks is larger and some of the outputs are duplicated:
pretty(range(x), n = 36, min.n = 1)
# [1] 1392121179 1392121180 1392121180 1392121181 1392121181 1392121182
# [7] 1392121182 1392121183 1392121183 1392121184 1392121184 1392121185
#[13] 1392121185 1392121186 1392121186 1392121187 1392121187 1392121188
#[19] 1392121188 1392121189 1392121189 1392121190 1392121190 1392121191
#[25] 1392121191 1392121192 1392121192 1392121193 1392121193 1392121194
#[31] 1392121194 1392121195 1392121195 1392121196 1392121196 1392121197
#[37] 1392121197 1392121198 1392121198 1392121199 1392121199 1392121200
#[43] 1392121200 1392121201 1392121201 1392121202 1392121202 1392121203
#[49] 1392121203 1392121204 1392121204
then the output is numeric.
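To check this on your own installation, a minimal sketch along the following lines may help. It assumes x holds the three sample timestamps from the question as seconds since 1970-01-01 UTC; the variable names few and many are only illustrative. The storage type that pretty returns may differ between R versions, so no expected output is shown. The "duplicated" values in the second output are most likely half-second steps that the default print rounds to whole seconds, which printing with more digits should reveal.

```r
# The three sample timestamps from the question, as seconds since the epoch (UTC)
x <- c(1392121179, 1392121196, 1392121204)

# Same calls as in the answer: a smaller and a larger requested number of breaks
few  <- pretty(range(x), n = 35, min.n = 1)
many <- pretty(range(x), n = 36, min.n = 1)

# Inspect how each result is stored ("integer" or "double")
typeof(few)
typeof(many)

# Show the spacing of the larger set and print it with more digits
unique(diff(many))
print(head(many), digits = 12)
```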
Because R uses 32-bit integer types and POSIXt integers are large numbers, adding two POSIXt integers results in an overflow that R can't handle, so it returns NA. When pretty returns numeric, this is not a problem.
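The overflow can be reproduced without hist at all. The sketch below applies the same midpoint formula to a hand-built integer vector covering the break values shown above; the names mids_int and mids_num are illustrative and not part of hist.default.

```r
# Integer breaks covering the same range as the pretty() output above
breaks <- 1392121179L:1392121204L
nB <- length(breaks)

# Largest value an R integer can hold; the sum of two breaks exceeds it
.Machine$integer.max

# Same formula as in hist.default: the integer addition overflows,
# so every midpoint becomes NA (the warning is suppressed here)
mids_int <- suppressWarnings(0.5 * (breaks[-1L] + breaks[-nB]))
all(is.na(mids_int))

# Converting to double first avoids the overflow
breaks_num <- as.numeric(breaks)
mids_num <- 0.5 * (breaks_num[-1L] + breaks_num[-nB])
head(mids_num)
```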
See also: What is integer overflow in R and how can it happen?
In practice, all this means is that, if you print out the hist structure returned, all of your mids values will be NA, but I don't think it actually affects the plotting of the histogram. Thus it is only a warning.
EDIT: pretty internally uses seq.int
In my environment, it does not generate any errors.
dataset <- read.csv("Time.txt", header = FALSE)
time <- strptime(dataset[,1], format = "%a %b %d %H:%M:%S %z %Y")
hist(as.numeric(time), breaks = 100)
Perhaps if you just convert time to numeric as above, the error will disappear. Then it is straightforward to change the x-axis of the histogram.
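One possible way to relabel the x-axis is sketched below. It uses the three sample timestamps inline instead of Time.txt so that it runs on its own, and it assumes an English locale for parsing the day and month names. The names stamps, secs, ticks and h are illustrative.

```r
# Sample timestamps from the question, inlined instead of read from Time.txt
stamps <- c("Tue Feb 11 12:19:39 +0000 2014",
            "Tue Feb 11 12:19:56 +0000 2014",
            "Tue Feb 11 12:20:04 +0000 2014")
time <- strptime(stamps, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

# Seconds since 1970-01-01 UTC, stored as doubles
secs <- as.numeric(as.POSIXct(time))

# Draw the histogram without the default numeric x-axis
h <- hist(secs, breaks = 100, xaxt = "n",
          main = "Timestamps", xlab = "Time (UTC)")

# Add an axis whose tick labels are formatted as clock times
ticks <- pretty(secs)
axis(1, at = ticks,
     labels = format(as.POSIXct(ticks, origin = "1970-01-01", tz = "UTC"),
                     "%H:%M:%S"))
```

Printing h$mids afterwards is a quick way to confirm that the midpoints are no longer NA.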
EDIT: ggplot2 should not face this issue and is much simpler and more modern:
ggplot(dataset) + geom_histogram(aes(x = V1), stat = "count", bins = 100)
Here V1 is the default name that read.csv() gives the only column of dataset when the file is read with header = FALSE.