Error when generating histogram in R

I have a text file containing:

Tue Feb 11 12:19:39 +0000 2014
Tue Feb 11 12:19:56 +0000 2014
Tue Feb 11 12:20:04 +0000 2014

and I read it into R:

dataset <- read.csv("Time.txt")

In order for R to recognise the timestamps in the file, I write:

time <- strptime(dataset[,1], format = "%a %b %d %H:%M:%S %z %Y")

Whenever I try to plot a histogram with:

hist(time, breaks = 100)

it produces the following error together with the generated histogram:

In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow

What could be causing this error?

Since you asked what could be causing the error, here it is:

The error arises when the hist.default function calculates the midpoints of the histogram bins. The line mids <- 0.5 * (breaks[-1L] + breaks[-nB]) calculates the halfway point between each pair of adjacent breaks. The issue arises because the breaks are generated as integers:

If the breaks argument is numeric and of length 1, the hist.default function (which is called by hist.POSIXt) creates a vector of breaks based on the range of x and the number of breaks. This is done using the pretty function. For reasons I have not looked into too closely, if breaks is small enough that pretty(range(x), n = breaks, min.n = 1) returns only one of each value, e.g.:

pretty(range(x), n = 35, min.n = 1)
#[1] 1392121179 1392121180 1392121181 1392121182 1392121183 1392121184
#[7] 1392121185 1392121186 1392121187 1392121188 1392121189 1392121190
#[13] 1392121191 1392121192 1392121193 1392121194 1392121195 1392121196
#[19] 1392121197 1392121198 1392121199 1392121200 1392121201 1392121202
#[25] 1392121203 1392121204

then the output is of integer type. If, however, the number of breaks is larger and some of the printed values appear duplicated (the step between breaks is then smaller than one second, and the fractional part is lost when printing):

pretty(range(x), n = 36, min.n = 1)
# [1] 1392121179 1392121180 1392121180 1392121181 1392121181 1392121182
# [7] 1392121182 1392121183 1392121183 1392121184 1392121184 1392121185
#[13] 1392121185 1392121186 1392121186 1392121187 1392121187 1392121188
#[19] 1392121188 1392121189 1392121189 1392121190 1392121190 1392121191
#[25] 1392121191 1392121192 1392121192 1392121193 1392121193 1392121194
#[31] 1392121194 1392121195 1392121195 1392121196 1392121196 1392121197
#[37] 1392121197 1392121198 1392121198 1392121199 1392121199 1392121200
#[43] 1392121200 1392121201 1392121201 1392121202 1392121202 1392121203
#[49] 1392121203 1392121204 1392121204

then the output is numeric.

Because R uses 32-bit integers (the largest representable value is 2147483647) and POSIXt timestamps are large numbers, adding two of them as integers overflows, and R returns NA. When pretty returns numeric (double) values, this is not a problem.
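The overflow can be reproduced in isolation. The following is a minimal sketch, assuming integer breaks equal to those in the first pretty() output above. The names breaks_int, breaks_num, mids_int and mids_num are illustrative and do not appear in hist.default; only nB and the midpoint formula are taken from it.

```r
# Integer breaks matching the first pretty() output (assumed for illustration)
breaks_int <- 1392121179:1392121204
nB <- length(breaks_int)

is.integer(breaks_int)      # the colon operator yields an integer vector here
.Machine$integer.max        # largest value an R integer can hold

# Adding two neighbouring breaks exceeds the integer range, so R returns NA.
# The "NAs produced by integer overflow" warning is suppressed here.
mids_int <- suppressWarnings(0.5 * (breaks_int[-1L] + breaks_int[-nB]))
all(is.na(mids_int))

# The same arithmetic on doubles does not overflow
breaks_num <- as.numeric(breaks_int)
mids_num <- 0.5 * (breaks_num[-1L] + breaks_num[-nB])
anyNA(mids_num)
```

With integer breaks every midpoint is NA, while the double version gives the expected half-second midpoints.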

See also: What is integer overflow in R and how can it happen?

In practice, all this means is that, if you print out the structure returned by hist, all of your mids values will be NA, but I don't think it actually affects the plotting of the histogram. Thus it is only a warning.

EDIT: pretty internally uses seq.int, which may return either an integer or a double vector.

In my environment, it does not generate any errors.

dataset <- read.csv("Time.txt", header = F)
time <- strptime(dataset[,1], format = "%a %b %d %H:%M:%S %z %Y")
hist(as.numeric(time), breaks = 100)

Perhaps if you just convert time to numeric as above, the error will disappear. It is then straightforward to change the x-axis of the histogram.
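One way the x-axis relabelling could look is sketched below. It assumes the three timestamps from the question, entered directly as UTC times instead of being read from Time.txt; the names secs, h, ticks and tick_labels are illustrative.

```r
# Sample data standing in for the parsed contents of Time.txt (assumption)
time <- as.POSIXct(c("2014-02-11 12:19:39",
                     "2014-02-11 12:19:56",
                     "2014-02-11 12:20:04"), tz = "UTC")

# Seconds since 1970-01-01 UTC, stored as double
secs <- as.numeric(time)

# Draw the histogram without the default numeric x-axis
h <- hist(secs, breaks = 100, xaxt = "n",
          main = "Histogram of timestamps", xlab = "Time (UTC)")

# Reuse the tick positions R would have chosen and label them as clock times
ticks <- axTicks(1)
tick_labels <- format(as.POSIXct(ticks, origin = "1970-01-01", tz = "UTC"),
                      "%H:%M:%S")
axis(1, at = ticks, labels = tick_labels)
```

The counts are unchanged by this; only the tick labels differ from the plain numeric plot.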

EDIT: ggplot2 should not run into this issue, and is simpler and more modern:

library(ggplot2)
ggplot(dataset) + geom_histogram(aes(x = V1), stat = "count", bins = 100)

Where V1 is the default name of the only column of dataset created by read.csv() when it is called with header = F, as above.
