Hmisc - cut2 - create factors from times

I'm trying to use the cut2() function from the Hmisc package to create a factor based on time periods.

Here's some code:

library(Hmisc)

i.time <- as.POSIXct("2013-07-16 13:55:14 CEST")
f.time <- i.time+as.difftime(1, units="hours")

data.points <- seq(from=i.time, to=f.time, by="1 sec")
cut.points <- seq(from=i.time, to=f.time, by="60 sec")



intervals <- cut2(x=data.points, cuts=cut.points, minmax=TRUE)

I expected intervals to be created such that each point in data.points is placed in an interval of time. But there are some NA values at the end:

> tail(intervals, 1)
[1] <NA>
60 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ... [2013-07-16 14:54:14,2013-07-16 14:55:14]

I was expecting that the option minmax=TRUE would make sure that the cuts included all the values in data.points.

Can anyone clarify what's going on here? How can I use the cut2 function to generate a factor that includes all the values in the data?

The reason I use cut2 in preference to cut is that its default for "right" is the way I expect it to work (left-closed intervals).

Looking at the code, I see that when 'cuts' is present in the argument list, the cut function is used with a shifted set of cuts, which has the effect of making the intervals left-closed. The code then relabels the factor to change the "("s to "["s, but it does not use include.lowest = TRUE. This has the effect of turning the last value into <NA>. Frankly, I see this as a bug. After looking at this more closely, I see that cut2's help page does not promise to handle either Date or date-time objects, so "bug" is too strong. It completely fails with Date objects, and it appears to be only an accident that it is almost correct with POSIXct objects. (This implementation is somewhat surprising to me, in that I always assumed it was just using cut( ... , right=FALSE, include.lowest=TRUE).)

You can alter the code, and one idea I had was to extend the range back to the right end point in the original data by changing this line:

r <- range(x,  na.rm = TRUE)

To this line:

r <- range(c(x,max(x)+min(diff(x.unique))/2),  na.rm = TRUE)

It's not exactly the result I expected, since you get a new category at the right end because the penultimate interval is still open on the right (cut3 being the modified copy of cut2):

intervals <- cut3(x=data.points, cuts=cut.points, minmax=TRUE)
> tail(intervals, 1)
[1] 2013-07-16 14:55:14
61 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...
> tail(intervals, 2)
[1] [2013-07-16 14:54:14,2013-07-16 14:55:14) 2013-07-16 14:55:14                      
61 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...

A different idea gives a more satisfactory result. Change only this line:

y <- cut(x, k2)

To this:

y <- cut(x, k2, include.lowest=TRUE)

This gives the expected final interval, closed on both the left and the right, and no NA:

> tail(intervals, 2)
[1] [2013-07-16 14:54:14,2013-07-16 14:55:14] [2013-07-16 14:54:14,2013-07-16 14:55:14]
60 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...

Note: include.lowest=TRUE with right=FALSE will actually become include.highest. And I'm scratching my head about why I am getting the desired behavior in this case when I did not also need to do something with the 'right' parameter. I sent Frank Harrell a message, and he is willing to consider revisions to the code to handle other cases. I'm working on that.
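The first part of that note can be checked on plain numbers with base R alone. The sketch below is only illustrative: the vectors x and brk are made up for the example, and it calls cut directly rather than cut2.

```r
# Hypothetical toy data: three values that sit exactly on the breaks.
x   <- c(0, 5, 10)
brk <- c(0, 5, 10)

# Left-closed intervals without include.lowest: the value equal to the
# highest break falls outside every interval and becomes NA.
a <- cut(x, brk, right = FALSE)

# With include.lowest = TRUE and right = FALSE, the *last* interval is
# closed on the right, so the highest value is kept.
b <- cut(x, brk, right = FALSE, include.lowest = TRUE)

print(levels(a))
print(levels(b))
print(is.na(a))
print(is.na(b))
```

With right=FALSE, the argument name is misleading: it is the highest break, not the lowest, that gets included.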

Why this is an issue: The labeling for cut.POSIXt and cut.Date differs from the labeling of cut.numeric (actually cut.default) results. The strategy of the former two is to report just the beginnings of the intervals, whereas the labeling from cut.numeric includes "[" and ")" and the ends of the intervals. Compare the output from these:

levels( cut(0+1:100, 3) )
levels( cut(Sys.time()+1:100, 3) )
levels( cut(Sys.Date()+1:100, 3) )

From ??cut2:

minmax: if cuts is specified but min(x) < min(cuts) or max(x) > max(cuts), augments cuts to include min and max x

Checking your arguments:

x=data.points
cuts=cut.points
r <- range(x, na.rm = TRUE)
(r[1] < min(cuts)) | (r[2] > max(cuts))
## [1] FALSE, so there is no need to add min and max

So here setting minmax doesn't change the result. But here is a result using cut with include.lowest=TRUE:

res <- cut(x=data.points, breaks=cut.points, include.lowest=TRUE)
table(is.na(res))
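For readers without Hmisc installed, the same comparison can be reproduced with base R only. This is a minimal sketch under a few assumptions: the time zone is fixed to "UTC" purely so the example runs the same everywhere, and the names res_open and res_closed are introduced here for illustration. It may also explain why nothing had to be done with 'right': the date-time method of cut documents right = FALSE as its default, so the intervals are already left-closed.

```r
# Rebuild the question's data (tz = "UTC" is an assumption for reproducibility).
i.time <- as.POSIXct("2013-07-16 13:55:14", tz = "UTC")
f.time <- i.time + as.difftime(1, units = "hours")

data.points <- seq(from = i.time, to = f.time, by = "1 sec")   # 3601 points
cut.points  <- seq(from = i.time, to = f.time, by = "60 sec")  # 61 breaks

# Default call: the last data point equals the last break and is dropped.
res_open <- cut(data.points, breaks = cut.points)

# include.lowest = TRUE closes the final interval on the right.
res_closed <- cut(data.points, breaks = cut.points, include.lowest = TRUE)

print(sum(is.na(res_open)))    # number of points left without an interval
print(sum(is.na(res_closed)))
print(nlevels(res_closed))
```

Note that the levels produced this way are labeled by the start of each interval only, not with the "[start,end)" style that cut2 produces.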
