Aggregating, restructuring hourly time series data in R

I have a year's worth of hourly data in a data frame in R:

> str(df.MHwind_load)   # compactly displays structure of data frame
'data.frame':   8760 obs. of  6 variables:
 $ Date         : Factor w/ 365 levels "2010-04-01","2010-04-02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Time..HRs.   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Hour.of.Year : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Wind.MW      : int  375 492 483 476 486 512 421 396 456 453 ...
 $ MSEDCL.Demand: int  13293 13140 12806 12891 13113 13802 14186 14104 14117 14462 ...
 $ Net.Load     : int  12918 12648 12323 12415 12627 13290 13765 13708 13661 14009 ...

While preserving the hourly structure, I would like to know how to extract

  1. a particular month/group of months
  2. the first day/first week etc. of each month
  3. all Mondays, all Tuesdays etc. of the year

I have tried using "cut" without success and, after looking online, think that "lubridate" might be able to do this, but I haven't found suitable examples. I'd greatly appreciate help on this issue.

Edit: a sample of the data in the data frame is below:

      Date        Hour.of.Year  Wind.MW  datetime
1     2010-04-01  1             375      2010-04-01 00:00:00
2     2010-04-01  2             492      2010-04-01 01:00:00
3     2010-04-01  3             483      2010-04-01 02:00:00
4     2010-04-01  4             476      2010-04-01 03:00:00
5     2010-04-01  5             486      2010-04-01 04:00:00
6     2010-04-01  6             512      2010-04-01 05:00:00
7     2010-04-01  7             421      2010-04-01 06:00:00
8     2010-04-01  8             396      2010-04-01 07:00:00
9     2010-04-01  9             456      2010-04-01 08:00:00
10    2010-04-01  10            453      2010-04-01 09:00:00
..    ..          ...           ...      ..........
8758  2011-03-31  8758          302      2011-03-31 21:00:00
8759  2011-03-31  8759          378      2011-03-31 22:00:00
8760  2011-03-31  8760          356      2011-03-31 23:00:00

EDIT: Additional time-based operations I would like to perform on the same dataset:

  1. Perform hour-by-hour averaging for all data points, i.e. the average of all values in the first hour of each day in the year. The output will be an "hourly profile" of the entire year (24 time points).
  2. Perform the same for each week and each month, i.e. obtain 52 and 12 hourly profiles respectively.
  3. Compute seasonal averages, for example for June to September.

Convert the date to a format that lubridate understands and then use the functions month, mday and wday respectively.

Suppose you have a data.frame with the time stored in column Date; then the answers to your questions would be:

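 ### dummy data.frame
 library(lubridate)
 df <- data.frame(Date = as.Date(c("2012-01-01","2012-02-15","2012-03-01","2012-04-01")), a = 1:4)
 ## 1. Select rows for a particular month
 subset(df, month(Date) == 1)

 ## 2a. Select the first day of each month
 subset(df, mday(Date) == 1)

 ## 2b. Select the first week of each month
 ## get the week numbers which contain the first day of the month
 ## (week() counts 7-day blocks from 1 January, so such a week can also
 ##  include the last days of the previous month; mday(Date) <= 7 avoids that)
 wkd <- subset(week(df$Date), mday(df$Date) == 1)
 ## select the rows whose week number is in that set
 subset(df, week(Date) %in% wkd)

 ## 3. Select all Mondays
 ## (by default wday() numbers the days from Sunday = 1, so Monday is 2)
 subset(df, wday(Date) == 2)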

  1. First switch to a Date representation: as.Date(df.MHwind_load$Date)
  2. Then call weekdays on the date vector to get a new character vector labelled with the day of the week
  3. Then call months on the date vector to get a new character vector labelled with the name of the month
  4. Optionally create a years variable (see below).

Now subset the data frame using the relevant combination of these. Step 2 answers your task 3. Steps 3 and 4 get you to task 1. Task 2 might require a line or two of R. Or just select rows corresponding to, say, all the Mondays in a month and call unique, or its alter ego duplicated, on the results.

To get you going...

newdf <- df.MHwind_load ## build an augmented data set
newdf$d <- as.Date(newdf$Date)
newdf$month <- months(newdf$d)   ## month names (locale-dependent)
newdf$day <- weekdays(newdf$d)   ## weekday names (locale-dependent)

## base R has no years function.  Here's one
years <- function(x){ format(as.Date(x), format = "%Y") }

newdf$year <- years(newdf$d)

# get observations from January to March of every year
subset(newdf, month %in% c('January', 'February', 'March'))

# get all Monday observations
subset(newdf, day == 'Monday')

# get all Mondays in 2010
subset(newdf, day == 'Monday' & year == '2010')

# slightly fancier: _first_ Monday of each month
# flag the first row of each weekday within each month of each year
first.of.month <- !duplicated(cbind(newdf$year, newdf$month, newdf$day))
# dates of the first Mondays
first.mondays <- newdf$d[first.of.month & newdf$day == 'Monday']
# now pull out every hourly row that falls on those dates
subset(newdf, d %in% first.mondays)
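
Since the question asks to keep the hourly structure, here is a minimal base R sketch of task 2 (first day and first week of each month) that returns every hourly row on the selected days. The data frame `hourly` and its placeholder values are assumptions made for illustration, not the asker's data, and "first week" is taken to mean days 1 to 7 of the month.

```r
# Synthetic hourly data covering the same span as the question (assumed values)
datetime <- seq(as.POSIXct("2010-04-01 00:00:00", tz = "UTC"),
                as.POSIXct("2011-03-31 23:00:00", tz = "UTC"), by = "hour")
hourly <- data.frame(datetime = datetime,
                     d = as.Date(datetime, tz = "UTC"),
                     Wind.MW = seq_along(datetime))   # placeholder, not real data

# Day of month as an integer, extracted with format()
hourly$mday <- as.integer(format(hourly$d, "%d"))

first.day  <- subset(hourly, mday == 1)   # all 24 hours of the 1st of each month
first.week <- subset(hourly, mday <= 7)   # all hours of days 1 to 7 of each month
```

Because the filter is applied to a per-row day-of-month column rather than to the first matching row, each selected day keeps all of its hourly observations.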

Since you're not asking about the time (hourly) part of your data, it is best to store your dates as a Date object. Otherwise, you might be interested in chron, which also has some convenience functions like those you'll see below.

With respect to Conjugate Prior's answer, you should store your date data as a Date object. Since your data already follows the default format ('yyyy-mm-dd'), you can just call as.Date on it. Otherwise, you would have to specify your string format. I would also use as.character on your factor first, to make sure you don't get errors in the conversion. I have run into problems converting factors to Dates for that reason (possibly corrected in the current version).

df.MHwind_load <- transform(df.MHwind_load, Date = as.Date(as.character(Date)))
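
As a small illustration of the "specify your string format" case, the sketch below converts a factor of dates written in day/month/year order. The vector `raw` is a hypothetical example, not taken from the question.

```r
# Hypothetical date strings in day/month/year order, stored as a factor
raw <- factor(c("01/04/2010", "02/04/2010", "31/03/2011"))

# Convert via character and state the layout explicitly
dates <- as.Date(as.character(raw), format = "%d/%m/%Y")

# Note: as.numeric(raw) would return the factor's level codes, not dates,
# which is one reason to go through as.character() first
format(dates, "%Y-%m-%d")
```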

Now you would do well to create wrapper functions that extract the information you desire. You could use transform like I did above to simply add columns that represent months, days, years, etc., and then subset on them logically. Alternatively, you might do something like this:

getMonth <- function(x, mo) {  # Assumes x is a Date vector within a single year
  isMonth <- month(x) %in% mo  # month() comes from lubridate; TRUE where the month number matches
  return(x[which(isMonth)])    # Return the dates that fall in the matching months
}  # end function

Or, in short form:

getMonth <- function(x, mo) x[month(x) %in% mo]

This is just a tradeoff between storing that information (transforming the data frame) and having it computed when needed (using accessor functions).
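
The "store the information" side of that tradeoff could look roughly like the following base R sketch. The small data frame and the column names `month`, `year` and `wday` are illustrative assumptions; numeric codes are used instead of names so the result does not depend on the locale.

```r
# Small hypothetical frame with a Date column (all of April 2010)
df <- data.frame(Date = as.Date("2010-04-01") + 0:29, value = 1:30)

# Add the date parts once, as extra columns
df <- transform(df,
                month = as.integer(format(Date, "%m")),  # 1 to 12
                year  = as.integer(format(Date, "%Y")),
                wday  = as.POSIXlt(Date)$wday)           # 0 = Sunday, 1 = Monday

# Then subset on them logically
mondays <- subset(df, wday == 1)
april   <- subset(df, month == 4 & year == 2010)
```

The extra columns cost a little memory but make every later subset a one-line logical expression.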

A more complicated case is when you need, say, the first day of a month. This is not especially difficult, though. Below is a function that returns all entries that fall on that first day, although it is simpler to subset a sorted vector of dates for a given month and take the first one.

getFirstDay <- function(x, mo) {
  isMonth <- months(x) %in% mo  # months() returns month names, so mo is e.g. "April"
  x <- sort(x[isMonth])  # Look at only those in the desired month.
                         # Sort them by date. We only want the first day.
  nFirsts <- rle(as.numeric(x))$lengths[1]  # Number of entries on the first day
  return(x[seq(nFirsts)])
}  # end function

The easier alternative would be:

getFirstDayOnly <- function(x, mo) {sort(x[months(x) %in% mo])[1]}

I haven't tested these, as you didn't provide any data samples, but this is the sort of approach that can help you get the information you desire. It is up to you to figure out how to put these into your workflow. For instance, say you want to get the first day of each month of a given year (assuming we're only looking at one year; you can create wrappers or pre-process your vector down to a single year beforehand).

# Return a vector of first days for each month
df <- transform(df, date = as.Date(as.character(date)))
firsts <- sapply(unique(months(df$date)),  # Iterate through months in Dates
                 function(month) {getFirstDayOnly(df$date, month)})
as.Date(firsts, origin = "1970-01-01")  # sapply drops the Date class, so restore it

The above could also be designed as a separate convenience function that uses the other accessor function. In this way, you create a series of direct but concise functions for getting pieces of the information you want. Then you simply pull them together to create very simple, easy-to-interpret functions that you can use in your scripts to get precisely what you want in an efficient manner.

You should be able to use the above examples to figure out how to prototype other wrappers for accessing the date information you require. If you need help with those, feel free to ask in a comment.
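
For the hourly profiles described in the question's edit, one possible base R sketch uses aggregate on an hour-of-day column. The data frame `hourly`, its placeholder Wind.MW values and the column names `hour` and `month` are assumptions for illustration only.

```r
# Synthetic hourly timestamps covering the same span as the question
datetime <- seq(as.POSIXct("2010-04-01 00:00:00", tz = "UTC"),
                as.POSIXct("2011-03-31 23:00:00", tz = "UTC"), by = "hour")
hourly <- data.frame(datetime = datetime)
hourly$hour    <- as.integer(format(datetime, "%H"))  # 0 to 23
hourly$month   <- as.integer(format(datetime, "%m"))  # 1 to 12
hourly$Wind.MW <- 400 + 10 * hourly$hour              # placeholder values, not real data

# One 24-point profile for the whole year
year.profile <- aggregate(Wind.MW ~ hour, data = hourly, FUN = mean)

# Twelve 24-point profiles, one per month
month.profile <- aggregate(Wind.MW ~ hour + month, data = hourly, FUN = mean)

# Seasonal 24-point profile for June to September
season.profile <- aggregate(Wind.MW ~ hour,
                            data = subset(hourly, month %in% 6:9), FUN = mean)
```

Weekly profiles would follow the same pattern with a week-number column as the second grouping variable.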
