Time Series Data in R

I have a basic understanding of R that mostly covers running regressions and summary statistics, so if there appear to be any gaps in my knowledge I would appreciate being pointed in the right direction.

I have time series data in a CSV file that is formatted as follows:

Facility ID, Utility Type, Account No, Unit Name, Date 1, Date 2, Date 3, Date 4

There will be multiple rows for a specific account number, each referencing a unique utility type and facility (i.e., one row for Unit Name = L, one row for Unit Name = USD). The value for a particular unit at each date is entered in the corresponding date column. I would like to write a script that re-exports the data so that each Date column doesn't contain entries for multiple units. I would then like to tell R that the Date columns represent monthly time series data points, and from there do various time series analyses.

I appreciate your help in telling me how to clean up this data.

As requested, sample data:

Facility ID, Facility Name, State, Utility Type, Supplier, Account No., Unit Name, 7/1/14, 8/1/14
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, USD, 42333, 41775
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, ton-hr, 244278, 238035
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, USD, 4860, 5890
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, M^3, 7639, 8895

Example output:

Facility ID, Facility Name, State, Utility Type, Supplier, Account No., Quantity Consumed, Unit of Measure, Utility Bill, Currency, Date
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, 244278, ton-hr, 42333, USD, 7/1/14
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, 238035, ton-hr, 41775, USD, 8/1/14
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, 7639, M^3, 4860, USD, 7/1/14
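4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, 8895, M^3, 5890, USD, 8/1/14

library(reshape2)

# read.csv() turns headers like "Facility ID" and "7/1/14" into the
# syntactic names Facility.ID and X7.1.14 (check.names = TRUE by default).
# strip.white = TRUE drops the spaces that follow the commas in the sample.
d <- read.csv("data.csv", strip.white = TRUE)

# Every column not listed in id.vars is treated as a date column and stacked.
d.molten <- melt(d,
  id.vars = c("Facility.ID", "Facility.Name", "State", "Utility.Type", "Supplier", "Account.No.", "Unit.Name"),
  variable.name = "Date"
)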

The melt function breaks up a "wide" format (with an arbitrary number of date columns) into a "long" format, where each row is a single observation. This is the preferred format for most things you'd do in R, especially for time series, at least when using packages from the "Hadleyverse" (now known as the tidyverse).
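To try the reshaping without a file on disk, here is a minimal sketch that feeds the sample rows from the question to read.csv() as inline text. The names csv_text, wide, long and id_cols are made up for this illustration, and the spaces after the commas have been removed from the sample.

```r
library(reshape2)

# Inline copy of the sample rows from the question (stands in for data.csv).
csv_text <- "Facility ID,Facility Name,State,Utility Type,Supplier,Account No.,Unit Name,7/1/14,8/1/14
4015,Palm Court Apts,CA,Chilled Water,PG&E,87993,USD,42333,41775
4015,Palm Court Apts,CA,Chilled Water,PG&E,87993,ton-hr,244278,238035
4044,18 Sawtelle,CA,Natural Gas,Chevron,17965,USD,4860,5890
4044,18 Sawtelle,CA,Natural Gas,Chevron,17965,M^3,7639,8895"

wide <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Columns that identify a row; all remaining columns hold one date each.
id_cols <- c("Facility.ID", "Facility.Name", "State", "Utility.Type",
             "Supplier", "Account.No.", "Unit.Name")

# Stack the date columns: one output row per (input row, date) pair.
long <- melt(wide, id.vars = id_cols, variable.name = "Date")

dim(wide)  # rows and columns before reshaping
dim(long)  # rows and columns after reshaping
head(long)
```

Each of the four input rows contributes one output row per date column, so adding more months to the CSV only makes the long table longer, never wider.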

But we're not done yet. Now you have the following structure:

Facility.ID    Facility.Name …  Date  value
       4015  Palm Court Apts X7.1.14  42333

We have to fix the dates, which are currently just "strings". read.csv prepended an "X" because column names cannot start with a number, and it replaced the slashes with dots because they are not allowed in column names either.

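# melt() returns Date as a factor, so convert to character before parsing.
d.molten$Date <- as.Date(as.character(d.molten$Date), format = "X%m.%d.%y")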
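
The renaming and the parsing can be checked in isolation with base R. This is a small sketch; raw_headers, fixed, parsed and parsed_raw are illustrative names only.

```r
# make.names() is the function read.csv() applies to headers
# when check.names = TRUE (the default).
raw_headers <- c("7/1/14", "8/1/14")
fixed <- make.names(raw_headers)
print(fixed)

# Parse the mangled names: a literal "X", then month.day.two-digit-year.
parsed <- as.Date(fixed, format = "X%m.%d.%y")
print(parsed)

# Alternative, assuming the file was read with check.names = FALSE so the
# headers keep their original form: parse the slashes directly.
parsed_raw <- as.Date(raw_headers, format = "%m/%d/%y")
identical(parsed, parsed_raw)
```

Either route gives the same Date values; reading with check.names = FALSE just means the ID columns keep their spaces too, so they would need backticks or quotes when referenced.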

Now your dates will look correct, and you have one row for each observation:

Facility.ID    Facility.Name …     Date  value
       4015  Palm Court Apts 2014-07-01  42333
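
The example output in the question also puts the quantity and the bill side by side. One possible way to get there from the long table, as a rough sketch: split the rows by unit and merge them back together. It assumes that bills are always in USD and that each account has exactly one quantity row and one bill row per date. The data frame below is a cut-down, hand-typed stand-in for d.molten, and long, bills, usage and tidy are illustrative names.

```r
# Cut-down stand-in for the melted data (only the columns needed here).
long <- data.frame(
  Account.No. = c(87993, 87993, 17965, 17965, 87993, 87993, 17965, 17965),
  Unit.Name   = c("USD", "ton-hr", "USD", "M^3", "USD", "ton-hr", "USD", "M^3"),
  Date        = as.Date(rep(c("2014-07-01", "2014-08-01"), each = 4)),
  value       = c(42333, 244278, 4860, 7639, 41775, 238035, 5890, 8895),
  stringsAsFactors = FALSE
)

# Assumption: a row is a bill if and only if its unit is USD.
is_bill <- long$Unit.Name == "USD"

# Bill rows: rename the generic columns to match the desired output.
bills <- long[is_bill, ]
names(bills)[names(bills) == "value"] <- "Utility.Bill"
names(bills)[names(bills) == "Unit.Name"] <- "Currency"

# Consumption rows: same idea with the quantity column names.
usage <- long[!is_bill, ]
names(usage)[names(usage) == "value"] <- "Quantity.Consumed"
names(usage)[names(usage) == "Unit.Name"] <- "Unit.of.Measure"

# Join the two halves on the columns that identify an observation.
tidy <- merge(usage, bills, by = c("Account.No.", "Date"))
print(tidy)

# Re-export (hypothetical file name):
# write.csv(tidy, "tidy.csv", row.names = FALSE)
```

With the full data, the `by` argument would list all of the ID columns (facility, state, utility type, supplier, account) plus Date, so that they are carried through rather than duplicated.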

And now we can easily plot time series:

library(ggplot2)
ggplot(d.molten, 
  aes(x = Date, y = value, color = Facility.Name)) + 
  geom_point()
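
For the part of the question about telling R that the values are monthly, base R's ts() may be enough. This is a sketch for a single account, assuming one value per month with no gaps; acct, first and bill_ts are illustrative names, and the numbers are the USD values for account 87993 from the sample.

```r
# One account's bills, deliberately out of order to show the sorting step.
acct <- data.frame(
  Date         = as.Date(c("2014-08-01", "2014-07-01")),
  Utility.Bill = c(41775, 42333)
)

# ts() only knows about position, so the rows must be in date order.
acct <- acct[order(acct$Date), ]

# Derive the start year and month from the earliest date.
first <- as.POSIXlt(acct$Date[1])
bill_ts <- ts(acct$Utility.Bill,
              start = c(first$year + 1900, first$mon + 1),
              frequency = 12)  # 12 observations per year = monthly

print(bill_ts)
frequency(bill_ts)
start(bill_ts)
```

A ts object carries no dates of its own, only a start point and a frequency, so a missing month would silently shift every later value. If the real data has gaps, they would need to be filled with NA first.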
