
Time Series Data in R

I have a basic understanding of R that mostly entails the ability to run regressions and summary statistics, so if there appear to be any gaps in my knowledge I would appreciate being pointed in the right direction.

I have time series data in CSV that is formatted as follows:

Facility ID, Utility Type, Account No, Unit Name, Date 1, Date 2, Date 3, Date 4

There will be multiple rows for a given account number, each referencing a unique utility type and facility (i.e., one row for Unit Name = L, one row for Unit Name = USD). The values for a particular unit at each date are entered in the corresponding "date" column. I would like to write a script that re-exports the data so that each Date column doesn't contain entries for multiple units. I would then like to tell R that the Date columns represent monthly time series data points, and from there do various time series analyses.

I appreciate your help in telling me how to clean up this data.

As requested, sample data:

Facility ID, Facility Name, State, Utility Type, Supplier, Account No., Unit Name, 7/1/14, 8/1/14
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, USD, 42333, 41775
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, ton-hr, 244278, 238035
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, USD, 4860, 5890
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, M^3, 7639, 8895

Example output:

Facility ID, Facility Name, State, Utility Type, Supplier, Account No., Quantity Consumed, Unit of Measure, Utility Bill, Currency, Date
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, 244278, ton-hr, 42333, USD, 7/1/14
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, 238035, ton-hr, 41775, USD, 8/1/14
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, 7639, M^3, 4860, USD, 7/1/14
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, 8895, M^3, 5890, USD, 8/1/14
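
library(reshape2)

# Read the wide CSV; read.csv() turns the headers into syntactic names
# (for example "Facility ID" becomes "Facility.ID").
d <- read.csv("data.csv")

# Keep the identifying columns fixed and stack every date column
# into a pair of columns: Date (the old column name) and value.
d.molten <- melt(d,
  id.vars = c("Facility.ID", "Facility.Name", "State", "Utility.Type", "Supplier", "Account.No.", "Unit.Name"),
  variable.name = "Date"
)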

The melt function reshapes a "wide" format (with an arbitrary number of columns) into a "long" format, where each row is one observation. This is the preferred format for most things you'd do in R, especially time series, at least when using packages from the "Hadleyverse" (now known as the tidyverse).

But we're not done yet. Now you have the following structure:

Facility.ID    Facility.Name …  Date  value
       4015  Palm Court Apts X7.1.14  42333

We have to fix the dates, which are currently just "strings". They had an "X" prepended and their slashes replaced with dots, since R column names cannot start with a number and cannot contain characters such as slashes or spaces.

d.molten$Date <- as.Date(d.molten$Date, format = "X%m.%d.%y")
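
To see where that format string comes from, here is a small base R sketch. It assumes the CSV headers are exactly the date strings from the sample data; the variable names `headers`, `mangled` and `dates` are only illustrative.

```r
# Header strings as they appear in the sample CSV.
headers <- c("7/1/14", "8/1/14")

# read.csv() passes column names through make.names() by default
# (check.names = TRUE), which is where the "X" and the dots come from.
mangled <- make.names(headers)
print(mangled)

# The format string mirrors the mangled name: a literal "X",
# then month, day and two-digit year separated by dots.
dates <- as.Date(mangled, format = "X%m.%d.%y")
print(dates)
```

If you would rather keep the headers as written, read.csv(..., check.names = FALSE) leaves them untouched, and the matching format string would then be "%m/%d/%y".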

Now your dates will look correct, and you have one row for each observation:

Facility.ID    Facility.Name …     Date  value
       4015  Palm Court Apts 2014-07-01  42333
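
Note that the long table still holds the USD row and the consumption row for the same account and date as two separate observations. One possible way to get closer to the layout in the question's example output is to split the rows by unit and merge them back together. The sketch below uses base R only and a hand-built data frame (`long`) that stands in for the melted data; it assumes that a Unit.Name of "USD" always marks the bill and anything else marks consumption. The names `long`, `bills`, `usage` and `tidy` are made up for illustration.

```r
# Long-format rows as they would look after melt() and the date fix,
# reduced to the columns needed here (values copied from the sample data).
long <- data.frame(
  Account.No. = c(87993, 87993, 17965, 17965, 87993, 87993, 17965, 17965),
  Unit.Name   = c("USD", "ton-hr", "USD", "M^3", "USD", "ton-hr", "USD", "M^3"),
  Date        = as.Date(c(rep("2014-07-01", 4), rep("2014-08-01", 4))),
  value       = c(42333, 244278, 4860, 7639, 41775, 238035, 5890, 8895),
  stringsAsFactors = FALSE
)

# Assumption: a row whose unit is "USD" is the bill; anything else is consumption.
is_bill <- long$Unit.Name == "USD"

bills <- long[is_bill, c("Account.No.", "Date", "value", "Unit.Name")]
names(bills) <- c("Account.No.", "Date", "Utility.Bill", "Currency")

usage <- long[!is_bill, c("Account.No.", "Date", "value", "Unit.Name")]
names(usage) <- c("Account.No.", "Date", "Quantity.Consumed", "Unit.of.Measure")

# One row per account and date, with quantity and bill side by side.
tidy <- merge(usage, bills, by = c("Account.No.", "Date"))
print(tidy)
```

With the full data you would probably want to carry the other identifying columns (Facility.ID, Facility.Name and so on) through both subsets and include them in the `by` argument. The result can be written back out with write.csv(tidy, "tidy.csv", row.names = FALSE).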

And now we can easily plot time series:

library(ggplot2)
ggplot(d.molten, 
  aes(x = Date, y = value, color = Facility.Name)) + 
  geom_point()
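
As for telling R that the values are monthly observations, base R's ts() is one option. This is only a sketch: `bill` and `bill_ts` are illustrative names, and the two values are the July and August 2014 bills for account 87993 from the sample data, so a real series would be much longer.

```r
# Monthly bill for one account, in date order.
bill <- c(42333, 41775)

# start = c(2014, 7) means July 2014; frequency = 12 marks the data as monthly.
bill_ts <- ts(bill, start = c(2014, 7), frequency = 12)
print(bill_ts)

# Check what R recorded about the series.
print(frequency(bill_ts))
print(start(bill_ts))
```

A ts object assumes evenly spaced observations with no gaps, so the values need to be sorted by date and complete for every month. Some functions, such as decompose(), also need at least two full years of data before they will run.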
