I have a basic understanding of R that mostly amounts to running regressions and summary statistics, so if there are gaps in my knowledge I would appreciate being pointed in the right direction.
I have time series data in CSV that is formatted as follows:
Facility ID, Utility Type, Account No, Unit Name, Date 1, Date 2, Date 3, Date 4
There will be multiple rows for a specific account number referencing a unique utility type and facility (i.e., one row for Unit Name = L, one row for Unit Name = USD). The values for a particular unit at each date are entered in the corresponding "date" column. I would like to write a script that re-exports the data so that no Date column contains entries for multiple units. I would then like to tell R that the Date columns represent monthly time series data points, and from there do various time series analyses.
I appreciate your help in telling me how to clean up this data.
As requested, sample data:
Facility ID, Facility Name, State, Utility Type, Supplier, Account No., Unit Name, 7/1/14, 8/1/14
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, USD, 42333, 41775
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, ton-hr, 244278, 238035
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, USD, 4860, 5890
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, M^3, 7639, 8895
Example output:
Facility ID, Facility Name, State, Utility Type, Supplier, Account No., Quantity Consumed, Unit of Measure, Utility Bill, Currency, Date
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, 244278, ton-hr, 42333, USD, 7/1/14
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, 238035, ton-hr, 41775, USD, 8/1/14
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, 7639, M^3, 4860, USD, 7/1/14
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, 8895, M^3, 5890, USD, 8/1/14
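library(reshape2)
# Read the wide CSV; R rewrites the date headers into valid column names
d <- read.csv("data.csv")
# Keep the descriptive columns as identifiers and stack the date columns
d.molten <- melt(d,
  id.vars = c("Facility.ID", "Facility.Name", "State", "Utility.Type",
              "Supplier", "Account.No.", "Unit.Name"),
  variable.name = "Date"
)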
The melt function converts a "wide" format (with an arbitrary number of date columns) to a "long" format, where each row is a single observation. This is the preferred format for most things you'd do in R, especially time series, at least when using packages from the "Hadleyverse".
But we're not done yet. Now you have the following structure:
Facility.ID  Facility.Name    …  Date     value
4015         Palm Court Apts  …  X7.1.14  42333
We have to fix the dates, which are currently just strings. When the file was read, R prepended an "X" and replaced the slashes with dots, because syntactically valid column names cannot start with a digit or contain characters such as "/".
d.molten$Date <- as.Date(as.character(d.molten$Date), "X%m.%d.%y")
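As an aside on the renaming, here is a small sketch of what read.csv does to the date headers, and of one possible alternative: passing check.names = FALSE keeps the original headers, so the dates can be parsed with a plain "%m/%d/%y" format. The inline CSV text is a shortened stand-in for the sample data, not the real file.

```r
# Shortened stand-in for the sample file (assumption: no spaces after commas)
csv <- "Account No.,Unit Name,7/1/14,8/1/14
87993,USD,42333,41775"

# Default behaviour: headers are rewritten into syntactically valid names
default <- read.csv(text = csv)
print(names(default))

# check.names = FALSE keeps the headers exactly as written in the file
raw <- read.csv(text = csv, check.names = FALSE)
print(names(raw))

# Either form can be parsed, as long as the format string matches it
dates.default <- as.Date(names(default)[3:4], "X%m.%d.%y")
dates.raw     <- as.Date(names(raw)[3:4], "%m/%d/%y")
print(identical(dates.default, dates.raw))
```

If you keep the original headers, the id.vars passed to melt would have to use the unmodified names as well (for example "Account No." instead of "Account.No.").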
Now your dates will look correct, and you have one row for each observation:
Facility.ID  Facility.Name    …  Date        value
4015         Palm Court Apts  …  2014-07-01  42333
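The long table still mixes cost rows (USD) and consumption rows (ton-hr, M^3) in the same value column, whereas the example output in the question puts them side by side. One way to get there, sketched in base R, is to split the rows by unit and merge them back on the identifying columns. This assumes that cost rows can be recognised by Unit.Name == "USD" and that each account has exactly one cost row and one consumption row per date. The data frame below is a hand-built stand-in for d.molten with only some of the id columns, and the output column names are illustrative.

```r
# Stand-in for d.molten, built from the sample rows in the question
d.molten <- data.frame(
  Facility.ID = c(4015, 4015, 4044, 4044, 4015, 4015, 4044, 4044),
  Account.No. = c(87993, 87993, 17965, 17965, 87993, 87993, 17965, 17965),
  Unit.Name   = c("USD", "ton-hr", "USD", "M^3", "USD", "ton-hr", "USD", "M^3"),
  Date        = as.Date(rep(c("2014-07-01", "2014-08-01"), each = 4)),
  value       = c(42333, 244278, 4860, 7639, 41775, 238035, 5890, 8895),
  stringsAsFactors = FALSE
)

# Split cost rows from consumption rows (assumes currency is always USD)
is.cost <- d.molten$Unit.Name == "USD"
cost  <- d.molten[is.cost, ]
usage <- d.molten[!is.cost, ]

# Rename the unit and value columns so they do not clash after merging
names(cost)[names(cost) == "Unit.Name"]   <- "Currency"
names(cost)[names(cost) == "value"]       <- "Utility.Bill"
names(usage)[names(usage) == "Unit.Name"] <- "Unit.of.Measure"
names(usage)[names(usage) == "value"]     <- "Quantity.Consumed"

# Join the two halves back together: one row per account and date
d.tidy <- merge(usage, cost, by = c("Facility.ID", "Account.No.", "Date"))
print(d.tidy)

# Re-export, if desired (the file name is hypothetical):
# write.csv(d.tidy, "utilities_long.csv", row.names = FALSE)
```

With the full data you would merge on every id column except Unit.Name. If the real file has spaces after the commas, as in the pasted sample, the comparison with "USD" may fail unless the file is read with strip.white = TRUE.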
And now we can easily plot time series:
library(ggplot2)
# Plot every observation over time, colored by facility
ggplot(d.molten, aes(x = Date, y = value, color = Facility.Name)) +
  geom_point()
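The question also asks how to tell R that the values are monthly. For functions that expect a classic time series object, one option is base R's ts() with frequency = 12. The sketch below does this for a single account and a single unit, using only the two consumption values from the sample data. It assumes the months are consecutive with no gaps, since ts() stores only a start point and a frequency, not the individual dates.

```r
# Consumption for one account (ton-hr values from the sample data)
acct <- data.frame(
  Date  = as.Date(c("2014-07-01", "2014-08-01")),
  value = c(244278, 238035)
)

# ts() relies on the order of the values, so sort by date first
acct <- acct[order(acct$Date), ]

# Derive the starting year and month from the first date
first <- as.POSIXlt(acct$Date[1])
usage.ts <- ts(acct$value,
               start = c(first$year + 1900, first$mon + 1),
               frequency = 12)
print(usage.ts)
```

You would build one such series per account and unit. If some months are missing, fill them with NA before calling ts(), or use a date-indexed class instead.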