Time series: What's the most efficient way to write code for subsets?

Question

I have two data frames:

df1

time x   y   state
...  ... ... CA
...  ... ... MA
...  ... ... TX
...  ... ... MA
...  ... ... CA
...  ... ... IL

df2

time x   y   state
...  ... ... MA
...  ... ... NY
...  ... ... MA
...  ... ... TX
...  ... ... CA
...  ... ... CA

I then have about 50 lines of code that aggregate the monthly values, rename columns, match the data against another list, and finally merge df1 and df2 into one data frame. So far, this code does not take state into account.

However, I need to create subsets of the merged data frame for several US states. Is there a more elegant way than copying and pasting the code used for df1 and df2 and replacing df1 and df2 with df1_CA, df2_MA, etc.?

Loop? Panel data?

Answer 1

One option could be to use the data.table package for the grouped analyses.

# load the package
library(data.table)

# convert your data.frames to data.tables
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

# e.g. sum all y values within each state
dt1[, sum(y), by=state]
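Since the question mentions monthly aggregation, the same grouping idea can be extended to group by state and month at once. This is only a sketch: the toy values, the y_sum column name, and the assumption that time is a Date column are illustrative and not taken from the original data.

```r
library(data.table)

# Toy data standing in for df1 (made-up values; assumes `time` is a Date)
dt <- data.table(
  time  = as.Date(c("2017-01-05", "2017-01-20", "2017-02-03")),
  y     = c(10, 20, 30),
  state = c("CA", "CA", "MA")
)

# Sum y within each state and calendar month in a single grouped call
monthly <- dt[, .(y_sum = sum(y)),
              by = .(state, month = format(time, "%Y-%m"))]
print(monthly)
```

Grouping by state in this way avoids creating separate per-state objects such as df1_CA in the first place.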

If you don't want to replace the df name in your code, you could define a function:

# define the function
accumulate <- function(df){
  dt <- as.data.table(df)
  return(dt[, sum(y), by=state])
}

# and call it 
accumulate(df1)
accumulate(df2)

Instead of a for loop over all your data.frames, you could use one of the apply functions, which iterate over data structures such as lists. Note that lapply expects the function itself, not a call to it:

# alternatively define a list of data.frames and then iterate over the list
my.dfs <- list(df1, df2)
lapply(my.dfs, accumulate)
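The same pattern can be applied to the per-state subsets asked about in the question: wrap the existing pipeline in a function and call it once per state. The sketch below uses base R only. The names process_pair and run_for_state are hypothetical, the toy data is made up, and the body of process_pair is just a stand-in for the roughly 50 lines of real code.

```r
# Toy data standing in for df1 and df2 (values are made up for illustration)
df1 <- data.frame(
  time  = as.Date(c("2017-01-05", "2017-01-20", "2017-02-03", "2017-02-10")),
  x     = c(1, 2, 3, 4),
  y     = c(10, 20, 30, 40),
  state = c("CA", "CA", "MA", "CA"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  time  = as.Date(c("2017-01-07", "2017-02-11", "2017-02-15")),
  x     = c(5, 6, 7),
  y     = c(50, 60, 70),
  state = c("CA", "MA", "CA"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the existing pipeline: monthly sums, then a merge
process_pair <- function(a, b) {
  monthly <- function(d) {
    d$month <- format(d$time, "%Y-%m")
    aggregate(cbind(x, y) ~ month, data = d, FUN = sum)
  }
  merge(monthly(a), monthly(b), by = "month", suffixes = c("_1", "_2"))
}

# Subset both data frames to one state and run the same pipeline on them
run_for_state <- function(st) {
  process_pair(df1[df1$state == st, ], df2[df2$state == st, ])
}

# One merged result per state, stored in a named list
states <- c("CA", "MA")
results <- lapply(states, run_for_state)
names(results) <- states
```

Each result is then available by name, e.g. results$CA. In this sketch a state that is missing from one of the data frames would make aggregate fail on the empty subset, so real code would need to handle that case.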
