R - preprocessing of time series data

I have the following data structure, with stocks S, each having features f:

year S1_f1  S1_f2 S2_f1 S2_f2 S3_f1 S3_f2 Sn_f1 Sn_f2
2011   0.1    0.4  0.12  0.42   0.2   0.5     n     n
2012   0.4    0.7  0.42  0.72   0.5   0.8     n     n
2013   0.7    0.9  0.72   0.5   0.8   0.9     n     n
n        n      n     n     n     n     n     n     n

My original data frame has 10 observations but 50k+ predictors, so I want a better balance between the number of observations and the number of predictors.

Hence, I want to obtain the following data frame:

year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2 Sn_f1 Sn_f2
2011   0.1   0.4     0     0     0     0     0     0
2012   0.4   0.7     0     0     0     0     0     0
2013   0.7   0.9     0     0     0     0     0     0
2011     0     0  0.12  0.42     0     0     0     0
2012     0     0  0.42  0.72     0     0     0     0
2013     0     0  0.72   0.5     0     0     0     0
2011     0     0     0     0   0.2   0.5     0     0
2012     0     0     0     0   0.5   0.8     0     0
2013     0     0     0     0   0.8   0.9     0     0
n        0     0     0     0     0     0     n     n

...and so on (example values).

I want to artificially multiply my timestamps via this approach.

Is there an elegant way to do this?

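You can get close to the layout you want with the following code, where s is the input data frame given under "Input:". Note that the result keeps a helper column grp, and cells outside each stock's block are NA rather than 0:

library(data.table)
# melt to long format, tag each row with its stock prefix (e.g. "S1"),
# then cast back to wide with one row per year and stock
dcast(
  melt(setDT(s), id.vars="year")[, grp:=gsub("_.*$","",variable)],
  year+grp~variable,
  value.var="value"
  )[order(grp,year)]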

Output:

    year    grp S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
   <int> <char> <num> <num> <num> <num> <num> <num>
1:  2011     S1   0.1   0.4    NA    NA    NA    NA
2:  2012     S1   0.4   0.7    NA    NA    NA    NA
3:  2013     S1   0.7   0.9    NA    NA    NA    NA
4:  2011     S2    NA    NA  0.12  0.42    NA    NA
5:  2012     S2    NA    NA  0.42  0.72    NA    NA
6:  2013     S2    NA    NA  0.72  0.50    NA    NA
7:  2011     S3    NA    NA    NA    NA   0.2   0.5
8:  2012     S3    NA    NA    NA    NA   0.5   0.8
9:  2013     S3    NA    NA    NA    NA   0.8   0.9

Input:

structure(list(year = 2011:2013, S1_f1 = c(0.1, 0.4, 0.7), S1_f2 = c(0.4, 
0.7, 0.9), S2_f1 = c(0.12, 0.42, 0.72), S2_f2 = c(0.42, 0.72, 
0.5), S3_f1 = c(0.2, 0.5, 0.8), S3_f2 = c(0.5, 0.8, 0.9)), row.names = c(NA, 
-3L), class = "data.frame")
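
The question asks for 0 rather than NA outside each stock's block, and for no grouping column. A minimal sketch of one way to get there, assuming the `fill` argument of data.table's `dcast` is acceptable for this purpose; the names `long` and `res` are illustrative and not part of the original answer:

```r
library(data.table)

# Same sample input as above; as.data.table() copies, so `s` itself is not modified.
s <- data.frame(
  year  = 2011:2013,
  S1_f1 = c(0.1, 0.4, 0.7),    S1_f2 = c(0.4, 0.7, 0.9),
  S2_f1 = c(0.12, 0.42, 0.72), S2_f2 = c(0.42, 0.72, 0.5),
  S3_f1 = c(0.2, 0.5, 0.8),    S3_f2 = c(0.5, 0.8, 0.9)
)

# Long format, with the stock prefix ("S1", "S2", ...) as a grouping column.
long <- melt(as.data.table(s), id.vars = "year")
long[, grp := sub("_.*$", "", variable)]

# fill = 0 puts 0 (instead of NA) into cells that have no matching row.
res <- dcast(long, year + grp ~ variable, value.var = "value", fill = 0)
res <- res[order(grp, year)]
res[, grp := NULL]  # drop the helper column
res
```

Because `order(grp, year)` sorts the prefixes as text, the row blocks of a larger data set would come out as S1, S10, S11, ... rather than in numeric order; this affects only the row order, not the values.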

One possible way to solve your problem (note that I did not convert the data, say df, into a data.table):

library(data.table)

# "(\\d+)" captures the whole stock number, so multi-digit stocks such as S12 also work
result = sub("^S(\\d+)_.*", "\\1", names(df)[-1]) |> 
  unique() |> 
  lapply(function(i) df[sprintf(c("year", "S%s_f1", "S%s_f2"), i)]) |> 
  rbindlist(use.names=TRUE, fill=TRUE) |> 
  setnafill(fill=0)

    year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
   <int> <num> <num> <num> <num> <num> <num>
1:  2011   0.1   0.4  0.00  0.00   0.0   0.0
2:  2012   0.4   0.7  0.00  0.00   0.0   0.0
3:  2013   0.7   0.9  0.00  0.00   0.0   0.0
4:  2011   0.0   0.0  0.12  0.42   0.0   0.0
5:  2012   0.0   0.0  0.42  0.72   0.0   0.0
6:  2013   0.0   0.0  0.72  0.50   0.0   0.0
7:  2011   0.0   0.0  0.00  0.00   0.2   0.5
8:  2012   0.0   0.0  0.00  0.00   0.5   0.8
9:  2013   0.0   0.0  0.00  0.00   0.8   0.9
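
The same stacking idea can be sketched in base R without hard-coding the feature names f1 and f2. This is only an illustration under the assumption that every feature column is named `<stock>_<feature>`; the helper `stack_blocks` and its `id` argument are hypothetical names introduced here:

```r
DF <- data.frame(
  year  = 2011:2013,
  S1_f1 = c(0.1, 0.4, 0.7),    S1_f2 = c(0.4, 0.7, 0.9),
  S2_f1 = c(0.12, 0.42, 0.72), S2_f2 = c(0.42, 0.72, 0.5),
  S3_f1 = c(0.2, 0.5, 0.8),    S3_f2 = c(0.5, 0.8, 0.9)
)

# Hypothetical helper: one copy of the data per stock, with all columns
# belonging to the other stocks set to 0, stacked on top of each other.
stack_blocks <- function(df, id = "year") {
  feat <- setdiff(names(df), id)
  g <- sub("_.*$", "", feat)            # stock prefix of each feature column
  blocks <- lapply(unique(g), function(stock) {
    out <- df[c(id, feat)]
    out[feat[g != stock]] <- 0          # zero out the other stocks' columns
    out
  })
  res <- do.call(rbind, blocks)
  rownames(res) <- NULL
  res
}

res <- stack_blocks(DF)
res
```

Since `unique(g)` keeps the prefixes in column order, the blocks appear in the same order as the stocks in the input, and a stock may have any number of features.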

Using the sample data frame DF defined reproducibly in the Note at the end, create a vector g that assigns each column to a group; for the example it equals c("S1", "S1", "S2", "S2", "S3", "S3"). Use it to split the columns into a list of matrices L, one matrix per level of g. Apply bdiag from the Matrix package to that list to create a block diagonal matrix, then prepend the year column and set the column names. Matrix is a recommended package that ships with standard R distributions, so it normally does not have to be installed separately.

library(Matrix)

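g <- sub("_.*", "", names(DF)[-1])  # stock prefix of each column, e.g. "S1"
# factor levels in column order, so blocks are not re-sorted alphabetically (S1, S10, S2, ...)
f <- factor(g, levels = unique(g))
L <- tapply(as.list(DF[-1]), f, function(x) as.matrix(as.data.frame(x)))
# DF$year is recycled once per block
setNames(data.frame(DF$year, as.matrix(bdiag(L))), names(DF))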

giving:

  year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
1 2011   0.1   0.4  0.00  0.00   0.0   0.0
2 2012   0.4   0.7  0.00  0.00   0.0   0.0
3 2013   0.7   0.9  0.00  0.00   0.0   0.0
4 2011   0.0   0.0  0.12  0.42   0.0   0.0
5 2012   0.0   0.0  0.42  0.72   0.0   0.0
6 2013   0.0   0.0  0.72  0.50   0.0   0.0
7 2011   0.0   0.0  0.00  0.00   0.2   0.5
8 2012   0.0   0.0  0.00  0.00   0.5   0.8
9 2013   0.0   0.0  0.00  0.00   0.8   0.9

Note

Lines <- "
year S1_f1  S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
2011   0.1    0.4  0.12  0.42   0.2   0.5
2012   0.4    0.7  0.42  0.72   0.5   0.8
2013   0.7    0.9  0.72   0.5   0.8   0.9"
DF <- read.table(text = Lines, header = TRUE)
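
With the sizes mentioned in the question, a dense result can become very large: assuming two features per stock as in the example, 10 years and 50k predictors would give roughly 250,000 rows by 50,000 columns, which is on the order of 100 GB as dense doubles. A sketch of keeping the block diagonal matrix sparse instead, using `bdiag` and `nnzero` from Matrix; the variable names `f`, `B` and `year` are illustrative, and whether a downstream model accepts a sparse matrix is an assumption to check:

```r
library(Matrix)

DF <- data.frame(
  year  = 2011:2013,
  S1_f1 = c(0.1, 0.4, 0.7),    S1_f2 = c(0.4, 0.7, 0.9),
  S2_f1 = c(0.12, 0.42, 0.72), S2_f2 = c(0.42, 0.72, 0.5),
  S3_f1 = c(0.2, 0.5, 0.8),    S3_f2 = c(0.5, 0.8, 0.9)
)

# Group the feature columns by stock prefix, keeping the original column order.
g <- sub("_.*$", "", names(DF)[-1])
f <- factor(g, levels = unique(g))
L <- lapply(split.default(DF[-1], f), as.matrix)

# Sparse block diagonal matrix: only the entries inside the blocks are stored.
B <- bdiag(unname(L))
colnames(B) <- names(DF)[-1]

# Keep the year as a separate vector, repeated once per block.
year <- rep(DF$year, times = length(L))

dim(B)     # rows and columns of the stacked design
nnzero(B)  # number of non-zero entries actually stored
```

Only the values inside the blocks are stored, so memory grows with the number of original data points rather than with rows times columns.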
