R - preprocessing of time series data

I have the following data structure, with stocks S, each having features f:

year S1_f1  S1_f2 S2_f1 S2_f2 S3_f1 S3_f2 Sn_f1 Sn_f2
2011   0.1    0.4  0.12  0.42   0.2   0.5     n     n
2012   0.4    0.7  0.42  0.72   0.5   0.8     n     n
2013   0.7    0.9  0.72   0.5   0.8   0.9     n     n
n        n      n     n     n     n     n     n     n

My original df has 10 observations but 50k+ predictors, so I want a better balance on the observation side.

Hence, I want to have the following data frame:

year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2 Sn_f1 Sn_f2
2011   0.1   0.4     0     0     0     0     0     0
2012   0.4   0.7     0     0     0     0     0     0
2013   0.7   0.9     0     0     0     0     0     0
2011     0     0  0.12  0.42     0     0     0     0
2012     0     0  0.42  0.72     0     0     0     0
2013     0     0  0.72   0.5     0     0     0     0
2011     0     0     0     0   0.2   0.5     0     0
2012     0     0     0     0   0.5   0.8     0     0
2013     0     0     0     0   0.8   0.9     0     0
n        0     0     0     0     0     0     n     n

...and so on (example values).

I want to artificially multiply my timestamps via this approach.

Is there an elegant way to do this?

You can convert what you have into what you want using the following code:

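library(data.table)

# `s` is the data frame listed under "Input" below
dcast(
  # long format, plus a helper column holding the stock prefix (S1, S2, ...)
  melt(setDT(s), id.vars = "year")[, grp := gsub("_.*$", "", variable)],
  year + grp ~ variable,   # one row per year and stock
  value.var = "value"
)[order(grp, year)]        # cells outside a stock's own columns are NA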

Output:

    year    grp S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
   <int> <char> <num> <num> <num> <num> <num> <num>
1:  2011     S1   0.1   0.4    NA    NA    NA    NA
2:  2012     S1   0.4   0.7    NA    NA    NA    NA
3:  2013     S1   0.7   0.9    NA    NA    NA    NA
4:  2011     S2    NA    NA  0.12  0.42    NA    NA
5:  2012     S2    NA    NA  0.42  0.72    NA    NA
6:  2013     S2    NA    NA  0.72  0.50    NA    NA
7:  2011     S3    NA    NA    NA    NA   0.2   0.5
8:  2012     S3    NA    NA    NA    NA   0.5   0.8
9:  2013     S3    NA    NA    NA    NA   0.8   0.9

Input:

s <- structure(list(year = 2011:2013, S1_f1 = c(0.1, 0.4, 0.7), S1_f2 = c(0.4, 
0.7, 0.9), S2_f1 = c(0.12, 0.42, 0.72), S2_f2 = c(0.42, 0.72, 
0.5), S3_f1 = c(0.2, 0.5, 0.8), S3_f2 = c(0.5, 0.8, 0.9)), row.names = c(NA, 
-3L), class = "data.frame")
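
If zeros are wanted instead of NA, as in the question's target layout, one option is the fill argument of dcast. This is a minimal sketch on the same sample data; the names long and wide are only illustrative, and the grp helper column is dropped at the end.

```r
library(data.table)

# Sample data from the "Input" section above
s <- data.frame(
  year  = 2011:2013,
  S1_f1 = c(0.1, 0.4, 0.7),    S1_f2 = c(0.4, 0.7, 0.9),
  S2_f1 = c(0.12, 0.42, 0.72), S2_f2 = c(0.42, 0.72, 0.5),
  S3_f1 = c(0.2, 0.5, 0.8),    S3_f2 = c(0.5, 0.8, 0.9)
)

# Long format: one row per (year, column) pair
long <- melt(as.data.table(s), id.vars = "year")
# Stock prefix of each original column (S1, S2, ...)
long[, grp := sub("_.*$", "", variable)]

# fill = 0 puts zeros into the cells that belong to other stocks
wide <- dcast(long, year + grp ~ variable, value.var = "value", fill = 0)
wide <- wide[order(grp, year)]
wide[, grp := NULL]  # drop the helper column
```

as.data.table() is used here instead of setDT() so that the original data frame s is left unchanged.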

One possible way to solve your problem (note that I did not convert the data, say df, into a data.table):

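library(data.table)

# capture the whole stock number so that multi-digit ids such as S10 work
result = sub("^S(\\d+)_.*", "\\1", names(df)[-1]) |>
  unique() |>
  # df is a plain data.frame here, so df[<names>] selects columns
  lapply(function(i) df[c("year", sprintf(c("S%s_f1", "S%s_f2"), i))]) |>
  rbindlist(use.names=TRUE, fill=TRUE) |>
  setnafill(fill=0)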

    year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
   <int> <num> <num> <num> <num> <num> <num>
1:  2011   0.1   0.4  0.00  0.00   0.0   0.0
2:  2012   0.4   0.7  0.00  0.00   0.0   0.0
3:  2013   0.7   0.9  0.00  0.00   0.0   0.0
4:  2011   0.0   0.0  0.12  0.42   0.0   0.0
5:  2012   0.0   0.0  0.42  0.72   0.0   0.0
6:  2013   0.0   0.0  0.72  0.50   0.0   0.0
7:  2011   0.0   0.0  0.00  0.00   0.2   0.5
8:  2012   0.0   0.0  0.00  0.00   0.5   0.8
9:  2013   0.0   0.0  0.00  0.00   0.8   0.9
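
The hard-coded f1/f2 names can be avoided by grouping the columns on their prefix. This is only a sketch, under the assumption that every feature column is named <stock>_<feature>; feat_cols, prefix, groups and blocks are illustrative names.

```r
library(data.table)

# Same sample data as in the thread, as a plain data.frame
df <- data.frame(
  year  = 2011:2013,
  S1_f1 = c(0.1, 0.4, 0.7),    S1_f2 = c(0.4, 0.7, 0.9),
  S2_f1 = c(0.12, 0.42, 0.72), S2_f2 = c(0.42, 0.72, 0.5),
  S3_f1 = c(0.2, 0.5, 0.8),    S3_f2 = c(0.5, 0.8, 0.9)
)

feat_cols <- names(df)[-1]
prefix <- sub("_.*$", "", feat_cols)

# Keep groups in order of first appearance (so S10 does not sort before S2)
groups <- split(feat_cols, factor(prefix, levels = unique(prefix)))

# One data frame per stock: year plus that stock's own columns
blocks <- lapply(groups, function(cols) df[c("year", cols)])

# Stack the blocks; columns missing from a block are created as NA
result <- rbindlist(blocks, use.names = TRUE, fill = TRUE)

# Replace NA by 0 in the feature columns only, by reference
setnafill(result, fill = 0, cols = feat_cols)
setcolorder(result, names(df))
```

Each stock may then have a different number of features, as long as the prefix before the first underscore identifies the stock.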

Using the sample data frame DF defined reproducibly in the Note at the end, create a vector g that assigns each feature column to a group; for the example it equals c("S1", "S1", "S2", "S2", "S3", "S3"). Then use g to split the columns into a list of matrices L, one matrix for each level of g. Apply bdiag from the Matrix package to that list to create a block diagonal matrix, then insert the year column and set the column names. Note that Matrix is a recommended package that ships with standard R installations, so nothing beyond what comes with R needs to be installed.

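library(Matrix)

# stock prefix of each feature column: "S1" "S1" "S2" "S2" "S3" "S3"
g <- sub("_.*", "", names(DF)[-1])
# one matrix per stock, holding that stock's feature columns
L <- tapply(as.list(DF[-1]), g, function(x) as.matrix(as.data.frame(x)))
# stack the blocks diagonally, prepend the (recycled) years, restore the names
setNames(data.frame(DF$year, as.matrix(bdiag(L))), names(DF))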

giving:

  year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
1 2011   0.1   0.4  0.00  0.00   0.0   0.0
2 2012   0.4   0.7  0.00  0.00   0.0   0.0
3 2013   0.7   0.9  0.00  0.00   0.0   0.0
4 2011   0.0   0.0  0.12  0.42   0.0   0.0
5 2012   0.0   0.0  0.42  0.72   0.0   0.0
6 2013   0.0   0.0  0.72  0.50   0.0   0.0
7 2011   0.0   0.0  0.00  0.00   0.2   0.5
8 2012   0.0   0.0  0.00  0.00   0.5   0.8
9 2013   0.0   0.0  0.00  0.00   0.8   0.9

Note

Lines <- "
year S1_f1  S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
2011   0.1    0.4  0.12  0.42   0.2   0.5
2012   0.4    0.7  0.42  0.72   0.5   0.8
2013   0.7    0.9  0.72   0.5   0.8   0.9"
DF <- read.table(text = Lines, header = TRUE)
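
For comparison, the same block layout can be built without any package by filling a zero matrix one stock at a time. This is a minimal sketch, assuming DF as defined in the Note; groups, out and result are illustrative names.

```r
# Sample data, as in the Note
Lines <- "
year S1_f1  S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
2011   0.1    0.4  0.12  0.42   0.2   0.5
2012   0.4    0.7  0.42  0.72   0.5   0.8
2013   0.7    0.9  0.72   0.5   0.8   0.9"
DF <- read.table(text = Lines, header = TRUE)

g <- sub("_.*", "", names(DF)[-1])  # stock prefix per feature column
groups <- unique(g)                 # stocks in order of appearance
n <- nrow(DF)                       # number of years

# All-zero matrix: one block of n rows per stock, one column per feature
out <- matrix(0, nrow = n * length(groups), ncol = length(g),
              dimnames = list(NULL, names(DF)[-1]))

for (k in seq_along(groups)) {
  rows <- (k - 1) * n + seq_len(n)  # rows reserved for stock k
  cols <- which(g == groups[k])     # columns that belong to stock k
  out[rows, cols] <- as.matrix(DF[-1][, cols, drop = FALSE])
}

# Repeat the years once per stock and attach them as the first column
result <- data.frame(year = rep(DF$year, times = length(groups)), out)
```

The loop writes each stock's values into its own diagonal block, which is the same placement that bdiag performs.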
