
Find missing values by linear interpolation (time series)

I have a data.frame called df1 with one row per month over three years (36 rows; columns Year, Month, v1, v2, v3):

   Year Month       v1       v2       v3
1  2015     1 15072.73 2524.102 17596.83
2  2015     2 15249.54 2597.265 17846.80
3  2015     3 15426.35 2670.427 18096.78
4  2015     4 15603.16 2743.590 18346.75
5  2015     5 15779.97 2816.752 18596.72
6  2015     6 15956.78 2889.915 18846.69
7  2015     7 16133.59 2963.077 19096.67
8  2015     8 16310.40 3036.240 19346.64
9  2015     9 16487.21 3109.402 19596.61
10 2015    10 16664.02 3182.565 19846.58
11 2015    11 16840.83 3255.727 20096.56
12 2015    12 17017.64 3328.890 20346.53
13 2016     1 17018.35 3328.890 20347.24
14 2016     2 17019.05 3328.890 20347.94
15 2016     3 17019.76 3328.890 20348.65
16 2016     4 17020.47 3328.890 20349.36
17 2016     5 17021.17 3328.890 20350.06
18 2016     6 17021.88 3328.890 20350.77
19 2016     7 17022.58 3328.890 20351.47
20 2016     8 17023.29 3328.890 20352.18
21 2016     9 17024.00 3328.890 20352.89
22 2016    10 17024.70 3328.890 20353.59
23 2016    11 17025.41 3328.890 20354.30
24 2016    12 17026.12 3328.890 20355.01
25 2017     1 17023.94 3328.890 20352.83
26 2017     2 17021.76 3328.890 20350.65
27 2017     3 17019.58 3328.890 20348.47
28 2017     4 17017.40 3328.890 20346.29
29 2017     5 17015.22 3328.890 20344.11
30 2017     6 17013.04 3328.890 20341.93
31 2017     7 17010.86 3328.890 20339.75
32 2017     8 17008.68 3328.890 20337.57
33 2017     9 17006.50 3328.890 20335.39
34 2017    10 17004.32 3328.890 20333.21
35 2017    11 17002.14 3328.890 20331.03
36 2017    12 17002.14 3328.890 20331.03

I want to interpolate these values in order to obtain a value for every day of each month. The days are in a data.frame called df2 (1096 x 1).

df2 looks like:

  seq(start, end, by = "days")
1                   2015-01-01
2                   2015-01-02
3                   2015-01-03
4                   2015-01-04
5                   2015-01-05
6                   2015-01-06

This way I should obtain an output data.frame called results with 1096 rows (365 days (2015) + 366 days (2016) + 365 days (2017)) and 4 columns.
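For reference, a daily sequence like the one in df2 can be built as sketched below. The names start, end and days and the two boundary dates are assumptions taken from the period described above; the check only confirms the expected row count.

```r
# Assumed boundaries: first day of 2015 and last day of 2017
start <- as.Date("2015-01-01")
end <- as.Date("2017-12-31")

# One Date per calendar day, both ends included
days <- seq(start, end, by = "day")

# 365 (2015) + 366 (2016, leap year) + 365 (2017)
length(days)  # 1096
```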

I have tried with approx:

results <- as.data.frame(approx(x = df1, y = NULL, xout = df2 ,
                             method = "linear"))

But it returns:

           x  y
1 2015-01-01 NA
2 2015-01-02 NA
3 2015-01-03 NA
4 2015-01-04 NA
5 2015-01-05 NA
6 2015-01-06 NA

Thanks for help!

For the sake of completeness, here is a solution which uses data.table.

The OP has provided data points for each month of 2015 to 2017, but has not defined the day of the month to which the values are attributed. Furthermore, the OP has not specified what type of interpolation is expected.

So, the given data look as follows (only v1 shown for simplicity):

(image: plot of the given data, v1 only)

Note that the monthly value was deliberately assigned to the first day of the month.

There are different ways to interpolate data. We will look at two of them.

Piecewise constant interpolation

As only one data point per month is given, we can reasonably assume that the value is representative of each day of the respective month:

(image: step plot of the monthly values)

(Plotted with geom_step())

For interpolation, the base R function approx() is used. approx() is applied to all value columns v1, v2, v3 with the help of lapply().

But first we need to turn the year-month into a full-fledged date (including day). The first day of the month has been chosen deliberately. Now, the data points in df1 are attributed to the dates 2015-01-01 to 2017-12-01. Note that there is no given value for 2017-12-31 or 2018-01-01.

library(data.table)
library(magrittr)
# create date (assuming the 1st of month)
setDT(df1)[, date := as.IDate(paste(Year, Month, 1, sep = "-"))]
# create sequence of days covering the whole period
ds <- seq(as.IDate("2015-01-01"), as.IDate("2017-12-31"), by = "1 day")
# perform interpolation
cols = c("v1", "v2", "v3")
results <- df1[, c(.(date = ds), lapply(.SD, function(y) 
  approx(x = date, y = y, xout = ds, method = "constant", rule = 2)$y)), 
  .SDcols = cols]
results

            date       v1       v2       v3
   1: 2015-01-01 15072.73 2524.102 17596.83
   2: 2015-01-02 15072.73 2524.102 17596.83
   3: 2015-01-03 15072.73 2524.102 17596.83
   4: 2015-01-04 15072.73 2524.102 17596.83
   5: 2015-01-05 15072.73 2524.102 17596.83
  ---                                      
1092: 2017-12-27 17002.14 3328.890 20331.03
1093: 2017-12-28 17002.14 3328.890 20331.03
1094: 2017-12-29 17002.14 3328.890 20331.03
1095: 2017-12-30 17002.14 3328.890 20331.03
1096: 2017-12-31 17002.14 3328.890 20331.03

By specifying rule = 2, approx() was told to use the last given values (the ones for 2017-12-01) to complete the sequence up to 2017-12-31.
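The effect of rule can be seen in isolation with a small sketch. It uses only the last three v1 values from df1, dated to the first of each month as assumed above; the variable names are illustrative and not part of the solution itself.

```r
# Last three monthly v1 values of df1, attributed to the 1st of each month
x <- as.Date(c("2017-10-01", "2017-11-01", "2017-12-01"))
y <- c(17004.32, 17002.14, 17002.14)

# Two days inside the given range and one day after the last data point
xout <- as.Date(c("2017-10-15", "2017-11-15", "2017-12-31"))

# rule = 1 (the default): days outside the range of x yield NA
y_rule1 <- approx(x, y, xout = xout, method = "constant", rule = 1)$y

# rule = 2: the value at the closest data extreme is used instead
y_rule2 <- approx(x, y, xout = xout, method = "constant", rule = 2)$y

print(y_rule1)  # third element is NA
print(y_rule2)  # third element repeats the 2017-12-01 value
```

With rule = 1 the days after 2017-12-01 would be left as NA, so rule = 2 is what fills December 2017 in the result above.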

The result can be plotted on top of the given data points.

(image: interpolated result plotted on top of the given data points)

Piecewise linear interpolation

To draw a line segment, two points must be given. In order to draw line segments for 36 intervals (months), we need 37 data points. Unfortunately, the OP has given only 36 data points. We would need an additional data point for 2018-01-01 to draw a line for the last month.

One of the options in this case is to assume that the values for the last month are constant. This is what approx() does when method = "linear" and rule = 2 are specified.

library(data.table)
library(magrittr)
# create date (assuming the 1st of month)
setDT(df1)[, date := as.IDate(paste(Year, Month, 1, sep = "-"))]
# create sequence of days covering the whole period
ds <- seq(as.IDate("2015-01-01"), as.IDate("2017-12-31"), by = "1 day")
# perform interpolation
cols = c("v1", "v2", "v3")
results <- df1[, c(.(date = ds), lapply(.SD, function(y) 
  approx(x = date, y = y, xout = ds, method = "linear", rule = 2)$y)), 
  .SDcols = cols]
results

            date       v1       v2       v3
   1: 2015-01-01 15072.73 2524.102 17596.83
   2: 2015-01-02 15078.43 2526.462 17604.89
   3: 2015-01-03 15084.14 2528.822 17612.96
   4: 2015-01-04 15089.84 2531.182 17621.02
   5: 2015-01-05 15095.54 2533.542 17629.08
  ---                                      
1092: 2017-12-27 17002.14 3328.890 20331.03
1093: 2017-12-28 17002.14 3328.890 20331.03
1094: 2017-12-29 17002.14 3328.890 20331.03
1095: 2017-12-30 17002.14 3328.890 20331.03
1096: 2017-12-31 17002.14 3328.890 20331.03
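As a quick plausibility check of the first rows, the daily step within January 2015 can be computed by hand and compared with approx(). This is a sketch on the first two v1 data points only; by_hand and by_approx are illustrative names.

```r
# First two monthly v1 values, attributed to the 1st of each month
x <- as.Date(c("2015-01-01", "2015-02-01"))
y <- c(15072.73, 15249.54)

# Slope of the line segment: value difference divided by the 31 days of January
daily_step <- diff(y) / as.numeric(diff(x))

# Values for 2015-01-02 and 2015-01-03, computed by hand and by approx()
by_hand <- y[1] + daily_step * 1:2
by_approx <- approx(x, y, xout = x[1] + 1:2, method = "linear")$y

round(by_hand, 2)    # 15078.43 15084.14
round(by_approx, 2)  # 15078.43 15084.14
```

Both agree with rows 2 and 3 of the v1 column above.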

(image: plot of the linearly interpolated result)

In the sample dataset, the values for 2016 and 2017 are rather flat, so the constant interpolation for the last month is hardly noticeable anyway.

You are almost there. There are just some details that should be added.

First of all, I have the impression that you have omitted the year value from your data. However, it's important to have a year value when working with dates. I suppose your data should look like this:

      Year Month       v1       v2         v3
1     2015     1 15072.73 2524.102   17596.83
2     2015     2 15249.54 2597.265   17846.80
3     2015     3 15426.35 2670.427   18096.78
4     2015     4 15603.16 2743.590   18346.75
5     2015     5 15779.97 2816.752   18596.72
6     2015     6 15956.78 2889.915   18846.69
7     2015     7 16133.59 2963.077   19096.67
8     2015     8 16310.40 3036.240   19346.64
9     2015     9 16487.21 3109.402   19596.61
10    2015    10 16664.02 3182.565   19846.58
11    2015    11 16840.83 3255.727   20096.56
12    2015    12 17017.64 3328.890   20346.53

Another question is which day of the month is implied for the monthly values given by df1. Let's suppose that it is the first day of the month. Then the solution may be obtained as follows:

data_names <- c("v1", "v2", "v3")
res_set <- lapply(
    function(var_name) approx(
        x = as.Date(paste(df1$Year, df1$Month, "01", sep = "-")), 
        y = df1[, var_name], xout = df2), 
    X = data_names)
# name each item of the list to make further work simpler
names(res_set) <- data_names
# str() prints the structure itself; wrapping it in print() would only add a stray NULL
str(res_set)

Please note that the result of lapply() is a list. Some additional work is needed to obtain the desired format. If you need a single data frame for all the variables, then you may use:

res_df <- data.frame(x = df2, lapply(res_set,`[[`,  "y"))  

If you prefer a list of two-column data frames, then an option is:

res_list <- lapply(res_set, as.data.frame)
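The steps above can be tried end to end on a small self-contained sample. This is only a sketch: it uses the first three months of df1, and it assumes the target days are held in a plain Date vector (here called days) rather than in a one-column data frame.

```r
# Toy input: the first three months of df1 from the question
df1 <- data.frame(Year = 2015, Month = 1:3,
                  v1 = c(15072.73, 15249.54, 15426.35),
                  v2 = c(2524.102, 2597.265, 2670.427),
                  v3 = c(17596.83, 17846.80, 18096.78))

# Target days as a Date vector (assumption: a vector, not a data frame)
days <- seq(as.Date("2015-01-01"), as.Date("2015-03-01"), by = "day")

# Monthly values are attributed to the first day of each month
month_start <- as.Date(paste(df1$Year, df1$Month, "01", sep = "-"))

# Interpolate each value column separately; the result is a named list
data_names <- c("v1", "v2", "v3")
res_set <- lapply(data_names, function(var_name)
  approx(x = month_start, y = df1[[var_name]], xout = days))
names(res_set) <- data_names

# Keep only the interpolated y values and bind them to the dates
res_df <- data.frame(date = days, lapply(res_set, `[[`, "y"))
head(res_df, 3)
```

Because days ends on the last given date (2015-03-01), no value lies outside the data range and the default rule = 1 produces no NA here.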
