Find missing values by linear interpolation (time series)
I have a data.frame called df1 which holds one row for each month over three years (36 rows x 4 columns):
Year Month v1 v2 v3
1 2015 1 15072.73 2524.102 17596.83
2 2015 2 15249.54 2597.265 17846.80
3 2015 3 15426.35 2670.427 18096.78
4 2015 4 15603.16 2743.590 18346.75
5 2015 5 15779.97 2816.752 18596.72
6 2015 6 15956.78 2889.915 18846.69
7 2015 7 16133.59 2963.077 19096.67
8 2015 8 16310.40 3036.240 19346.64
9 2015 9 16487.21 3109.402 19596.61
10 2015 10 16664.02 3182.565 19846.58
11 2015 11 16840.83 3255.727 20096.56
12 2015 12 17017.64 3328.890 20346.53
13 2016 1 17018.35 3328.890 20347.24
14 2016 2 17019.05 3328.890 20347.94
15 2016 3 17019.76 3328.890 20348.65
16 2016 4 17020.47 3328.890 20349.36
17 2016 5 17021.17 3328.890 20350.06
18 2016 6 17021.88 3328.890 20350.77
19 2016 7 17022.58 3328.890 20351.47
20 2016 8 17023.29 3328.890 20352.18
21 2016 9 17024.00 3328.890 20352.89
22 2016 10 17024.70 3328.890 20353.59
23 2016 11 17025.41 3328.890 20354.30
24 2016 12 17026.12 3328.890 20355.01
25 2017 1 17023.94 3328.890 20352.83
26 2017 2 17021.76 3328.890 20350.65
27 2017 3 17019.58 3328.890 20348.47
28 2017 4 17017.40 3328.890 20346.29
29 2017 5 17015.22 3328.890 20344.11
30 2017 6 17013.04 3328.890 20341.93
31 2017 7 17010.86 3328.890 20339.75
32 2017 8 17008.68 3328.890 20337.57
33 2017 9 17006.50 3328.890 20335.39
34 2017 10 17004.32 3328.890 20333.21
35 2017 11 17002.14 3328.890 20331.03
36 2017 12 17002.14 3328.890 20331.03
I want to interpolate all of these values in order to obtain values for every day of each month. The days are in the data.frame called df2 (1096 x 1).
df2 looks like:
seq(start, end, by = "days")
1 2015-01-01
2 2015-01-02
3 2015-01-03
4 2015-01-04
5 2015-01-05
6 2015-01-06
This way I should obtain an output data.frame called results with 1096 rows (365 days (2015) + 366 days (2016) + 365 days (2017)) and 4 columns.
I have tried with approx:
results <- as.data.frame(approx(x = df1, y = NULL, xout = df2 ,
method = "linear"))
But it returns:
x y
1 2015-01-01 NA
2 2015-01-02 NA
3 2015-01-03 NA
4 2015-01-04 NA
5 2015-01-05 NA
6 2015-01-06 NA
Thanks for help!
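A likely reason for the NA values, offered as a hedged reading rather than a confirmed diagnosis: approx() expects one numeric vector x and one numeric vector y, and with the default rule = 1 it returns NA for every xout outside the range of x. When a whole data.frame is passed as x, the x values actually used are presumably not the dates, so every daily date falls outside that range. The toy sketch below uses the first two rows of df1; the names bad and good are illustrative only.

```r
# Two monthly anchor points (first two rows of df1) and a short daily grid
x    <- as.Date(c("2015-01-01", "2015-02-01"))
y    <- c(15072.73, 15249.54)
xout <- seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by = "day")

# Dates are stored as days since 1970-01-01, so they convert cleanly to numbers
as.numeric(x)

# x values that are nowhere near the dates (here: years instead of dates):
# every xout lies outside range(x), so rule = 1 (the default) yields NA
bad <- approx(x = c(2015, 2016), y = c(1, 2), xout = as.numeric(xout))$y

# one date vector for x and one value vector for y: interpolation works
good <- approx(x = as.numeric(x), y = y, xout = as.numeric(xout))$y

print(bad)
print(good)
```

So the call needs a date for each monthly value as x and a single value column as y, repeated for each of v1, v2 and v3.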
For the sake of completeness, here is a solution which uses data.table.
The OP has provided data points for each month of 2015 to 2017. The OP has not defined the day of the month to which the values are attributed, nor specified what type of interpolation is expected.
So, the given data look as follows (only v1 shown for simplicity):
Note that the monthly value was deliberately assigned to the first day of the month.
There are different ways to interpolate data. We will look at two of them.
As only one data point per month is given, we can assume that the value is representative for each day of the respective month (piecewise constant interpolation):
(Plotted with geom_step())
For interpolation, the base R function approx() is used. approx() is applied to all value columns v1, v2, v3 with the help of lapply().
But first we need to turn the year-month into a full-fledged date (including the day). The first day of the month has been chosen deliberately. Now, the data points in df1 are attributed to the dates 2015-01-01 to 2017-12-01. Note that there is no given value for 2017-12-31 or 2018-01-01.
library(data.table)
# create date (assuming the 1st of month)
setDT(df1)[, date := as.IDate(paste(Year, Month, 1, sep = "-"))]
# create sequence of days covering the whole period
ds <- seq(as.IDate("2015-01-01"), as.IDate("2017-12-31"), by = "1 day")
# perform interpolation
cols = c("v1", "v2", "v3")
results <- df1[, c(.(date = ds), lapply(.SD, function(y)
approx(x = date, y = y, xout = ds, method = "constant", rule = 2)$y)),
.SDcols = cols]
results
date v1 v2 v3
1: 2015-01-01 15072.73 2524.102 17596.83
2: 2015-01-02 15072.73 2524.102 17596.83
3: 2015-01-03 15072.73 2524.102 17596.83
4: 2015-01-04 15072.73 2524.102 17596.83
5: 2015-01-05 15072.73 2524.102 17596.83
---
1092: 2017-12-27 17002.14 3328.890 20331.03
1093: 2017-12-28 17002.14 3328.890 20331.03
1094: 2017-12-29 17002.14 3328.890 20331.03
1095: 2017-12-30 17002.14 3328.890 20331.03
1096: 2017-12-31 17002.14 3328.890 20331.03
By specifying rule = 2, approx() is told to use the last given values (the ones for 2017-12-01) to complete the sequence up to 2017-12-31.
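To make the effect of rule visible, here is a minimal sketch with two made-up values (10 and 20, not taken from df1) attached to 2017-11-01 and 2017-12-01. With method = "constant", each value is carried forward until the next data point. Beyond the last data point, rule = 1 returns NA while rule = 2 repeats the last value.

```r
# Two made-up monthly values; dates converted to numbers for approx()
x    <- as.numeric(as.Date(c("2017-11-01", "2017-12-01")))
y    <- c(10, 20)
xout <- as.numeric(as.Date(c("2017-11-30", "2017-12-01", "2017-12-31")))

# rule = 1 (default): no value is returned outside range(x)
r1 <- approx(x, y, xout = xout, method = "constant", rule = 1)$y

# rule = 2: the value at the nearest end point is used outside range(x)
r2 <- approx(x, y, xout = xout, method = "constant", rule = 2)$y

print(r1)  # 10 20 NA
print(r2)  # 10 20 20
```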
The result can be plotted on top of the given data points.
For piecewise linear interpolation, two points must be given to draw a line segment. In order to draw line segments for 36 intervals (months), we need 37 data points. Unfortunately, the OP has given only 36 data points. We would need an additional data point for 2018-01-01 to draw a line for the last month.
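The gap can be counted directly. The sketch below uses placeholder values 1 to 36 instead of the real v1 column (an assumption made only to keep the example short) and the default rule = 1, so the days that no line segment covers come back as NA.

```r
# 36 monthly anchor dates (first of each month) and the full daily grid
month_start <- seq(as.Date("2015-01-01"), by = "month", length.out = 36)
ds <- seq(as.Date("2015-01-01"), as.Date("2017-12-31"), by = "day")

# placeholder values, one per month
y <- seq_len(36)

# linear interpolation without any rule for the days after the last anchor
res <- approx(x = as.numeric(month_start), y = y,
              xout = as.numeric(ds), method = "linear")$y

n_missing <- sum(is.na(res))
print(n_missing)              # 30
print(range(ds[is.na(res)]))  # "2017-12-02" "2017-12-31"
```

All 30 uncovered days lie after 2017-12-01, which is exactly the interval a 37th data point would close.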
One option in this case is to assume that the values stay constant during the last month. This is what approx() does when method = "linear" and rule = 2 are specified.
library(data.table)
# create date (assuming the 1st of month)
setDT(df1)[, date := as.IDate(paste(Year, Month, 1, sep = "-"))]
# create sequence of days covering the whole period
ds <- seq(as.IDate("2015-01-01"), as.IDate("2017-12-31"), by = "1 day")
# perform interpolation
cols = c("v1", "v2", "v3")
results <- df1[, c(.(date = ds), lapply(.SD, function(y)
approx(x = date, y = y, xout = ds, method = "linear", rule = 2)$y)),
.SDcols = cols]
results
date v1 v2 v3
1: 2015-01-01 15072.73 2524.102 17596.83
2: 2015-01-02 15078.43 2526.462 17604.89
3: 2015-01-03 15084.14 2528.822 17612.96
4: 2015-01-04 15089.84 2531.182 17621.02
5: 2015-01-05 15095.54 2533.542 17629.08
---
1092: 2017-12-27 17002.14 3328.890 20331.03
1093: 2017-12-28 17002.14 3328.890 20331.03
1094: 2017-12-29 17002.14 3328.890 20331.03
1095: 2017-12-30 17002.14 3328.890 20331.03
1096: 2017-12-31 17002.14 3328.890 20331.03
In the sample dataset, the values for 2016 and 2017 are rather flat, so holding the last month constant is hardly noticeable anyway.
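As a sanity check, the first rows of the linear result can be reproduced by hand. January 2015 has 31 days, so the daily step for v1 is (15249.54 - 15072.73) / 31. The sketch below uses only the first two v1 values from df1; the variable names are illustrative.

```r
# First two monthly v1 values and their dates as numbers
x  <- as.numeric(as.Date(c("2015-01-01", "2015-02-01")))
v1 <- c(15072.73, 15249.54)
xout <- as.numeric(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by = "day"))

# slope of the line segment: value change divided by the number of days (31)
daily_step <- diff(v1) / diff(x)

# value on day k after the anchor = anchor value + k * daily step
by_hand   <- v1[1] + daily_step * (0:4)
by_approx <- approx(x, v1, xout = xout, method = "linear")$y

print(round(by_hand, 2))  # 15072.73 15078.43 15084.14 15089.84 15095.54
print(all.equal(by_hand, by_approx))
```

The rounded values match rows 1 to 5 of the v1 column shown above.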
You are almost there. There are just some details that should be added.
First of all, I have the impression that you have omitted the year value from your data. However, it is important to have a year value when working with dates. I suppose your data should look like this:
Year Month v1 v2 v3
1 2015 1 15072.73 2524.102 17596.83
2 2015 2 15249.54 2597.265 17846.80
3 2015 3 15426.35 2670.427 18096.78
4 2015 4 15603.16 2743.590 18346.75
5 2015 5 15779.97 2816.752 18596.72
6 2015 6 15956.78 2889.915 18846.69
7 2015 7 16133.59 2963.077 19096.67
8 2015 8 16310.40 3036.240 19346.64
9 2015 9 16487.21 3109.402 19596.61
10 2015 10 16664.02 3182.565 19846.58
11 2015 11 16840.83 3255.727 20096.56
12 2015 12 17017.64 3328.890 20346.53
Another question is which day of the month is implied for the monthly values given by df1. Let's suppose that it is the first day of the month. Then the solution may be obtained like this:
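data_names <- c("v1", "v2", "v3")
# dates of the monthly values (assuming the 1st of each month)
month_start <- as.Date(paste(df1$Year, df1$Month, "01", sep = "-"))
# interpolate each variable onto the daily dates held in the single column of df2
res_set <- lapply(
  X = data_names,
  FUN = function(var_name) approx(
    x = month_start,
    y = df1[[var_name]],
    xout = df2[[1]]))
# name each item of the list to make further work simpler
names(res_set) <- data_names
str(res_set)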
Please note that the result of lapply() is a list. Some additional work is needed to obtain the desired format. If you need a single data frame for all the variables, then you may use:
res_df <- data.frame(x = df2, lapply(res_set,`[[`, "y"))
If you prefer a list of two-column data frames, then an option is:
res_list <- lapply(res_set, as.data.frame)
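For readers who want to try the base R approach without the full dataset, here is a small self-contained sketch. The toy df1 holds only the first three months of v1 and v2, and the column name day in df2 is an assumption, since the OP's column name is not a plain identifier. The result has the shape the OP asked for: one date column plus one column per variable.

```r
# Toy stand-ins for the OP's df1 and df2 (first three months, two variables)
df1 <- data.frame(Year = 2015, Month = 1:3,
                  v1 = c(15072.73, 15249.54, 15426.35),
                  v2 = c(2524.102, 2597.265, 2670.427))
df2 <- data.frame(day = seq(as.Date("2015-01-01"), as.Date("2015-03-01"), by = "days"))

# attribute each monthly value to the first day of its month
anchor <- as.Date(paste(df1$Year, df1$Month, "01", sep = "-"))
value_cols <- c("v1", "v2")

# interpolate every value column onto the daily grid and bind the columns
results <- data.frame(
  date = df2$day,
  lapply(df1[value_cols], function(y)
    approx(x = as.numeric(anchor), y = y, xout = as.numeric(df2$day))$y)
)

head(results)
```

No rule argument is needed here because the toy daily grid ends on the last anchor date; for the full 2015 to 2017 range, the days after 2017-12-01 would need rule = 2 or an extra data point.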