
Resampling data monthly in R or Python

I have data recorded in the following format.

Input

name             year            value 
Afghanistan      1800            68
Albania          1800            23
Algeria          1800            54

Afghanistan      1801            59
Albania          1801            38
Algeria          1801            72

---
Afghanistan      2040            142
Albania          2040            165
Algeria          2040            120

I want to resample all of the data recorded from 1800 to 2040 at a 1-month frequency, using exactly the format shown below.

Expected output

name             year            value 
Afghanistan      Jan 1800        5.6667  
Afghanistan      Feb 1800        11.3333    
Afghanistan      Mar 1800        17.0000    
Afghanistan      Apr 1800        22.6667    
Afghanistan      May 1800        28.3333    
Afghanistan      Jun 1800        34.0000    
Afghanistan      Jul 1800        39.6667    
Afghanistan      Aug 1800        45.3333    
Afghanistan      Sep 1800        51.0000
Afghanistan      Oct 1800        56.6667
Afghanistan      Nov 1800        62.3333
Afghanistan      Dec 1800        68.0000      
Albania          Jan 1800        1.9167
Albania          Feb 1800        3.8333
Albania          Mar 1800        5.7500
Albania          Apr 1800        7.6667
Albania          May 1800        9.5833
Albania          Jun 1800        11.5000
Albania          Jul 1800        13.4167
Albania          Aug 1800        15.3333
Albania          Sep 1800        17.2500
Albania          Oct 1800        19.1667
Albania          Nov 1800        21.0833
Albania          Dec 1800        23.0000
Algeria          Jan 1800        4.5000
Algeria          Feb 1800        9.0000
Algeria          Mar 1800        13.5000
Algeria          Apr 1800        18.0000
Algeria          May 1800        22.5000
Algeria          Jun 1800        27.0000
Algeria          Jul 1800        31.5000
Algeria          Aug 1800        36.0000
Algeria          Sep 1800        40.5000
Algeria          Oct 1800        45.0000
Algeria          Nov 1800        49.5000
Algeria          Dec 1800        54.0000

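I would like my data to look like the above for every year, i.e. from 1800 to 2040. The value column is interpolated. Note: my model accepts months as the abbreviations shown above.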
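For reference, the 1800 rows of the expected output are consistent with a straight-line ramp from 0 up to the annual value in December, i.e. value * m / 12 for month m. The short Python sketch below reproduces that arithmetic. The helper name `ramp_within_year` is made up for illustration, and starting later years from the previous year's value is only an assumption, since the question shows 1800 alone.

```python
import calendar

def ramp_within_year(prev_value, value):
    """Return 12 (month_abbr, value) pairs rising linearly to `value` in December.

    ASSUMPTION: the year starts from `prev_value` (0 for the first year,
    the previous year's value afterwards).
    """
    step = (value - prev_value) / 12
    return [(calendar.month_abbr[m], prev_value + step * m) for m in range(1, 13)]

# First three months of Afghanistan in 1800 (annual value 68, ramping up from 0).
for label, v in ramp_within_year(0, 68)[:3]:
    print(f"Afghanistan  {label} 1800  {v:.4f}")
# Afghanistan  Jan 1800  5.6667
# Afghanistan  Feb 1800  11.3333
# Afghanistan  Mar 1800  17.0000
```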

My most recent attempt is below, but it did not produce the expected result.

data['year'] = pd.to_datetime(data.year, format='%Y')
data.head(3)   
    name                year                value
Afghanistan         1800-01-01 00:00:00     68
Albania             1800-01-01 00:00:00     23
Algeria             1800-01-01 00:00:00     54  

resampled = (data.groupby(['name']).apply(lambda x: x.set_index('year').resample('M').interpolate()))  

resampled.head(3)

name        year                 name  value                   
Afghanistan 1800-01-31 00:00:00  NaN    NaN
            1800-02-28 00:00:00  NaN    NaN
            1800-03-31 00:00:00  NaN    NaN

Any ideas would be a great help here.
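One likely reason the attempt above returns only NaN: resample('M') builds a month-end grid (1800-01-31, 1800-02-28, ...), while every observation sits on 1 January, so none of the original points land on the new grid and interpolate() has nothing to work from. Below is a minimal pandas sketch that sidesteps this by placing each annual value on December of its year, using monthly periods rather than timestamps. The function name `annual_to_monthly` is hypothetical, and the zero anchor in the December before the first year is an assumption, chosen only because the expected output ramps up from 0 within 1800.

```python
import calendar
import pandas as pd

# Toy data in the question's layout; the 1800 and 1801 values are copied from the question.
data = pd.DataFrame({
    "name": ["Afghanistan", "Albania", "Algeria"] * 2,
    "year": [1800] * 3 + [1801] * 3,
    "value": [68, 23, 54, 59, 38, 72],
})

def annual_to_monthly(data):
    """Hypothetical helper: linearly interpolate annual values to monthly rows."""
    frames = []
    for name, group in data.groupby("name"):
        group = group.sort_values("year")
        years = [int(y) for y in group["year"]]
        values = [float(v) for v in group["value"]]
        # ASSUMPTION: anchor the series at 0 in the December before the first
        # year, so the first year ramps up from 0 as in the expected output.
        years = [years[0] - 1] + years
        values = [0.0] + values
        # Each annual value is treated as the December observation of its year.
        december = pd.PeriodIndex(
            [pd.Period(year=y, month=12, freq="M") for y in years]
        )
        series = pd.Series(values, index=december)
        # Reindex to every month in the range, then fill the gaps linearly.
        months = pd.period_range(december[0], december[-1], freq="M")
        series = series.reindex(months).interpolate(method="linear")
        series = series.iloc[1:]  # drop the artificial anchor month
        frames.append(pd.DataFrame({
            "name": name,
            # Month labels such as "Jan 1800", as requested in the question.
            "year": [f"{calendar.month_abbr[p.month]} {p.year}" for p in series.index],
            "value": series.to_numpy(),
        }))
    return pd.concat(frames, ignore_index=True)

monthly = annual_to_monthly(data)
print(monthly.head(12))
```

Passing the full 1800 to 2040 frame instead of the toy data should work the same way, since the loop handles each name independently. If leading NaN values are preferred over the zero anchor, the two anchor lines and the iloc[1:] slice would need to be adapted.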

Here is a tidyverse approach that also requires the zoo package for the interpolation part.

library(dplyr)
library(tidyr)
library(zoo)

df <- data.frame(country = rep(c("Afghanistan", "Algeria"), each = 3),
                 year = rep(seq(1800,1802), times = 2),
                 value = rep(seq(3), times = 2),
                 stringsAsFactors = FALSE)

df2 <- df %>%
    # make a grid of all year/month combinations within the years in df
    tidyr::expand(year, month = seq(12)) %>%
    # join that to the original data frame to add back country and value
    left_join(df, by = "year") %>%
    # put the result in chronological order
    arrange(country, year, month) %>%
    # group by country so the interpolation stays within those sets
    group_by(country) %>%
    # make a version of value that is NA except for Dec, then use na.approx to replace
    # the NAs with linearly interpolated values
    mutate(value_i = ifelse(month == 12, value, NA),
           value_i = zoo::na.approx(value_i, na.rm = FALSE))

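Note that the result column value_i is NA until the first valid observation, in December of the first observed year. So this is what the tail of df2 looks like.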

> tail(df2)
# A tibble: 6 x 5
# Groups:   country [1]
   year month country value value_i
  <int> <int> <chr>   <int>   <dbl>
1  1802     7 Algeria     3    2.58
2  1802     8 Algeria     3    2.67
3  1802     9 Algeria     3    2.75
4  1802    10 Algeria     3    2.83
5  1802    11 Algeria     3    2.92
6  1802    12 Algeria     3    3 

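If you want to replace those leading NAs, you have to extrapolate; zoo's na.spline can do that in place of na.approx, though note that it fits a cubic spline rather than a straight line. If you would rather have the observations in January instead of December, and get trailing rather than leading NAs, just change the relevant bit of the penultimate line to month == 1.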
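To make the effect of the anchor month concrete, here is a small standard-library Python sketch of the same idea: each annual value is pinned to one month, the months between two observations are filled linearly, and months outside the observed range are left as None rather than extrapolated. The names `monthly_series` and `anchor_month` are hypothetical, and the sketch assumes at least two observations sorted by year.

```python
def monthly_series(annual, anchor_month):
    """annual: list of (year, value) sorted by year, at least two entries.

    Returns a list of (year, month, value_or_None) for every month of every
    observed year, with each annual value pinned to `anchor_month`.
    """
    # Position of each observation on a running month axis.
    knots = [(year * 12 + anchor_month, float(value)) for year, value in annual]
    first_year, last_year = annual[0][0], annual[-1][0]
    out = []
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            t = year * 12 + month
            value = None  # outside the observed range: a gap, not an extrapolation
            for (t0, v0), (t1, v1) in zip(knots, knots[1:]):
                if t0 <= t <= t1:
                    value = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
                    break
            out.append((year, month, value))
    return out

obs = [(1800, 1), (1801, 2), (1802, 3)]   # same toy values as df above
december = monthly_series(obs, 12)        # gaps lead: Jan to Nov 1800 are None
january = monthly_series(obs, 1)          # gaps trail: Feb to Dec 1802 are None
print(december[0], december[-1])
print(january[0], january[-1])
```

With December as the anchor, July 1802 comes out as 2 + 7/12, which agrees with the 2.58 shown in the tail of df2.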

Apart from the imputeTS package, which handles the interpolation and extrapolation, this solution uses only base R.

res <- do.call(rbind, by(dat, dat$name, function(x) {
  ## expanding years to year-months
  ex <- do.call(rbind, lapply(1:nrow(x), function(i) {
    yr <- x$year[i]
    data.frame(name=x$name[1],
               year=seq.Date(as.Date(ISOdate(yr, 1, 1)), 
                             as.Date(ISOdate(yr, 12, 31)), "month"),
               value=x$value[i])
  }))
  ## set values to NA except 01-01s
  ex[!grepl("01-01", ex$year), "value"] <- NA
  transform(ex,
            ## impute values linearly
            value=imputeTS::na_interpolation(ex$value),
            ## format dates for desired output
            year=strftime(ex$year, format="%b-%Y")
            )
}))

Result

res[c(1:3, 13:15, 133:135, 145:147, 265:268, 277:279), ]  ## sample rows
#      name     year    value
# A.1     A Jan-1800 71.00000
# A.2     A Feb-1800 73.08333
# A.3     A Mar-1800 75.16667
# A.13    A Jan-1801 96.00000
# A.14    A Feb-1801 93.75000
# A.15    A Mar-1801 91.50000
# B.1     B Jan-1800 87.00000
# B.2     B Feb-1800 83.08333
# B.3     B Mar-1800 79.16667
# B.13    B Jan-1801 40.00000
# B.14    B Feb-1801 40.50000
# B.15    B Mar-1801 41.00000
# C.1     C Jan-1800 47.00000
# C.2     C Feb-1800 49.00000
# C.3     C Mar-1800 51.00000
# C.4     C Apr-1800 53.00000
# C.13    C Jan-1801 71.00000
# C.14    C Feb-1801 72.83333
# C.15    C Mar-1801 74.66667

Data

set.seed(42)
dat <- transform(expand.grid(name=LETTERS[1:3],
                             year=1800:1810),
                 value=sample(23:120, 33, replace=TRUE))
