Resampling yearly data to monthly in R or Python

I have data recorded in the format below.

Input

name             year            value 
Afghanistan      1800            68
Albania          1800            23
Algeria          1800            54

Afghanistan      1801            59
Albania          1801            38
Algeria          1801            72

---
Afghanistan      2040            142
Albania          2040            165
Algeria          2040            120

I would like to resample all of my data, which is recorded for the years 1800 to 2040, to a monthly frequency, using exactly the format shown below.

Expected output

name             year            value 
Afghanistan      Jan 1800        5.6667  
Afghanistan      Feb 1800        11.3333    
Afghanistan      Mar 1800        17.0000    
Afghanistan      Apr 1800        22.6667    
Afghanistan      May 1800        28.3333    
Afghanistan      Jun 1800        34.0000    
Afghanistan      Jul 1800        39.6667    
Afghanistan      Aug 1800        45.3333    
Afghanistan      Sep 1800        51.0000
Afghanistan      Oct 1800        56.6667
Afghanistan      Nov 1800        62.3333
Afghanistan      Dec 1800        68.0000      
Albania          Jan 1800        1.9167
Albania          Feb 1800        3.8333
Albania          Mar 1800        5.7500
Albania          Apr 1800        7.6667
Albania          May 1800        9.5833
Albania          Jun 1800        11.5000
Albania          Jul 1800        13.4167
Albania          Aug 1800        15.3333
Albania          Sep 1800        17.2500
Albania          Oct 1800        19.1667
Albania          Nov 1800        21.0833
Albania          Dec 1800        23.0000
Algeria          Jan 1800        4.5000
Algeria          Feb 1800        9.0000
Algeria          Mar 1800        13.5000
Algeria          Apr 1800        18.0000
Algeria          May 1800        22.5000
Algeria          Jun 1800        27.0000
Algeria          Jul 1800        31.5000
Algeria          Aug 1800        36.0000
Algeria          Sep 1800        40.5000
Algeria          Oct 1800        45.0000
Algeria          Nov 1800        49.5000
Algeria          Dec 1800        54.0000

I would like my data to look like the above for all of the years, i.e. from 1800 to 2040. The value column is linearly interpolated. NB: my model accepts month names abbreviated as above.

My closest attempt is below, but it did not produce the expected result.

import pandas as pd

data['year'] = pd.to_datetime(data.year, format='%Y')
data.head(3)   
    name                year                value
Afghanistan         1800-01-01 00:00:00     68
Albania             1800-01-01 00:00:00     23
Algeria             1800-01-01 00:00:00     54  

resampled = (data.groupby(['name']).apply(lambda x: x.set_index('year').resample('M').interpolate()))  

resampled.head(3)

name        year                 name  value                   
Afghanistan 1800-01-31 00:00:00  NaN    NaN
            1800-02-28 00:00:00  NaN    NaN
            1800-03-31 00:00:00  NaN    NaN

Any suggestions would be much appreciated.
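A likely cause of the NaN values (not confirmed in the question) is that the month-end bins produced by resample('M') never coincide with the 1 January timestamps, so no observed value lands on the new index before interpolation runs. As a minimal Python sketch of the interpolation the expected output implies, the following pins each yearly value to December and interpolates linearly between those points with numpy.interp. It assumes year is still an integer column (not yet converted with pd.to_datetime), and it assumes each series starts from 0 in the December before its first year, which is what the 5.6667, 11.3333, ... ramp for 1800 suggests. The names yearly_to_monthly and MONTHS are made up for this illustration.

```python
import numpy as np
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def yearly_to_monthly(data: pd.DataFrame) -> pd.DataFrame:
    """Expand yearly rows to monthly rows with linearly interpolated values."""
    frames = []
    for name, grp in data.sort_values("year").groupby("name"):
        years = grp["year"].to_numpy(dtype=int)
        values = grp["value"].to_numpy(dtype=float)
        # Count months from year 0 and pin each yearly value to December.
        known_x = years * 12 + 11
        # ASSUMPTION: the series starts from 0 in the December before the
        # first observed year (inferred from the expected output).
        known_x = np.concatenate(([known_x[0] - 12], known_x))
        known_y = np.concatenate(([0.0], values))
        # Every month from January of the first year to December of the last.
        all_x = np.arange(years[0] * 12, years[-1] * 12 + 12)
        all_y = np.interp(all_x, known_x, known_y)
        labels = [f"{MONTHS[m % 12]} {m // 12}" for m in all_x]
        frames.append(pd.DataFrame({"name": name, "year": labels, "value": all_y}))
    return pd.concat(frames, ignore_index=True)


data = pd.DataFrame({
    "name": ["Afghanistan", "Albania", "Algeria"] * 2,
    "year": [1800] * 3 + [1801] * 3,
    "value": [68, 23, 54, 59, 38, 72],
})
monthly = yearly_to_monthly(data)
print(monthly.head(12).round(4))
```

Working in whole-month counts rather than timestamps sidesteps the month-start versus month-end alignment question entirely. If the zero starting point is not what you want, drop the extra anchor and decide separately how the first eleven months should be filled.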

Here's a tidyverse approach that also requires the zoo package for the interpolation part.

library(dplyr)
library(tidyr)
library(zoo)

df <- data.frame(country = rep(c("Afghanistan", "Algeria"), each = 3),
                 year = rep(seq(1800,1802), times = 2),
                 value = rep(seq(3), times = 2),
                 stringsAsFactors = FALSE)

df2 <- df %>%
    # make a grid of all year/month combinations within the years in df
    tidyr::expand(year, month = seq(12)) %>%
    # join that to the original data frame by year, which adds back the
    # countries and their values for every month
    left_join(df, by = "year") %>%
    # put the result in chronological order
    arrange(country, year, month) %>%
    # group by country so the interpolation stays within those sets
    group_by(country) %>%
    # make a version of value that is NA except for Dec, then use na.approx to replace
    # the NAs with linearly interpolated values
    mutate(value_i = ifelse(month == 12, value, NA),
           value_i = zoo::na.approx(value_i, na.rm = FALSE))

Note that the resulting column, value_i, is NA until the first valid observation, in December of the first observed year. So here's what the tail of df2 looks like.

> tail(df2)
# A tibble: 6 x 5
# Groups:   country [1]
   year month country value value_i
  <int> <int> <chr>   <int>   <dbl>
1  1802     7 Algeria     3    2.58
2  1802     8 Algeria     3    2.67
3  1802     9 Algeria     3    2.75
4  1802    10 Algeria     3    2.83
5  1802    11 Algeria     3    2.92
6  1802    12 Algeria     3    3 

If you want to fill those leading NAs, you'd have to extrapolate, which you can do with na.spline from zoo instead (note that it fits a cubic spline, so the filled values will no longer be strictly linear). And if you'd rather have the observed values in January instead of December, and get trailing instead of leading NAs, just change the relevant bit of the second-to-last line to month == 1.
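For readers working in Python, the same grid-then-interpolate idea can be sketched in pandas. This is only an illustration under stated assumptions: the function name interpolate_monthly and its anchor_month parameter are made up here, and limit_area="inside" is used so that, like na.approx with na.rm = FALSE, values are filled only between observations and the leading (or trailing) gaps stay missing.

```python
import numpy as np
import pandas as pd


def interpolate_monthly(df: pd.DataFrame, anchor_month: int = 12) -> pd.DataFrame:
    """Expand country/year rows to months and interpolate within each country."""
    years = np.arange(df["year"].min(), df["year"].max() + 1)
    # Grid of every country/year/month combination in the observed year range.
    grid = pd.MultiIndex.from_product(
        [df["country"].unique(), years, range(1, 13)],
        names=["country", "year", "month"],
    ).to_frame(index=False)
    out = grid.merge(df, on=["country", "year"], how="left")
    out = out.sort_values(["country", "year", "month"], ignore_index=True)
    # Keep the observed value only in the anchor month; NaN everywhere else.
    out["value_i"] = out["value"].where(out["month"] == anchor_month)
    # Interpolate per country, filling only gaps that lie between observations.
    out["value_i"] = out.groupby("country")["value_i"].transform(
        lambda s: s.interpolate(method="linear", limit_area="inside")
    )
    return out


df = pd.DataFrame({
    "country": ["Afghanistan"] * 3 + ["Algeria"] * 3,
    "year": [1800, 1801, 1802] * 2,
    "value": [1, 2, 3] * 2,
})
out = interpolate_monthly(df)               # observed values sit in December
out_jan = interpolate_monthly(df, anchor_month=1)  # observed values sit in January
print(out.tail(6))
```

With anchor_month=12 the first eleven months of each country stay NaN; with anchor_month=1 the last eleven do, mirroring the month == 12 versus month == 1 choice described above.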

Apart from the imputeTS package for inter- as well as extrapolation, I only use base R in this solution.

res <- do.call(rbind, by(dat, dat$name, function(x) {
  ## expanding years to year-months
  ex <- do.call(rbind, lapply(1:nrow(x), function(i) {
    yr <- x$year[i]
    data.frame(name=x$name[1],
               year=seq.Date(as.Date(ISOdate(yr, 1, 1)), 
                             as.Date(ISOdate(yr, 12, 31)), "month"),
               value=x$value[i])
  }))
  ## set values to NA except 01-01s
  ex[!grepl("01-01", ex$year), "value"] <- NA
  transform(ex,
            ## impute values linearly
            value=imputeTS::na_interpolation(ex$value),
            ## format dates for desired output
            year=strftime(ex$year, format="%b-%Y")
            )
}))

Result

res[c(1:3, 13:15, 133:135, 145:147, 265:268, 277:279), ]  ## sample rows
#      name     year    value
# A.1     A Jan-1800 71.00000
# A.2     A Feb-1800 73.08333
# A.3     A Mar-1800 75.16667
# A.13    A Jan-1801 96.00000
# A.14    A Feb-1801 93.75000
# A.15    A Mar-1801 91.50000
# B.1     B Jan-1800 87.00000
# B.2     B Feb-1800 83.08333
# B.3     B Mar-1800 79.16667
# B.13    B Jan-1801 40.00000
# B.14    B Feb-1801 40.50000
# B.15    B Mar-1801 41.00000
# C.1     C Jan-1800 47.00000
# C.2     C Feb-1800 49.00000
# C.3     C Mar-1800 51.00000
# C.4     C Apr-1800 53.00000
# C.13    C Jan-1801 71.00000
# C.14    C Feb-1801 72.83333
# C.15    C Mar-1801 74.66667

Data

set.seed(42)
dat <- transform(expand.grid(name=LETTERS[1:3],
                             year=1800:1810),
                 value=sample(23:120, 33, replace=TRUE))
