简体   繁体   中英

R: Subsetting matrix based on another matrix and running rle

Sample data:

      year <- rep(1980:2015, each = 365) 
      doy <- rep(1:365, times = 36)

      set.seed(125) 
      val <- sample(0:1, size = 365*36,replace = TRUE) 
      mat <- as.matrix(cbind(year,doy,val))

mat is data with years, doy and a value column which is either 1 or 0.

I have another matrix mat1 .

      set.seed(123) 
      mat1 <- apply(matrix(sample(c(230:365), replace = TRUE, size = 4L * 36L), nrow = 36L), 2L, sort)
      mat1 <- t(apply(mat1, 1, function(x) x[order(x)]))
      head(mat1)

            [,1] [,2] [,3] [,4]
      [1,]  230  231  233  236
      [2,]  235  238  242  242
      [3,]  236  242  243  246
      [4,]  243  245  247  249
      [5,]  247  248  249  250
      [6,]  249  250  253  263

mat1 also has 36 rows each for corresponding year in mat . mat1 contains some doy in four columns for each year.

I want to subset each year from mat using mat1 . For example, year 1980 in mat is to be subsetted in three groups:

      group 1 from 230 till 231 (1st and second column of row 1 from `mat1`)
      group 2 from 232 till 233 (second column + 1 to third column of row 1 from `mat1`)
      group 3 from 234 till 236 (third column + 1 to fourth column of row 1 from `mat1`)

This will give me three vectors for 1980. I want to do an rle on each vector to find the longest consecutive occurrence of 1. Something like

       group1.rle <- rle(group1)
       group2.rle <- rle(group2)
       group3.rle <- rle(group3)

       max(group1.rle$lengths[group1.rle$values == 1])
       max(group2.rle$lengths[group2.rle$values == 1])
       max(group3.rle$lengths[group3.rle$values == 1])

and then repeat this for all the years mat .

Thank you.

This might be a little bit overdone, but I got carried away in the tidyverse (who doesn't know that feeling? ;) )

Packages

# You might as well go with library(tidyverse)
library(dplyr)
library(purrr)
library(tidyr) 

Code & Explanation

# Preparation
mat <- as.data.frame(mat)
colnames(mat1) <- c("D1", "D2", "D3", "D4")
mat1 <- cbind(year = 1980:2015, mat1)
mat1 <- as.data.frame(mat1)

I name the columns of mat1 , add the column year and turn both mat and mat1 into nice tibble s so I use left_join on them. This is important to get the right indicator columns for the right years.

mat_new <- mat %>% 
  left_join(mat1, by = "year") %>% 
  mutate(group1 = (doy >= D1 & doy <=D2), 
         group2 = (doy >= D2 + 1 & doy <=D3),  
         group3 = (doy >= D3  + 1 & doy <=D4)) 

mat_new

# A tibble: 13,140 x 10
#     year   doy   val    D1    D2    D3    D4 group1 group2 group3
#    <int> <int> <int> <int> <int> <int> <int>  <lgl>  <lgl>  <lgl>
#  1  1980     1     1   230   231   233   236  FALSE  FALSE  FALSE
#  2  1980     2     0   230   231   233   236  FALSE  FALSE  FALSE
#  3  1980     3     0   230   231   233   236  FALSE  FALSE  FALSE
#  4  1980     4     0   230   231   233   236  FALSE  FALSE  FALSE
#  5  1980     5     1   230   231   233   236  FALSE  FALSE  FALSE
#  6  1980     6     1   230   231   233   236  FALSE  FALSE  FALSE
#  7  1980     7     1   230   231   233   236  FALSE  FALSE  FALSE
#  8  1980     8     0   230   231   233   236  FALSE  FALSE  FALSE
#  9  1980     9     1   230   231   233   236  FALSE  FALSE  FALSE
# 10  1980    10     1   230   231   233   236  FALSE  FALSE  FALSE
# ... with 13,130 more rows

So first join the two tibble and then add group columns whether the doy is in the subset.

mat_new <- mat_new %>% 
  gather(group, indicator, group1:group3) %>% 
  nest(doy, val, indicator) 

mat_new

# A tibble: 108 x 7
# year    D1    D2    D3    D4  group               data
# <int> <int> <int> <int> <int>  <chr>             <list>
#  1  1980   230   231   233   236 group1 <tibble [365 x 3]>
#  2  1981   235   238   242   242 group1 <tibble [365 x 3]>
#  3  1982   236   242   243   246 group1 <tibble [365 x 3]>
#  4  1983   243   245   247   249 group1 <tibble [365 x 3]>
#  5  1984   247   248   249   250 group1 <tibble [365 x 3]>
#  6  1985   249   250   253   263 group1 <tibble [365 x 3]>
#  7  1986   250   250   255   269 group1 <tibble [365 x 3]>
#  8  1987   255   258   259   269 group1 <tibble [365 x 3]>
#  9  1988   259   259   263   274 group1 <tibble [365 x 3]>
# 10  1989   261   270   273   285 group1 <tibble [365 x 3]>
# ... with 98 more rows

Gather the group columns together and then nest the data. Now each row contains one year - group combination and the data column stores the columns doy , val and indicator for this particular combination. This makes it easier to calculate the rle in the next step for all year - group combinations.

mat_new <- mat_new %>% 
 mutate(group.rle = map(data, ~ .x %>% filter(indicator) %>% pull(val) %>% rle), 
        max.group.rle = map_dbl(group.rle, ~max(.x$lengths[.x$values == 1])))

mat_new

# A tibble: 108 x 9
#     year    D1    D2    D3    D4  group               data group.rle max.group.rle
#    <int> <int> <int> <int> <int>  <chr>             <list>    <list>         <dbl>
#  1  1980   230   231   233   236 group1 <tibble [365 x 3]> <S3: rle>             1
#  2  1981   235   238   242   242 group1 <tibble [365 x 3]> <S3: rle>             2
#  3  1982   236   242   243   246 group1 <tibble [365 x 3]> <S3: rle>             1
#  4  1983   243   245   247   249 group1 <tibble [365 x 3]> <S3: rle>             1
#  5  1984   247   248   249   250 group1 <tibble [365 x 3]> <S3: rle>          -Inf
#  6  1985   249   250   253   263 group1 <tibble [365 x 3]> <S3: rle>             1
#  7  1986   250   250   255   269 group1 <tibble [365 x 3]> <S3: rle>             1
#  8  1987   255   258   259   269 group1 <tibble [365 x 3]> <S3: rle>             2
#  9  1988   259   259   263   274 group1 <tibble [365 x 3]> <S3: rle>          -Inf
# 10  1989   261   270   273   285 group1 <tibble [365 x 3]> <S3: rle>             2
# ... with 98 more rows

With two calls to map we can get the maximum rle for each group. In the first call, each tibble in the data column is filtered by the values stored in the indicator column, then the val column is extracted (with pull ) and then finally rle is applied to these values.

In the second map call, the rle stored in the group.rle column is filtered by your conditions (only lengths of values 1) and the max is calculated. As this returns a numeric vector of length 1, I use map_dbl to directly store the result as such.

Note, this call will produce warnings, as not all groups contain the value 1 and thus the maximum has no non-missing arguments after the filtering.

mat_new  %>% 
  select(year, group, max.group.rle) %>% 
  spread(group, max.group.rle)

# A tibble: 36 x 4
#     year group1 group2 group3
# *  <int>  <dbl>  <dbl>  <dbl>
# 1   1980      1   -Inf      1
# 2   1981      2      1   -Inf
# 3   1982      1   -Inf   -Inf
# 4   1983      1   -Inf      1
# 5   1984   -Inf      1   -Inf
# 6   1985      1   -Inf      2
# 7   1986      1      3      1
# 8   1987      2   -Inf      1
# 9   1988   -Inf   -Inf      2
# 10  1989      2      1      3
# # ... with 26 more rows  

To have a better look at the results, I only select the columns year , group , max.group.rle and then use spread , to spread the groups into separate columns. Now we have one line per year with the relevant information.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM