[英]R: Subsetting matrix based on another matrix and running rle
Sample data: 样本数据:
year <- rep(1980:2015, each = 365)
doy <- rep(1:365, times = 36)
set.seed(125)
val <- sample(0:1, size = 365*36,replace = TRUE)
mat <- as.matrix(cbind(year,doy,val))
mat
is data with years, doy and a value column which is either 1 or 0. mat
是具有年,doy和值为1或0的值列的数据。
I have another matrix mat1
. 我还有另一个矩阵
mat1
。
set.seed(123)
mat1 <- apply(matrix(sample(c(230:365), replace = TRUE, size = 4L * 36L), nrow = 36L), 2L, sort)
mat1 <- t(apply(mat1, 1, function(x) x[order(x)]))
head(mat1)
[,1] [,2] [,3] [,4]
[1,] 230 231 233 236
[2,] 235 238 242 242
[3,] 236 242 243 246
[4,] 243 245 247 249
[5,] 247 248 249 250
[6,] 249 250 253 263
mat1
also has 36 rows each for corresponding year in mat
. mat1
在mat
也有对应于相应年份的36行。 mat1
contains some doy in four columns for each year. mat1
每年在四列中包含一些doy。
I want to subset each year from mat
using mat1
. 我想每年使用
mat1
从mat
mat1
集mat1
。 For example, year 1980 in mat
is to be subsetted in three groups: 例如,在
mat
上将1980年分为三类:
group 1 from 230 till 231 (1st and second column of row 1 from `mat1`)
group 2 from 232 till 233 (second column + 1 to third column of row 1 from `mat1`)
group 3 from 234 till 236 (third column + 1 to fourth column of row 1 from `mat1`)
This will give me three vectors for 1980. I want to do an rle on each vector to find the longest consecutive occurrence of 1. Something like 这将为我提供1980年的三个向量。我想对每个向量做一个rle来找出最长的连续出现1。
group1.rle <- rle(group1)
group2.rle <- rle(group2)
group3.rle <- rle(group3)
max(group1.rle$lengths[group1.rle$values == 1])
max(group2.rle$lengths[group2.rle$values == 1])
max(group3.rle$lengths[group3.rle$values == 1])
and then repeat this for all the years mat
. 然后重复多年的
mat
。
Thank you. 谢谢。
This might be a little bit overdone, but I got carried away in the tidyverse (who doesn't know that feeling? ;) ) 这可能有点过头了,但是我被tidyverse迷住了(谁不知道那种感觉?
Packages 包
# You might as well go with library(tidyverse)
library(dplyr)
library(purrr)
library(tidyr)
Code & Explanation 代码与解释
# Preparation
mat <- as.data.frame(mat)
colnames(mat1) <- c("D1", "D2", "D3", "D4")
mat1 <- cbind(year = 1980:2015, mat1)
mat1 <- as.data.frame(mat1)
I name the columns of mat1
, add the column year
and turn both mat
and mat1
into nice tibble
s so I use left_join
on them. 我的名字的列
mat1
,添加列year
,把两个mat
和mat1
到漂亮tibble
这么我使用left_join
他们。 This is important to get the right indicator columns for the right years. 这对于获得正确年份的正确指标列很重要。
mat_new <- mat %>%
left_join(mat1, by = "year") %>%
mutate(group1 = (doy >= D1 & doy <=D2),
group2 = (doy >= D2 + 1 & doy <=D3),
group3 = (doy >= D3 + 1 & doy <=D4))
mat_new
# A tibble: 13,140 x 10
# year doy val D1 D2 D3 D4 group1 group2 group3
# <int> <int> <int> <int> <int> <int> <int> <lgl> <lgl> <lgl>
# 1 1980 1 1 230 231 233 236 FALSE FALSE FALSE
# 2 1980 2 0 230 231 233 236 FALSE FALSE FALSE
# 3 1980 3 0 230 231 233 236 FALSE FALSE FALSE
# 4 1980 4 0 230 231 233 236 FALSE FALSE FALSE
# 5 1980 5 1 230 231 233 236 FALSE FALSE FALSE
# 6 1980 6 1 230 231 233 236 FALSE FALSE FALSE
# 7 1980 7 1 230 231 233 236 FALSE FALSE FALSE
# 8 1980 8 0 230 231 233 236 FALSE FALSE FALSE
# 9 1980 9 1 230 231 233 236 FALSE FALSE FALSE
# 10 1980 10 1 230 231 233 236 FALSE FALSE FALSE
# ... with 13,130 more rows
So first join the two tibble
and then add group columns whether the doy
is in the subset. 因此,首先将两个
tibble
,然后添加组列,以确认doy
是否在子集中。
mat_new <- mat_new %>%
gather(group, indicator, group1:group3) %>%
nest(doy, val, indicator)
mat_new
# A tibble: 108 x 7
# year D1 D2 D3 D4 group data
# <int> <int> <int> <int> <int> <chr> <list>
# 1 1980 230 231 233 236 group1 <tibble [365 x 3]>
# 2 1981 235 238 242 242 group1 <tibble [365 x 3]>
# 3 1982 236 242 243 246 group1 <tibble [365 x 3]>
# 4 1983 243 245 247 249 group1 <tibble [365 x 3]>
# 5 1984 247 248 249 250 group1 <tibble [365 x 3]>
# 6 1985 249 250 253 263 group1 <tibble [365 x 3]>
# 7 1986 250 250 255 269 group1 <tibble [365 x 3]>
# 8 1987 255 258 259 269 group1 <tibble [365 x 3]>
# 9 1988 259 259 263 274 group1 <tibble [365 x 3]>
# 10 1989 261 270 273 285 group1 <tibble [365 x 3]>
# ... with 98 more rows
Gather the group columns together and then nest
the data. 将组列聚集在一起,然后
nest
数据。 Now each row contains one year - group combination and the data column stores the columns doy
, val
and indicator
for this particular combination. 现在,每一行包含一年的组组合,并且data列存储此特定组合的
doy
, val
和indicator
列。 This makes it easier to calculate the rle
in the next step for all year - group combinations. 这样就可以更轻松地在下一步中为所有年份-组组合计算
rle
。
mat_new <- mat_new %>%
mutate(group.rle = map(data, ~ .x %>% filter(indicator) %>% pull(val) %>% rle),
max.group.rle = map_dbl(group.rle, ~max(.x$lengths[.x$values == 1])))
mat_new
# A tibble: 108 x 9
# year D1 D2 D3 D4 group data group.rle max.group.rle
# <int> <int> <int> <int> <int> <chr> <list> <list> <dbl>
# 1 1980 230 231 233 236 group1 <tibble [365 x 3]> <S3: rle> 1
# 2 1981 235 238 242 242 group1 <tibble [365 x 3]> <S3: rle> 2
# 3 1982 236 242 243 246 group1 <tibble [365 x 3]> <S3: rle> 1
# 4 1983 243 245 247 249 group1 <tibble [365 x 3]> <S3: rle> 1
# 5 1984 247 248 249 250 group1 <tibble [365 x 3]> <S3: rle> -Inf
# 6 1985 249 250 253 263 group1 <tibble [365 x 3]> <S3: rle> 1
# 7 1986 250 250 255 269 group1 <tibble [365 x 3]> <S3: rle> 1
# 8 1987 255 258 259 269 group1 <tibble [365 x 3]> <S3: rle> 2
# 9 1988 259 259 263 274 group1 <tibble [365 x 3]> <S3: rle> -Inf
# 10 1989 261 270 273 285 group1 <tibble [365 x 3]> <S3: rle> 2
# ... with 98 more rows
With two calls to map
we can get the maximum rle for each group. 通过两次调用
map
我们可以获得每个组的最大rle。 In the first call, each tibble
in the data
column is filtered by the values stored in the indicator
column, then the val
column is extracted (with pull
) and then finally rle
is applied to these values. 在第一个调用中,使用存储在
indicator
列中的值过滤data
列中的每个tibble
,然后提取出val
列(使用pull
),最后将rle
应用于这些值。
In the second map
call, the rle
stored in the group.rle
column is filtered by your conditions (only lengths of values 1) and the max
is calculated. 在第二个
map
呼叫时, rle
存储在group.rle
列由您的条件过滤(仅长度值1的)与max
被计算。 As this returns a numeric vector of length 1, I use map_dbl
to directly store the result as such. 由于这将返回长度为1的数字向量,因此我使用
map_dbl
直接将结果存储为此类。
Note, this call will produce warnings, as not all groups contain the value 1 and thus the maximum has no non-missing arguments after the filtering. 请注意,此调用将产生警告,因为并非所有组都包含值1,因此最大值在过滤后没有不丢失的参数。
mat_new %>%
select(year, group, max.group.rle) %>%
spread(group, max.group.rle)
# A tibble: 36 x 4
# year group1 group2 group3
# * <int> <dbl> <dbl> <dbl>
# 1 1980 1 -Inf 1
# 2 1981 2 1 -Inf
# 3 1982 1 -Inf -Inf
# 4 1983 1 -Inf 1
# 5 1984 -Inf 1 -Inf
# 6 1985 1 -Inf 2
# 7 1986 1 3 1
# 8 1987 2 -Inf 1
# 9 1988 -Inf -Inf 2
# 10 1989 2 1 3
# # ... with 26 more rows
To have a better look at the results, I only select the columns year
, group
, max.group.rle
and then use spread
, to spread the groups into separate columns. 为了更好地查看结果,我只选择了
year
, group
, max.group.rle
列,然后使用spread
来将组分布到单独的列中。 Now we have one line per year with the relevant information. 现在,我们每年只有一条线提供相关信息。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.