How can I find periodically appearing NA values in a 3D array (along the time dimension) with R?
I have a time series (monthly values over several years) of spatial data (originally ncdf) in an array. If the same calendar month is NA in more than 2 consecutive years (e.g. more than 2 consecutive Januaries), I want to ban this pixel (now a cell in the matrix of one time step) completely from further studies by setting it to NA in all time steps.
As far as I know, "time.series" is only valid for vectors or matrices (at most two dimensions).
One workaround I can see (but have not managed to implement) is to re-sort the array so that the order is no longer purely chronological but grouped by month (Jan 2001, Jan 2002, Jan 2003, Feb 2001, Feb 2002, Feb 2003, ...). That would already help a lot, but it would leave the case where a pixel is wrongly set to NA because a run of NAs spans two months, e.g. when Jan 2002, Jan 2003 and Feb 2001 are NA.
Any help would be really appreciated. Please ask if my question is unclear; it's my first one and I tried my best.
Edit: My actual dataset is a global satellite-based radiation dataset. Pixels affected by e.g. periodically appearing clouds (during the rainy season, in the same month every year) should not be considered any further. I also have some other criteria that eliminate pixels; only this one criterion is missing.
# create any array with scattered NAs
set.seed (10)
array <- replicate(48, replicate(10, rnorm(20)))
na_pixels <- array((sample(c(1, NA), size = 7200, replace = TRUE, prob = c(0.95, 0.05))), dim = c(20,10,48))
na_array <- array * na_pixels
dimnames(na_array) <- list(NULL, NULL, as.character(seq(as.Date("2001-01-01"), as.Date("2004-12-01"), "month")))
# I want to test several conditions that would make a pixel not usable for me;
# in the end I want to retrieve a mask of usable "pixels".
# What I am doing already is:
mask <- apply(na_array, MARGIN = c(1, 2), FUN = function(x) {
  # check if more than 5% of a pixel's time steps are NA
  if (sum(is.na(x)) > (length(x) * 0.05)) {
    mask_val <- 0
  # check if more than 5 consecutive time steps are missing
  } else if (any(with(rle(is.na(x)), lengths[values]) > 5)) {
    mask_val <- 0
  # this is the missing part: check for periodically appearing NAs
  } else if (...more than 2 Januaries or 2 Februaries or ... are NA) {
    mask_val <- 0
  } else {
    mask_val <- 1
  }
  return(mask_val)
})
It's probably more convenient (if the necessary memory exists) to convert your 3D array into a 'long' "data.frame":
as.data.frame(as.table(na_array))
# Var1 Var2 Var3 Freq
#1 A A 2001-01-01 0.01874617
#2 B A 2001-01-01 -0.18425254
#3 C A 2001-01-01 -1.37133055
# ...........................
#9598 R J 2004-12-01 NA
#9599 S J 2004-12-01 -1.11411416
#9600 T J 2004-12-01 0.01435433
Instead of relying on the as.table and as.data.frame coercions, it could be done manually and more efficiently:
dat = data.frame(i = rep_len(seq_len(dim(na_array)[1]), prod(dim(na_array))),
j = rep_len(rep(seq_len(dim(na_array)[2]), each = dim(na_array)[1]), prod(dim(na_array))),
date = rep(as.Date(dimnames(na_array)[[3]]), each = prod(dim(na_array)[1:2])) ,
month = rep(format(as.Date(dimnames(na_array)[[3]]), "%b"), each = prod(dim(na_array)[1:2])),
isNA = c(is.na(na_array)))
dat
# i j date month isNA
#1 1 1 2001-01-01 Jan FALSE
#2 2 1 2001-01-01 Jan FALSE
#3 3 1 2001-01-01 Jan FALSE
#4 4 1 2001-01-01 Jan TRUE
# ..............
#9597 17 10 2004-12-01 Dec FALSE
#9598 18 10 2004-12-01 Dec TRUE
#9599 19 10 2004-12-01 Dec FALSE
#9600 20 10 2004-12-01 Dec FALSE
Where i: row in na_array, j: column in na_array, date: 3rd dimension of na_array, month: month of the date column (as it will be needed later), isNA: whether the value of na_array is NA.
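As a side note, the same kind of 'long' table could also be sketched with expand.grid, which varies its first argument fastest and therefore matches the order in which c() unrolls an array. The toy array small and the names long and t below are made up for illustration; the month is stored as a number here because "%b" yields locale-dependent abbreviations.

```r
# Small stand-in for na_array: 3 rows x 2 columns x 24 monthly time steps
set.seed(1)
dates <- seq(as.Date("2001-01-01"), by = "month", length.out = 24)
small <- array(rnorm(3 * 2 * 24), dim = c(3, 2, 24),
               dimnames = list(NULL, NULL, as.character(dates)))
small[2, 1, c(1, 13)] <- NA   # two missing Januaries in cell [2, 1]

# expand.grid varies its first argument fastest, which matches the
# column-major order in which c() unrolls an array
long <- expand.grid(i = seq_len(dim(small)[1]),
                    j = seq_len(dim(small)[2]),
                    t = seq_len(dim(small)[3]))
long$date  <- dates[long$t]
long$month <- as.integer(format(long$date, "%m"))  # numeric month, locale independent
long$isNA  <- c(is.na(small))
```

This is only an alternative spelling of the rep/rep_len construction; whether it is faster on a large array would need to be measured.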
And building the three conditions:
cond1 = aggregate(isNA ~ i + j, dat, function(x) sum(x) > (dim(na_array)[3] * 0.05))
(A more efficient way to create cond1 is rowSums(is.na(na_array), dims = 2) > (dim(na_array)[3] * 0.05).)
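To make the dims = 2 argument concrete, here is a minimal sketch on a toy array (x, na_count and too_many are illustrative names, and the 0.5 threshold is chosen only so the tiny example flags something):

```r
# Toy array: 2 rows x 2 columns x 4 time steps
x <- array(1:16, dim = c(2, 2, 4))
x[1, 1, 1:3] <- NA   # cell [1, 1] is missing in 3 of 4 time steps
x[2, 2, 4]   <- NA   # cell [2, 2] is missing once

# dims = 2 keeps the first two dimensions and sums over the rest,
# giving one NA count per cell
na_count <- rowSums(is.na(x), dims = 2)
na_count

# Flag cells whose number of missing time steps exceeds a threshold
# (0.5 of the series length here, only for the toy data)
too_many <- na_count > dim(x)[3] * 0.5
too_many
```

The result is already a matrix with the same row and column layout as the array, so no index bookkeeping is needed afterwards.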
cond2 = aggregate(isNA ~ i + j, dat, function(x) any(with(rle(x), lengths[values]) > 5))
And to compute cond3, first find the number of missing values per "month" for each 'cell' (i.e. [i, j]); "month" is the variable extracted from dimnames(na_array)[[3]] when creating the 'long' "data.frame" dat at the beginning:
NA_per_month = aggregate(isNA ~ i + j + month, dat, function(x) sum(x))
Having the number of NAs per "month" for each [i, j], we build cond3 by checking whether each [i, j] contains any "month" with more than 2 NAs:
cond3 = aggregate(isNA ~ i + j, NA_per_month, function(x) any(x > 2))
(It's trivial to replace aggregate in the above 'group-by' operations with any other available alternative.)
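Note that cond3 counts all NAs per calendar month, whether or not they fall in consecutive years. If "consecutive" is meant strictly, one possible variation is to replace the sum with the longest run of NAs within each month. The sketch below assumes the rows of the long table are in chronological order within each group; longest_na_run and toy are hypothetical names used only here.

```r
# Hypothetical helper: length of the longest run of TRUE in a logical vector
longest_na_run <- function(x) {
  r <- rle(x)
  if (any(r$values)) max(r$lengths[r$values]) else 0L
}

# Toy long table for a single cell over 4 years, rows in chronological order
toy <- data.frame(i = 1, j = 1,
                  year  = rep(2001:2004, each = 12),
                  month = rep(1:12, times = 4),
                  isNA  = FALSE)
# Januaries missing in 3 consecutive years
toy$isNA[toy$month == 1 & toy$year %in% 2001:2003] <- TRUE
# Februaries missing in 3 years, but not all of them adjacent
toy$isNA[toy$month == 2 & toy$year %in% c(2001, 2003, 2004)] <- TRUE

# Longest run of missing years per cell and calendar month
run_per_month <- aggregate(isNA ~ i + j + month, toy, longest_na_run)

# A cell fails if any calendar month is missing in more than 2 consecutive years
cond3_consecutive <- aggregate(isNA ~ i + j, run_per_month, function(x) any(x > 2))
```

In this toy data the Januaries produce a run of 3 and the Februaries a run of only 2, so the cell is flagged because of January alone.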
Perhaps we could avoid creating a 'long' "data.frame" and operate on na_array directly. For example, calculating cond1 with the rowSums version is much more efficient and straightforward. cond2, too, could be handled by an apply on na_array. But cond3 becomes much more straightforward with a 'long' "data.frame" than with a 3D array. So, with efficiency in mind, it is usually better to first try working with the structure already present in the data; if that gets cumbersome enough, we should probably restructure the data once and do all the calculations on the new layout.
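For completeness, here is one way the per-month count might still be done on the array itself. It is only a sketch and rests on assumptions that must hold for the real data: the series starts in January, covers whole years, and has no missing time steps. Under those assumptions the time dimension can be split into a month and a year dimension. All names below (x, isna4, na_per_month, cond3_mat) are illustrative.

```r
# Toy array: 2 x 2 cells, 3 years of monthly data starting in January
nr <- 2; nc <- 2; ny <- 3
x <- array(0, dim = c(nr, nc, 12 * ny))
x[1, 1, c(1, 13, 25)] <- NA   # every January missing in cell [1, 1]
x[2, 2, c(2, 15)]     <- NA   # one February and one March missing in cell [2, 2]

# Time step t equals (year - 1) * 12 + month, so the third dimension
# can be split into month (12) and year (ny) just by resetting dim
isna4 <- array(is.na(x), dim = c(nr, nc, 12, ny))

# NA count per cell and calendar month: sum over the year dimension
na_per_month <- rowSums(isna4, dims = 3)

# A cell fails if any calendar month has more than 2 missing values
cond3_mat <- rowSums(na_per_month > 2, dims = 2) > 0
cond3_mat
```

If the series does not start in January or has gaps, the reshape no longer lines up with calendar months and the 'long' table with an explicit month column is the safer route.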
To get the final result, allocate a "matrix" of appropriate size:
ans = matrix(NA, dim(na_array)[1], dim(na_array)[2])
and fill it in after ORing the conditions:
ans[cbind(cond1$i, cond1$j)] = cond1$isNA | cond2$isNA | cond3$isNA
ans
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
# [2,] TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
# [4,] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
# [6,] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# [7,] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
# [8,] TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
# [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[10,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
#[11,] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
#[12,] TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#[13,] FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
#[14,] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
#[15,] TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
#[16,] FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
#[17,] TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
#[18,] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE
#[19,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
#[20,] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE
@alexis_laz: Yes, this works now. Unfortunately I realised that the line
ans[cbind(cond1$i, cond1$j)] = cond1$isNA | cond2$isNA | cond3$isNA
is not working. I get the error: number of items to replace is not a multiple of replacement length. I think it only takes cond1 for the replacement. (I am sorry that my example dataset gives FALSE in all cases for cond2 and cond3, but the code should still evaluate the OR, even though the result will look the same as cond1.) I came up with the following code, which works but is definitely neither nice nor efficient, because I am not too familiar with Boolean logic. Perhaps you could optimize my code or edit your line (as my real dataset is huge, I would be grateful for any optimization). In the end I need all TRUE conditions (meaning NA) to be 0 and all FALSE conditions to be 1. That's why I already did this in my code here.
ans = matrix(NA, dim(na_array)[1], dim(na_array)[2])
cond1_bool <- ans
cond1_bool[cbind(cond1$i, cond1$j)] = cond1$isNA
cond2_bool <- ans
cond2_bool[cbind(cond2$i, cond2$j)] = cond2$isNA
cond3_bool <- ans
cond3_bool[cbind(cond3$i, cond3$j)] = cond3$isNA
ans_bool <- ans
ans_bool[which(cond1_bool == T|cond2_bool == T|cond3_bool == T)] <- 0
ans_bool[which(is.na(ans_bool))] <- 1
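One possible way to shorten this, sketched under the assumption that each condition table has the columns i, j and isNA, is to align the three tables on the cell indices with merge (so the result does not depend on row order or on all tables having the same rows) and then convert the combined logical directly to 0/1. The toy tables and the names all_cond, bad and mask are hypothetical stand-ins, not the real aggregate() output.

```r
# Toy stand-ins for the three condition tables (hypothetical values)
grid  <- expand.grid(i = 1:2, j = 1:3)
cond1 <- cbind(grid, isNA = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE))
cond2 <- cbind(grid, isNA = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))
cond3 <- cbind(grid, isNA = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE))[6:1, ]  # deliberately reordered

# Align the tables on (i, j) instead of trusting that rows line up
all_cond <- merge(merge(setNames(cond1, c("i", "j", "c1")),
                        setNames(cond2, c("i", "j", "c2"))),
                  setNames(cond3, c("i", "j", "c3")))
bad <- all_cond$c1 | all_cond$c2 | all_cond$c3

# 0 = pixel rejected by at least one condition, 1 = usable pixel
mask <- matrix(NA_integer_, nrow = 2, ncol = 3)
mask[cbind(all_cond$i, all_cond$j)] <- as.integer(!bad)
mask
```

Cells that are absent from any of the tables would stay NA in mask, which makes a mismatch visible instead of silently recycling values.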