I have a time series (monthly values over several years) of spatial data (originally ncdf) in an array. If a pixel (now a cell in the matrix of one time step) is NA in more than 2 consecutive occurrences of the same month (e.g. three Januaries in a row), I want to exclude that pixel completely from further analysis by setting it to NA in all time steps.
As far as I know, "time.series" is only valid for vectors or matrices (at most two dimensions).
One workaround I can see (but have not managed to implement) is to re-sort the array so that the order is no longer purely chronological but grouped by month (Jan 2001, Jan 2002, Jan 2003, Feb 2001, Feb 2002, Feb 2003, ...). That would already help a lot, but it would still wrongly flag a pixel when, e.g., Jan 2002, Jan 2003 and Feb 2001 are NA.
Any help would be really appreciated. Please ask if my question is unclear - it's my first one - I tried my best.
Edit: My actual dataset is a global satellite-based radiation dataset. Pixels affected by, e.g., periodically appearing clouds (during the rainy season, in the same month every year) should not be considered any further. I also have some other criteria that eliminate pixels; only this one criterion is missing.
# create any array with scattered NAs
set.seed (10)
array <- replicate(48, replicate(10, rnorm(20)))
na_pixels <- array((sample(c(1, NA), size = 7200, replace = TRUE, prob = c(0.95, 0.05))), dim = c(20,10,48))
na_array <- array * na_pixels
dimnames(na_array) <- list(NULL, NULL, as.character(seq(as.Date("2001-01-01"), as.Date("2004-12-01"), "month")))
#I want to test several conditions that would make a pixel not usable for me
#in the end I want to retrieve a mask of usable "pixels".
#what I am doing already is:
mask <- apply(na_array, MARGIN = c(1, 2), FUN = function(x) {
  # check if more than 5% of a pixel's time steps are NA
  if (sum(is.na(x)) > (length(x) * 0.05)) {
    mask_val <- 0
  }
  # check if more than 5 consecutive time steps are missing
  else if (max(with(rle(is.na(x)), lengths[values])) > 5) {
    mask_val <- 0
  }
  # this is the missing part
  else if (...more than 2 Januaries or 2 Februaries or... are NA) { # check for periodically appearing NAs
    mask_val <- 0
  }
  else {
    mask_val <- 1
  }
  return(mask_val)
})
It is probably more convenient (if enough memory is available) to convert your 3D array to a 'long' "data.frame":
as.data.frame(as.table(na_array))
# Var1 Var2 Var3 Freq
#1 A A 2001-01-01 0.01874617
#2 B A 2001-01-01 -0.18425254
#3 C A 2001-01-01 -1.37133055
# ...........................
#9598 R J 2004-12-01 NA
#9599 S J 2004-12-01 -1.11411416
#9600 T J 2004-12-01 0.01435433
Instead of relying on the as.table and as.data.frame coercions, this can be done manually and more efficiently:
dat = data.frame(i = rep_len(seq_len(dim(na_array)[1]), prod(dim(na_array))),
j = rep_len(rep(seq_len(dim(na_array)[2]), each = dim(na_array)[1]), prod(dim(na_array))),
date = rep(as.Date(dimnames(na_array)[[3]]), each = prod(dim(na_array)[1:2])) ,
month = rep(format(as.Date(dimnames(na_array)[[3]]), "%b"), each = prod(dim(na_array)[1:2])),
isNA = c(is.na(na_array)))
dat
# i j date month isNA
#1 1 1 2001-01-01 Jan FALSE
#2 2 1 2001-01-01 Jan FALSE
#3 3 1 2001-01-01 Jan FALSE
#4 4 1 2001-01-01 Jan TRUE
# ..............
#9597 17 10 2004-12-01 Dec FALSE
#9598 18 10 2004-12-01 Dec TRUE
#9599 19 10 2004-12-01 Dec FALSE
#9600 20 10 2004-12-01 Dec FALSE
Where i: row in na_array, j: column in na_array, date: 3rd dim of na_array, month: month of the date column (as it will be needed later), isNA: whether the value of na_array is NA.
And building the three conditions:
cond1 = aggregate(isNA ~ i + j, dat, function(x) sum(x) > (dim(na_array)[3] * 0.05))
(A more efficient way to create cond1 is rowSums(is.na(na_array), dims = 2) > (dim(na_array)[3] * 0.05).)
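As a minimal sketch of why the rowSums shortcut gives the same answer as a per-pixel count, the following uses a small toy array (arr and its size are made up here, not the question's data) and compares it with an explicit apply:

```r
# Toy data (hypothetical): 4 x 3 pixels, 24 time steps, 30 scattered NAs
set.seed(1)
arr <- array(rnorm(4 * 3 * 24), dim = c(4, 3, 24))
arr[sample(length(arr), 30)] <- NA

# dims = 2 treats the first two dimensions as "rows", so the sum runs
# over the 3rd (time) dimension for every [i, j] cell
na_count  <- rowSums(is.na(arr), dims = 2)
cond1_mat <- na_count > dim(arr)[3] * 0.05

# The same test written as an explicit loop over each pixel's series
cond1_apply <- apply(is.na(arr), c(1, 2), sum) > dim(arr)[3] * 0.05

stopifnot(all(cond1_mat == cond1_apply))
```

The result is already a logical matrix with the shape of one time step, so no index bookkeeping is needed to place it back on the grid.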
cond2 = aggregate(isNA ~ i + j, dat, function(x) any(with(rle(x), lengths[values]) > 5))
And to compute cond3, first find the number of missing values per "month" for each 'cell' (i.e. [i, j]); "month" is the variable extracted from dimnames(na_array)[[3]] when creating the 'long' "data.frame" dat at the beginning:
NA_per_month = aggregate(isNA ~ i + j + month, dat, function(x) sum(x))
Having the number of NAs per "month" for each [i, j], we build cond3 by checking whether each [i, j] contains any "month" with more than 2 NAs:
cond3 = aggregate(isNA ~ i + j, NA_per_month, function(x) any(x > 2))
(It's trivial to replace aggregate in the above 'group-by' operations with any other available grouping tool.)
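Note that cond3 counts all NAs per month, whether or not they fall in consecutive years. If the years really have to be consecutive, as the question's wording suggests, one possible sketch is to run rle within each calendar month. The helper longest_na_run and the toy series x below are hypothetical, and "%m" is used instead of "%b" only to keep the month labels locale independent:

```r
# Hypothetical helper: longest run of TRUE in a logical vector (0 if none)
longest_na_run <- function(is_na) {
  r <- rle(is_na)
  max(c(0L, r$lengths[r$values]))
}

# One pixel's monthly series, Jan 2001 .. Dec 2004 (toy data)
dates <- seq(as.Date("2001-01-01"), as.Date("2004-12-01"), by = "month")
x <- rep(1, length(dates))
x[c(1, 13, 25)] <- NA   # January missing in 2001, 2002 and 2003
x[c(2, 38)] <- NA       # February missing in 2001 and 2004 (not consecutive)

month <- format(dates, "%m")   # "01" .. "12"

# For every calendar month: longest run of consecutive years with NA
runs <- tapply(is.na(x), month, longest_na_run)

# TRUE if any month is missing in more than 2 consecutive years
periodic_na <- any(runs > 2)
```

Because the series is in chronological order, the values within each month group are ordered by year, which is what makes the run length meaningful. The same function could be passed to aggregate in place of sum(x) on the 'long' "data.frame".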
Perhaps we could avoid creating a 'long' "data.frame" and operate on na_array directly. For example, calculating cond1 with the rowSums version is much more efficient and straightforward. cond2, too, could be computed by an apply on na_array. But cond3 becomes much more straightforward with a 'long' "data.frame" than with a 3D array. So, as far as efficiency goes, it is generally better to work with the structure already present in the data; only if that gets cumbersome enough should we restructure the data once and do the remaining calculations on the new structure.
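For completeness, here is a hedged sketch of how the per-month count could be done on the array itself. It assumes that the time dimension starts in January and covers whole years, so that it can be reinterpreted as month x year; the toy array arr is hypothetical:

```r
# Toy array: 3 x 2 pixels, 36 monthly steps (3 whole years from January)
set.seed(42)
arr <- array(rnorm(3 * 2 * 36), dim = c(3, 2, 36))
arr[1, 1, c(1, 13, 25)] <- NA   # pixel [1, 1]: every January missing
arr[2, 2, c(5, 18)] <- NA       # pixel [2, 2]: two unrelated gaps

n_years <- dim(arr)[3] / 12

# View the time axis as month x year: [i, j, month, year].
# This relies on time being the last (slowest varying) dimension.
is_na4 <- array(is.na(arr), dim = c(dim(arr)[1:2], 12, n_years))

# NA count per [i, j, month], summed over the years
na_per_month <- rowSums(is_na4, dims = 3)

# TRUE where any calendar month has more than 2 NAs
cond3_mat <- rowSums(na_per_month > 2, dims = 2) > 0
```

This avoids building the 'long' "data.frame" at all, at the price of the assumptions above; with an incomplete first or last year the reshaping would silently misalign the months.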
To get the final result, allocate a "matrix" of appropriate size:
ans = matrix(NA, dim(na_array)[1], dim(na_array)[2])
and fill it in after ORing the conditions:
ans[cbind(cond1$i, cond1$j)] = cond1$isNA | cond2$isNA | cond3$isNA
ans
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
# [2,] TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
# [4,] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
# [6,] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# [7,] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
# [8,] TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
# [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[10,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
#[11,] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
#[12,] TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#[13,] FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
#[14,] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
#[15,] TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
#[16,] FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
#[17,] TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
#[18,] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE
#[19,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
#[20,] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE
@alexis_laz: Yes, this works now. Unfortunately I realised that the line ans[cbind(cond1$i, cond1$j)] = cond1$isNA | cond2$isNA | cond3$isNA is not working. I get the error: "number of items to replace is not a multiple of replacement length". I think it only takes cond1 for the replacement. (I am sorry for my example dataset, which gives 'FALSE' in all cases for cond2 and cond3; still, the code should evaluate the 'OR', even though the result will look the same as cond1.) I came up with the following code, which works but is definitely not nice or efficient because I am not too familiar with boolean operations. Perhaps you could optimize my code or edit your line (as my real dataset is huge, I would be grateful for any optimization). In the end I need all TRUE conditions (meaning NA) to be 0 and all FALSE conditions to be 1. That's why I already did this in my code here.
ans = matrix(NA, dim(na_array)[1], dim(na_array)[2])
cond1_bool <- ans
cond1_bool[cbind(cond1$i, cond1$j)] = cond1$isNA
cond2_bool <- ans
cond2_bool[cbind(cond2$i, cond2$j)] = cond2$isNA
cond3_bool <- ans
cond3_bool[cbind(cond3$i, cond3$j)] = cond3$isNA
ans_bool <- ans
ans_bool[which(cond1_bool == T|cond2_bool == T|cond3_bool == T)] <- 0
ans_bool[which(is.na(ans_bool))] <- 1
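A more compact variant of the last steps might look like the sketch below. It assumes the three conditions are already available as logical matrices of the same shape; the small matrices and the array arr are toy stand-ins, not the real data. It also shows one way to set the rejected pixels to NA in every time step, which was the original goal:

```r
# Toy stand-ins for the three condition matrices (hypothetical values)
cond1_bool <- matrix(c(TRUE, FALSE, FALSE, FALSE), 2, 2)
cond2_bool <- matrix(c(FALSE, FALSE, TRUE, FALSE), 2, 2)
cond3_bool <- matrix(FALSE, 2, 2)

# A pixel is rejected if any of the conditions holds
bad <- cond1_bool | cond2_bool | cond3_bool

# 0 = rejected, 1 = usable; unary "+" turns logical into integer
mask <- +(!bad)

# Set rejected pixels to NA in every time step of a toy 2 x 2 x 3 array:
# repeating the 2D mask once per time step lines up with the array's
# column-major layout
arr <- array(1:12, dim = c(2, 2, 3))
arr[rep(bad, times = dim(arr)[3])] <- NA
```

The elementwise | keeps the matrix shape, so no which() or separate NA filling is needed.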