
How can I find periodically appearing NA values in a 3D array (along the time dimension) with R?

I have a time series (monthly values over several years) of spatial data (originally NetCDF) in an array. If a pixel (now a cell in the matrix of one time step) is NA in more than 2 consecutive occurrences of the same month, e.g. more than 2 consecutive Januaries, I want to exclude this pixel completely from further analysis by setting it to NA in all time steps.

As far as I know, "time.series" is only valid for vectors or matrices (a maximum of two dimensions).

One workaround I can see (but have not managed to implement) is to reorder the array so that it is no longer purely chronological but sorted by month (Jan 2001, Jan 2002, Jan 2003, Feb 2001, Feb 2002, Feb 2003, ...). That would already help a lot, but it would still wrongly set a pixel to NA if, e.g., Jan 2002, Jan 2003 and Feb 2001 are NA.

Any help would be really appreciated. Please ask if my question is unclear - it's my first one - I tried my best.

Edit: My actual dataset is a global satellite-based radiation dataset. Pixels with periodically appearing clouds (e.g. during the rainy season, in the same month every year) should not be considered any further. I also have some other criteria that eliminate pixels. Only this one criterion is missing.

# create any array with scattered NAs 
set.seed (10)
array <- replicate(48, replicate(10, rnorm(20)))
na_pixels <- array((sample(c(1, NA), size = 7200, replace = TRUE, prob = c(0.95, 0.05))), dim = c(20,10,48))
na_array <- array * na_pixels

dimnames(na_array) <- list(NULL, NULL, as.character(seq(as.Date("2001-01-01"), as.Date("2004-12-01"), "month")))

#I want to test several conditions that would make a pixel not usable for me
#in the end I want to retrieve a mask of usable "pixels".
#what I am doing already is: 
mask <- apply(na_array, MARGIN = c(1,2), FUN=function(x){
  #check if more than 5% of a pixel's time steps are NA
  if (sum(is.na(x)) > (length(x)*0.05)){
    mask_val <- 0
  }
  #check if more than 5 time steps are missing consecutively
  else if (max(with(rle(is.na(x)), lengths[values])) > 5){ 
    mask_val <- 0
  }
  #this is the missing part
  else if (...more than 2 januaries or 2 februaries or... are NA){#check for periodically appearing NAs
     mask_val <- 0
  }
  else {
    mask_val <- 1
  }
  return(mask_val)
}) 

It is probably more convenient (if enough memory is available) to convert your 3D array into a 'long' "data.frame":

as.data.frame(as.table(na_array))
#     Var1 Var2       Var3        Freq
#1       A    A 2001-01-01  0.01874617
#2       B    A 2001-01-01 -0.18425254
#3       C    A 2001-01-01 -1.37133055
#       ...........................
#9598    R    J 2004-12-01          NA
#9599    S    J 2004-12-01 -1.11411416
#9600    T    J 2004-12-01  0.01435433

Instead of relying on as.table and as.data.frame coercions, it could be done manually and more efficiently:

dat = data.frame(i = rep_len(seq_len(dim(na_array)[1]), prod(dim(na_array))), 
                 j = rep_len(rep(seq_len(dim(na_array)[2]), each = dim(na_array)[1]), prod(dim(na_array))),
                 date = rep(as.Date(dimnames(na_array)[[3]]), each = prod(dim(na_array)[1:2])) , 
                 month = rep(format(as.Date(dimnames(na_array)[[3]]), "%b"), each = prod(dim(na_array)[1:2])), 
                 isNA = c(is.na(na_array)))
dat
#      i j       date month  isNA
#1     1 1 2001-01-01   Jan FALSE
#2     2 1 2001-01-01   Jan FALSE
#3     3 1 2001-01-01   Jan FALSE
#4     4 1 2001-01-01   Jan  TRUE
#          ..............
#9597 17 10 2004-12-01   Dec FALSE
#9598 18 10 2004-12-01   Dec  TRUE
#9599 19 10 2004-12-01   Dec FALSE
#9600 20 10 2004-12-01   Dec FALSE

Here i is the row in na_array, j the column in na_array, date the 3rd dimension of na_array, month the month of the date column (it will be needed later), and isNA whether the value of na_array is NA.

And building the three conditions:

cond1 = aggregate(isNA ~ i + j, dat, function(x) sum(x) > (dim(na_array)[3] * 0.05))    

(A more efficient way to create cond1 is rowSums(is.na(na_array), dims = 2) > (dim(na_array)[3] * 0.05).)

cond2 = aggregate(isNA ~ i + j, dat, function(x) any(with(rle(x), lengths[values]) > 5))

To compute cond3, first find the number of missing values per "month" for each 'cell' (i.e. [i, j]); "month" is the variable extracted from dimnames(na_array)[[3]] when the 'long' "data.frame" dat was created at the beginning:

NA_per_month = aggregate(isNA ~ i + j + month, dat, function(x) sum(x))

Having the number of NAs per "month" for each [i, j], we build cond3 by checking whether each [i, j] contains any "month" with more than 2 NAs:

cond3 = aggregate(isNA ~ i + j, NA_per_month, function(x) any(x > 2))
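Note that this count flags a cell when any "month" has more than 2 NAs in total, whether or not they fall in consecutive years. If the requirement is read strictly as "more than 2 consecutive Januaries", a run-length check per "month" is one possible variant. The following is only a sketch under assumptions: it relies on the rows being in chronological order within each cell (as they are when the 'long' "data.frame" is built as above), it uses the month number ("%m") rather than the abbreviated name to stay independent of the locale, and the names longest_na_run, toy, toy_dat and cond3_consec as well as the tiny dataset are made up for illustration.

```r
# Hypothetical helper: length of the longest run of TRUE in a logical vector
longest_na_run <- function(x) {
  r <- rle(x)
  if (any(r$values)) max(r$lengths[r$values]) else 0L
}

# Made-up toy data: 1 x 2 cells, 4 years of monthly values
dates <- seq(as.Date("2001-01-01"), as.Date("2004-12-01"), "month")
toy <- array(1, dim = c(1, 2, 48))
toy[1, 1, c(1, 13, 25)] <- NA   # cell [1,1]: January NA in 2001, 2002, 2003
toy[1, 2, c(1, 13, 37)] <- NA   # cell [1,2]: January NA in 2001, 2002, 2004

# 'Long' layout, built the same way as dat above
d <- dim(toy)
toy_dat <- data.frame(
  i = rep_len(seq_len(d[1]), prod(d)),
  j = rep_len(rep(seq_len(d[2]), each = d[1]), prod(d)),
  month = rep(format(dates, "%m"), each = prod(d[1:2])),
  isNA = c(is.na(toy))
)

# Longest run of NAs per cell and month, then flag runs longer than 2
run_per_month <- aggregate(isNA ~ i + j + month, toy_dat, longest_na_run)
cond3_consec <- aggregate(isNA ~ i + j, run_per_month, function(x) any(x > 2))
cond3_consec$isNA
# cell [1,1] is TRUE (3 consecutive Januaries), cell [1,2] is FALSE (longest run is 2)
```

With the total count used for cond3 above, both toy cells would be flagged, since each has 3 missing Januaries; only the run-length variant separates them.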

(It is trivial to replace aggregate in the above 'group-by' operations with any other available tool.)

Perhaps we could avoid creating a 'long' "data.frame" and operate on na_array directly. For example, calculating cond1 with the rowSums version is much more efficient and straightforward. cond2, too, could be computed by an apply on na_array. But cond3 is much more straightforward with a 'long' "data.frame" than with a 3D array. So, with efficiency in mind, it is generally better to try working with the structure already present in the data and, only if that gets cumbersome enough, to change the structure of the data once and do the calculations in the new layout.
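For comparison, here is a rough sketch of how all three conditions might be computed on the array itself. It is not part of the approach above and rests on an assumption: the time dimension must cover complete years starting in January, so that it can be viewed as month x year. The names vals, isna, isna4, c1, c2, c3 and bad are made up here, and the example array is rebuilt in the style of the one in the question rather than copied from it.

```r
# Example array similar to the one in the question (rebuilt here, not identical)
set.seed(10)
vals <- replicate(48, replicate(10, rnorm(20)))
na_pixels <- array(sample(c(1, NA), size = 20 * 10 * 48, replace = TRUE,
                          prob = c(0.95, 0.05)), dim = c(20, 10, 48))
na_array <- vals * na_pixels

isna <- is.na(na_array)
n_time <- dim(isna)[3]

# cond1: more than 5% of the time steps are NA
c1 <- rowSums(isna, dims = 2) > n_time * 0.05

# cond2: more than 5 consecutive NAs along time
c2 <- apply(isna, c(1, 2), function(x) any(with(rle(x), lengths[values]) > 5))

# cond3: reshape time into month x year (assumes complete years from January),
# count NAs per cell and month, then flag cells where any month exceeds 2
isna4 <- array(isna, dim = c(dim(isna)[1:2], 12, n_time / 12))
na_per_month <- rowSums(isna4, dims = 3)        # rows x cols x 12 counts
c3 <- rowSums(na_per_month > 2, dims = 2) > 0

# TRUE where a cell fails at least one condition
bad <- c1 | c2 | c3
```

The reshape works because R stores arrays in column-major order, so time step t ends up at month ((t - 1) %% 12) + 1 of year ((t - 1) %/% 12) + 1. If the series starts in another month or has gaps, this shortcut no longer applies and the 'long' "data.frame" route is the safer one.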

To get the final result, allocate a "matrix" of appropriate size:

ans = matrix(NA, dim(na_array)[1], dim(na_array)[2])

and fill it in after ORing the conditions:

ans[cbind(cond1$i, cond1$j)] = cond1$isNA | cond2$isNA | cond3$isNA

ans
#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
# [1,]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
# [2,]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
# [4,] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
# [6,] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
# [7,] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
# [8,]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
# [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
#[10,]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
#[11,] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
#[12,]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
#[13,] FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
#[14,] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
#[15,]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
#[16,] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
#[17,]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
#[18,] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
#[19,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
#[20,]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE
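The question asks for a mask with 1 for usable and 0 for unusable pixels, and for unusable pixels to be set to NA in all time steps. One way this could be done from a logical matrix like ans is sketched below. The small array arr and the matrix bad are made-up stand-ins for na_array and ans, and mask and arr_masked are names chosen only for this illustration.

```r
# Made-up stand-ins: a 2 x 2 x 3 array and a logical matrix shaped like `ans`
arr <- array(as.numeric(1:12), dim = c(2, 2, 3))
bad <- matrix(c(TRUE, FALSE, FALSE, TRUE), 2, 2)

# 1 = usable pixel, 0 = unusable pixel
mask <- ifelse(bad, 0, 1)

# Repeat the 2D flags once per time step so they line up with the 3D array,
# then set the flagged pixels to NA in every time step
arr_masked <- arr
arr_masked[rep(bad, dim(arr)[3])] <- NA
```

Since the first two dimensions vary fastest in R's storage order, repeating the matrix once per time step lines the flags up with every slice of the array.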

@alexis_laz: Yes, this works now. Unfortunately I realised that ans[cbind(cond1$i, cond1$j)] = cond1$isNA | cond2$isNA | cond3$isNA is not working. I get the error: number of items to replace is not a multiple of replacement length. I think it only takes cond1 for the replacement. (I am sorry that my example dataset gives FALSE in all cases for cond2 and cond3; still, the code should evaluate the OR, even though the result will look the same as cond1.) I came up with the following code, which works but is definitely not nice or efficient, because I am not too familiar with boolean operations. Perhaps you could optimize my code or edit your line (as my real dataset is huge, I would be grateful for any optimization). In the end I need all TRUE conditions (meaning NA) to be 0 and all FALSE conditions to be 1. That's why I already did this in my code here.

ans = matrix(NA, dim(na_array)[1], dim(na_array)[2])
cond1_bool <- ans
cond1_bool[cbind(cond1$i, cond1$j)] = cond1$isNA
cond2_bool <- ans
cond2_bool[cbind(cond2$i, cond2$j)] = cond2$isNA
cond3_bool <- ans
cond3_bool[cbind(cond3$i, cond3$j)] = cond3$isNA
ans_bool <- ans
ans_bool[which(cond1_bool == T|cond2_bool == T|cond3_bool == T)] <- 0
ans_bool[which(is.na(ans_bool))] <- 1
