
How can I find periodically appearing NA values in a 3D array (along dimension time) with R

I have a time series (monthly values over several years) of spatial data (originally ncdf) in an array. If more than 2 consecutive e.g. Januaries are NA, I want to ban this pixel (now a cell in the matrix of one time step) completely from further studies by setting it to NA in all time steps.

As far as I can tell, "time.series" is only valid for vectors or matrices (a maximum of two dimensions).

One workaround I can see (but also haven't managed to implement) is: resorting the array so that the order isn't purely chronological anymore but sorted by month (Jan 2001, Jan 2002, Jan 2003, Feb 2001, Feb 2002, Feb 2003, ...). That would already help a lot. But it would still leave the case that a pixel gets banned if e.g. Jan 2002, Jan 2003 and Feb 2001 are NA (three consecutive NAs in the reordered series, but not in the same month).
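For illustration, a minimal sketch of that reordering idea (a hypothetical helper, assuming the monthly dates of the example array built below; month_first is not from the original post):

dates <- seq(as.Date("2001-01-01"), as.Date("2004-12-01"), "month")
month_first <- order(format(dates, "%m"), dates)   # Jan 2001, Jan 2002, ..., Feb 2001, ...
# reordered <- na_array[, , month_first]           # same data, grouped by calendar month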

Any help would be really appreciated. Please ask if my question is unclear - it's my first one - I tried my best.

edit: My actual dataset is a global satellite-based radiation dataset. Due to e.g. periodically appearing clouds (during the rain season, in the same month every year), those pixels should not be considered any further. I also have some other criteria which eliminate pixels. Only that one criterion is missing.

# create an array with scattered NAs
set.seed(10)
array <- replicate(48, replicate(10, rnorm(20)))
na_pixels <- array(sample(c(1, NA), size = 7200, replace = TRUE, prob = c(0.95, 0.05)), dim = c(20, 10, 48))
na_array <- array * na_pixels

dimnames(na_array) <- list(NULL, NULL, as.character(seq(as.Date("2001-01-01"), as.Date("2004-12-01"), "month")))

#I want to test several conditions that would make a pixel not usable for me
#in the end I want to retrieve a mask of usable "pixels".
#what I am doing already is: 
mask <- apply(na_array, MARGIN = c(1,2), FUN = function(x){
  # check if more than 5% of the time steps of a pixel are NA
  if (sum(is.na(x)) > (length(x)*0.05)){
    mask_val <- 0
  }
  # check if more than 5 values are missing consecutively
  else if (any(with(rle(is.na(x)), lengths[values]) > 5)){ 
    mask_val <- 0
  }
  # this is the missing part
   else if (...more than 2 Januaries or 2 Februaries or... are NA){ # check for periodically appearing NAs
     mask_val <- 0
  }
  else {
    mask_val <- 1
  }
  return(mask_val)
})

It's probably more convenient (if the necessary memory exists) to change your 3D array into a 'long' "data.frame":

as.data.frame(as.table(na_array))
#     Var1 Var2       Var3        Freq
#1       A    A 2001-01-01  0.01874617
#2       B    A 2001-01-01 -0.18425254
#3       C    A 2001-01-01 -1.37133055
#       ...........................
#9598    R    J 2004-12-01          NA
#9599    S    J 2004-12-01 -1.11411416
#9600    T    J 2004-12-01  0.01435433

Instead of relying on as.table and as.data.frame coercions, it could be done manually and more efficiently:

dat = data.frame(i = rep_len(seq_len(dim(na_array)[1]), prod(dim(na_array))), 
                 j = rep_len(rep(seq_len(dim(na_array)[2]), each = dim(na_array)[1]), prod(dim(na_array))),
                 date = rep(as.Date(dimnames(na_array)[[3]]), each = prod(dim(na_array)[1:2])) , 
                 month = rep(format(as.Date(dimnames(na_array)[[3]]), "%b"), each = prod(dim(na_array)[1:2])), 
                 isNA = c(is.na(na_array)))
dat
#      i j       date month  isNA
#1     1 1 2001-01-01   Jan FALSE
#2     2 1 2001-01-01   Jan FALSE
#3     3 1 2001-01-01   Jan FALSE
#4     4 1 2001-01-01   Jan  TRUE
#          ..............
#9597 17 10 2004-12-01   Dec FALSE
#9598 18 10 2004-12-01   Dec  TRUE
#9599 19 10 2004-12-01   Dec FALSE
#9600 20 10 2004-12-01   Dec FALSE

Where i is the row in na_array, j is the column in na_array, date is the 3rd dimension of na_array, month is the month of the date column (as it will be needed later), and isNA indicates whether the value of na_array is NA.

And building the three conditions:

cond1 = aggregate(isNA ~ i + j, dat, function(x) sum(x) > (dim(na_array)[3] * 0.05))    

(A more efficient way to create cond1 is rowSums(is.na(na_array), dims = 2) > (dim(na_array)[3] * 0.05)).
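As a quick check, that matrix-based version can be compared against the aggregate result (a small sketch; cond1_mat is a hypothetical name):

cond1_mat <- rowSums(is.na(na_array), dims = 2) > (dim(na_array)[3] * 0.05)
all(cond1_mat[cbind(cond1$i, cond1$j)] == cond1$isNA)   # should be TRUE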

cond2 = aggregate(isNA ~ i + j, dat, function(x) any(with(rle(x), lengths[values]) > 5))
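The with(rle(x), lengths[values]) idiom used in cond2 extracts the lengths of the TRUE runs, i.e. of the stretches of consecutive NAs. A tiny standalone illustration:

v <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
with(rle(v), lengths[values])
# [1] 2 3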

And to compute cond3, first find the number of missing values per "month" for each 'cell' (i.e. [i, j]) ("month" is a variable created/extracted from dimnames(na_array)[[3]] when creating the 'long' "data.frame" dat in the beginning):

NA_per_month = aggregate(isNA ~ i + j + month, dat, function(x) sum(x))

Having the number of NAs per "month" for each [i, j], we build cond3 by checking whether each [i, j] contains any "month" with more than 2 NAs:

cond3 = aggregate(isNA ~ i + j, NA_per_month, function(x) any(x > 2))

(It's trivial to replace aggregate in the above 'group-by' operations with any other available alternative.)

Perhaps we could avoid creating a 'long' "data.frame" and operate on na_array directly. For example, calculating cond1 with the rowSums version is much more efficient and straightforward. cond2, too, could be obtained with an apply on na_array. But cond3 becomes much more straightforward with a 'long' "data.frame" than with a 3D array. So, accounting for efficiency, it's usually better to try working with the structure already present in the data; if that gets cumbersome enough, we should probably change the structure of the data once and do the remaining calculations in the new scaffold.
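For illustration, a direct-on-array sketch of cond2 and cond3 (an assumption, not part of the original answer; mon, cond2_mat and cond3_mat are hypothetical names):

mon <- format(as.Date(dimnames(na_array)[[3]]), "%b")
cond2_mat <- apply(is.na(na_array), c(1, 2), function(x) {
  r <- rle(x)                                # runs of consecutive NAs along time
  any(r$lengths[r$values] > 5)
})
cond3_mat <- apply(is.na(na_array), c(1, 2), function(x)
  any(tapply(x, mon, sum) > 2))              # NAs per calendar month for one pixel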

To get the final result, allocate a "matrix" of appropriate size:

ans = matrix(NA, dim(na_array)[1], dim(na_array)[2])

and fill it in after OR-ing the conditions:

ans[cbind(cond1$i, cond1$j)] = cond1$isNA | cond2$isNA | cond3$isNA

ans
#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
# [1,]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
# [2,]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
# [4,] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
# [6,] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
# [7,] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
# [8,]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
# [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
#[10,]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
#[11,] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
#[12,]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
#[13,] FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
#[14,] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
#[15,]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
#[16,] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
#[17,]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
#[18,] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
#[19,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
#[20,]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE
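To actually ban the flagged pixels in all time steps, which is what the question asks for, the 2D mask can be recycled along the time dimension (a usage sketch, not part of the original answer; masked_array is a hypothetical name):

masked_array <- na_array
# TRUE in ans marks an unusable pixel; repeat the 20x10 mask once per time step
masked_array[rep(c(ans), times = dim(na_array)[3])] <- NA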

@alexis_laz: Yes, this works now. Unfortunately I realised that ans[cbind(cond1$i, cond1$j)] = cond1$isNA | cond2$isNA | cond3$isNA is not working. I get the error: number of items to replace is not a multiple of replacement length. I think it only takes cond1 for the replacement. (I am sorry that my example dataset gives 'FALSE' in all cases for cond2 and cond3, but it should still check the 'OR' in the code, even though the result will look the same as cond1.) I came up with the following code, which works but is definitely neither nice nor efficient, because I am not too familiar with boolean operations. Perhaps you could optimize my code or edit your line (as my real dataset is huge, I would be grateful for any optimization). In the end I need all TRUE conditions (meaning NA) to be 0 and all FALSE conditions to be 1. That's why I already did this in my code here.

ans = matrix(NA, dim(na_array)[1], dim(na_array)[2])
cond1_bool <- ans
cond1_bool[cbind(cond1$i, cond1$j)] = cond1$isNA
cond2_bool <- ans
cond2_bool[cbind(cond2$i, cond2$j)] = cond2$isNA
cond3_bool <- ans
cond3_bool[cbind(cond3$i, cond3$j)] = cond3$isNA
ans_bool <- ans
ans_bool[which(cond1_bool == T|cond2_bool == T|cond3_bool == T)] <- 0
ans_bool[which(is.na(ans_bool))] <- 1
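One possibly more compact variant (a sketch, under the assumption that the three aggregate tables cover the same (i, j) cells; merge() aligns them by (i, j), so differing row order or length between the tables cannot break the replacement):

conds <- Reduce(function(a, b) merge(a, b, by = c("i", "j")), list(cond1, cond2, cond3))
bad   <- conds$isNA.x | conds$isNA.y | conds$isNA      # pixel fails any of the three checks
ans_bool <- matrix(1, dim(na_array)[1], dim(na_array)[2])
ans_bool[cbind(conds$i, conds$j)] <- as.integer(!bad)  # 0 = banned pixel, 1 = usable pixel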
