Find longest sequence of non-NaN values but allow for a threshold

Question

Is it possible to find the non-NaN values of a vector while also allowing up to n NaNs? For example, suppose I have the following data:

X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24]; %// input array
thres = 1; % this is the number of nans to allow

I would like to keep only the longest sequence of non-NaN values, but allow up to 'n' NaNs to remain in the data. So, if I am willing to keep 1 NaN, the output would be

X_out = [8 10 11 nan 9 14 6 1 4 23 24]; %// output array

That is, the two NaNs at the beginning have been removed because they exceed the value in 'thres' above, but the third NaN is on its own and thus can be kept in the data. I would like to develop a method where thres can be set to any value.

I can find the non nan values with

Y = ~isnan(X); %// convert to zeros and ones

Any ideas?

Answer 1

In order to find the longest sequence containing at most threshold NaNs, we must find the start and the end of said sequence(s).

To generate all possible start points, we can use hankel:

H = hankel(X)

H =

    18     3   NaN   NaN     8    10    11   NaN     9    14     6     1     4    23    24
     3   NaN   NaN     8    10    11   NaN     9    14     6     1     4    23    24     0
   NaN   NaN     8    10    11   NaN     9    14     6     1     4    23    24     0     0
   NaN     8    10    11   NaN     9    14     6     1     4    23    24     0     0     0
     8    10    11   NaN     9    14     6     1     4    23    24     0     0     0     0
    10    11   NaN     9    14     6     1     4    23    24     0     0     0     0     0
    11   NaN     9    14     6     1     4    23    24     0     0     0     0     0     0
   NaN     9    14     6     1     4    23    24     0     0     0     0     0     0     0
     9    14     6     1     4    23    24     0     0     0     0     0     0     0     0
    14     6     1     4    23    24     0     0     0     0     0     0     0     0     0
     6     1     4    23    24     0     0     0     0     0     0     0     0     0     0
     1     4    23    24     0     0     0     0     0     0     0     0     0     0     0
     4    23    24     0     0     0     0     0     0     0     0     0     0     0     0
    23    24     0     0     0     0     0     0     0     0     0     0     0     0     0
    24     0     0     0     0     0     0     0     0     0     0     0     0     0     0

Now we need to find the last valid element in each row. To do so, we can use cumsum:

C = cumsum(isnan(H),2)

C =

     0     0     1     2     2     2     2     3     3     3     3     3     3     3     3
     0     1     2     2     2     2     3     3     3     3     3     3     3     3     3
     1     2     2     2     2     3     3     3     3     3     3     3     3     3     3
     1     1     1     1     2     2     2     2     2     2     2     2     2     2     2
     0     0     0     1     1     1     1     1     1     1     1     1     1     1     1
     0     0     1     1     1     1     1     1     1     1     1     1     1     1     1
     0     1     1     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0

The end point for each row is the last column where the corresponding element in C is at most threshold:

threshold = 1;

T = C<=threshold

T =

 1     1     1     0     0     0     0     0     0     0     0     0     0     0     0
 1     1     0     0     0     0     0     0     0     0     0     0     0     0     0
 1     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 1     1     1     1     0     0     0     0     0     0     0     0     0     0     0
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
 1     1     1     1     1     1     1     1     1     1     1     1     1     1     1

The last valid element is found using:

[~,idx]=sort(T,2);
lastone=idx(:,end)'

lastone =

 3     2     1     4    15    15    15    15    15    15    15    15    15    15    15
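
Because C never decreases along a row, each row of T is a block of ones followed by zeros, so counting the ones gives the same column as the sort-based index. The following is only a sketch of that alternative, not part of the original answer; the name lastone_alt is made up for illustration. It also returns 0 for a row that starts with more NaNs than allowed (which can happen with threshold = 0), where the sort-based index would point at the last column instead.

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
threshold = 1;

H = hankel(X);
C = cumsum(isnan(H), 2);   % running NaN count along each row
T = C <= threshold;        % each row: ones first, then zeros

% Number of leading ones = column of the last valid element (0 if none).
lastone_alt = sum(T, 2)';

% Sort-based version from the answer, for comparison.
[~, idx] = sort(T, 2);
lastone = idx(:, end)';

agree = isequal(lastone_alt, lastone);   % true for this input
```

For threshold >= 1 both versions should agree, since every row then has at least one valid column.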

Because hankel pads each row with zeros, which are not NaN, we must make sure that the actual length of each row is respected:

lengths = length(X):-1:1;
real_length = min(lastone,lengths);
[max_length,max_idx] = max(real_length)


max_length =

     11


max_idx =

     5

In case there are more sequences of equal maximum length, we just take the first and display it:

selected_max_idx = max_idx(1);
H(selected_max_idx, 1:max_length)


ans =

 8    10    11   NaN     9    14     6     1     4    23    24

Full script:

X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];

H = hankel(X);
C = cumsum(isnan(H),2);

threshold = 1;

T = C<=threshold;
[~,idx]=sort(T,2);
lastone=idx(:,end)';

lengths = length(X):-1:1;
real_length = min(lastone,lengths);
[max_length,max_idx] = max(real_length);

selected_max_idx = max_idx(1);
H(selected_max_idx, 1:max_length)
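
The Hankel matrix has numel(X)^2 elements, which may become expensive for long vectors. As a rough alternative (a different technique from the answer above, offered only as a sketch), a sliding window that tracks the number of NaNs it contains should find the same subsequence in a single pass. The variable names left, nan_count, best_start and best_len are illustrative.

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
threshold = 1;

isn = isnan(X);
left = 1;          % start of the current window
nan_count = 0;     % number of NaNs inside X(left:right)
best_start = 1;
best_len = 0;

for right = 1:numel(X)
    nan_count = nan_count + isn(right);
    % Shrink from the left until the window is acceptable again.
    while nan_count > threshold
        nan_count = nan_count - isn(left);
        left = left + 1;
    end
    % Strict ">" keeps the first window among equally long ones.
    if right - left + 1 > best_len
        best_len = right - left + 1;
        best_start = left;
    end
end

X_out = X(best_start:best_start + best_len - 1);
```

Like the script above, this keeps the first of several equally long candidates.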

Answer 2

Approach 1: convolution

One possible approach is to convolve Y = double(~isnan(X)); with a window of n ones, where n is decreased by one until an acceptable subsequence is found. "Acceptable" means that the subsequence contains at least n-thres ones, that is, the convolution gives at least n-thres.

Y = double(~isnan(X));
for n = numel(Y):-1:1 %// try all possible sequence lengths
    w = find(conv(Y,ones(1,n),'valid')>=n-thres); %// is there any acceptable subsequence?
    if ~isempty(w)
        break
    end
end
result = X(w:w+n-1);
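
To see what a single iteration of that loop computes, here is a small sketch for one fixed window length, using the data from the question. The names counts and acceptable are illustrative and do not appear in the answer.

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
thres = 1;
Y = double(~isnan(X));

n = 11;                                 % window length to test
counts = conv(Y, ones(1,n), 'valid');   % non-NaN count in each window of length n
% counts(k) describes X(k:k+n-1); there are numel(Y)-n+1 such windows.
acceptable = counts >= n - thres;       % at most thres NaNs in the window
w = find(acceptable);                   % start indices of acceptable windows
```

Here counts is [8 8 8 9 10], so only the window starting at index 5 contains at least 10 non-NaN values.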

Approach 2: cumulative sum

Convolving Y with a window of n ones (as in approach 1) is equivalent to computing a cumulative sum of Y and then taking differences with n spacing. This is more efficient in terms of number of operations.

Y = double(~isnan(X));
Z = cumsum(Y);
for n = numel(Y):-1:1
    w = find([Z(n) Z(n+1:end)-Z(1:end-n)]>=n-thres);
    if ~isempty(w)
        break
    end
end
result = X(w:w+n-1);
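
The claimed equivalence between the windowed convolution and the cumulative-sum differences can be checked numerically. This is a quick sketch for one arbitrary window length; via_conv, via_cumsum and same are illustrative names.

```matlab
X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24];
Y = double(~isnan(X));
Z = cumsum(Y);

n = 4;                                         % any window length from 1 to numel(Y)-1
via_conv   = conv(Y, ones(1,n), 'valid');      % windowed sums by convolution
via_cumsum = [Z(n) Z(n+1:end) - Z(1:end-n)];   % same sums from the cumulative sum
same = isequal(via_conv, via_cumsum);
```

Both vectors have numel(Y)-n+1 entries, one per window position.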

Approach 3: 2D convolution

This essentially computes all iterations of the loop in approach 1 at once.

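Y = double(~isnan(X));
N = numel(Y);
z = conv2(Y, tril(ones(N))); %// z(n,k): non-NaN count in the length-n window ending at k
ok = bsxfun(@ge, z(:,1:N), (1:N).'-thres) & triu(true(N)); %// keep only full windows (k>=n)
[nn, ww] = find(ok);
[n, ind] = max(nn);
w = ww(ind)-n+1;
result = X(w:w+n-1);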

Answer 3

Let's try my favorite tool: run-length encoding (RLE). MATLAB doesn't have a built-in function for it, so use my "seqle", posted on the MATLAB Central File Exchange. By default, seqle returns the run-length encoding. So:

>> foo = [ nan 1 2 3 nan nan 4 5 6 nan 5 5 5 ];

>> seqle(isnan(foo))
ans = 
    run: [1 3 2 3 1 3]
    val: [1 0 1 0 1 0]

The "run" indicates the length of the current run; "val" indicates the value. In this case, val==1 indicates the value is NaN and val==0 indicates numeric values. You can see it'll be relatively easy to extract the longest sequence of "run" values meeting the condition val==0 | run < 2 to get no more than one NaN in a row. Then just grab the cumulative indices of that run and that's the subset of foo you want.
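
Since seqle is not a built-in function, here is a minimal stand-in built from diff and find, in case it helps to follow along without downloading it. This is only a sketch, assuming a row-vector input; it is not the seqle implementation, and run and val simply mirror the field names shown above.

```matlab
foo = [nan 1 2 3 nan nan 4 5 6 nan 5 5 5];
v = isnan(foo);                          % logical row vector to encode

change = [true, diff(double(v)) ~= 0];   % true where a new run starts
starts = find(change);                   % first index of each run
run = diff([starts, numel(v) + 1]);      % length of each run
val = v(starts);                         % value carried by each run
```

For this input, run and val match the seqle output shown above.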

EDIT: sadly, what's trivial to find by eye may not be so easy to extract via code. I suspect there's a much faster way to use the indices identified by longrun to get the desired subsequence.

>> foo = [ nan 1 2 3 nan nan 4 5 6 nan nan 5 5 nan 5 nan 4 7 4 nan ];
>>  sfoo= seqle(isnan(foo))
sfoo = 
    run: [1 3 2 3 2 2 1 1 1 3 1]
    val: [1 0 1 0 1 0 1 0 1 0 1]
>> longrun = sfoo.run<2 | sfoo.val==0;
% longrun identifies which indices might be part of a run
>> longlong = seqle(longrun)
longlong = 
    run: [2 1 1 1 6]
    val: [1 0 1 0 1]
% longlong identifies the longest sequence of valid runs
>> lfoo = find(sfoo.run<2 |sfoo.val==0);
>> sbar = seqle(lfoo,1);
>> maxind=find(sbar.run==max(sbar.run),1,'first');
>> getlfoo = lfoo( sum(sbar.run(1:(maxind-1)))+1 ); 
% first value in longrun , which is  part of max run
% getbar finds end of run indices
>> getbar = getlfoo:(getlfoo+sbar.run(maxind)-1);
>> getsbar = sfoo.run(getbar);
% retrieve indices of input vector 
>> startit = sum(sfoo.run(1:(getbar(1)-1))) +1;
>> endit = startit+ ((sum(sfoo.run(getbar(1):getbar(end ) ) ) ) )-1;
>> therun = foo( startit:endit )
therun =
     5     5   NaN     5   NaN     4     7     4   NaN

Answer 4

Hmm, who doesn't like a challenge? My solution is not as good as ms's, but it is an alternative.

X = [18 3 nan nan 8 10 11 nan 9 14 6 1 4 23 24]; %// input array
thresh =1;
X(isnan(X))= 0 ;

for i = 1:thresh
    Y(i,:) = circshift(X',-i); %//circular shift
end

For some reason, the MATLAB transpose operator (') makes the formatting look weird.

D = X + sum(Y,1);

Discard = find(D==0)+thresh; %//give you the index of the part that needs to be discarded

chunk = find(X==0); %//Segment the Vector into segments delimited by NaNs

seriesOfZero = circshift(chunk',-1)' - chunk;

bigchunk =[1 chunk( find(seriesOfZero ~= 1)) size(X,2)]; %//Convert series of NaNs into 1 chunk

[values,DiscardChunk] = intersect(bigchunk,Discard);
DiscardChunk =  sort(DiscardChunk,'descend')

for t = 1:size(DiscardChunk,2)
  X(bigchunk(DiscardChunk(t)-1):bigchunk(DiscardChunk(t))) = []; %//Discard the data
end
X(X == 0) = NaN
%//End of Code

Output:

8 10 11 NaN 9 14 6 1 4 23 24

When:

X = [18 3 nan nan nan 8 10 11 nan nan 9 14 6 1 nan nan nan 4 23 24]; %// input array
thresh = 2;

the output is:

8 10 11 NaN 4 23 24

4 answers

solution1
8 ACCEPTED 2015-09-08 12:42:24

solution2
5 2015-09-08 13:53:37

solution3
1 2015-09-08 15:41:42

solution4
0 2015-09-08 13:31:44