简体   繁体   中英

Slice according to an irregular pattern

Below, you can find an excerpt from a time series. I would like to slice the rows set apart by the white space (the white space is for clarity's sake, there is no empty row in the data). There is a recurring pattern, ie 21 times value X, four different values ABCD, 21 times value Y, four different values EFGH, 21 times value Z etc. In this case, I'm interested in obtaining BCDY, FGHZ and so on.

The problem is that this pattern is sometimes interrupted due to missing data, making it irregular. As a result, the number of values I want to discard (the values X, Y, Z) is sometimes lower than 21. Eg the values between 2014-01-20 00:05:00 and 2014-01-20 00:11:00 could as well be missing.

I can think of looping over the series, but I prefer a vectorized approach. I would like to implement it in R, but Python or Matlab will do as well.

Any ideas? Thanks.

2014-01-20 00:00:00    197021
2014-01-20 00:01:00    197021
2014-01-20 00:02:00    197021
2014-01-20 00:03:00    197021
2014-01-20 00:04:00    197021
2014-01-20 00:05:00    197021
2014-01-20 00:06:00    197021
2014-01-20 00:07:00    197021
2014-01-20 00:08:00    197021
2014-01-20 00:09:00    197021
2014-01-20 00:10:00    197021
2014-01-20 00:11:00    197021
2014-01-20 00:12:00    197021
2014-01-20 00:13:00    197021
2014-01-20 00:14:00    197021
2014-01-20 00:15:00    197021
2014-01-20 00:16:00    196836

2014-01-20 00:17:00    196865
2014-01-20 00:18:00    196787
2014-01-20 00:19:00    196915
2014-01-20 00:20:00    196902

2014-01-20 00:21:00    196902
2014-01-20 00:22:00    196902
2014-01-20 00:23:00    196902
2014-01-20 00:24:00    196902
2014-01-20 00:25:00    196902
2014-01-20 00:26:00    196902
2014-01-20 00:27:00    196902
2014-01-20 00:28:00    196902
2014-01-20 00:29:00    196902

If I understand you aright, you want to remove all data rows where the last column is unchanged from the previous row. In Matlab, you can do this using the diff() function and logical indexing. Assuming your data is in a two-column matrix, then the expression

data([true; diff(data(:,2))~=0],:)

will return a two-column matrix with just the data meeting the requirement. You might need to check the first row separately: I'm not quite clear from your description whether you always want the first row or not. The above will always keep it. Change the true to false to always drop it.

Edit (in response to the first comment)

Replacing true with false in the above expression discards the first row. This leaves you with blocks of 5 rows, of which you want to discard the first row in each block. This can also be done with logical indexing. It's fairly simple, except that you need to protect against the case when the last block contains less than 5 rows:

pData = data([false; diff(data(:,2))~=0],:);
selector = repmat([false; true; true; true; true], ceil(size(pData, 1))/5, 1);
pData = pData(selector(1:size(pData,1)),:);

I hope this helps!

I am embarassed to explain the logic of the following code because it seems evident to me.
If any precision is needed, do ask me.

I just give this explanation:
you wrote "the number of values I want to discard (the values X, Y, Z) is sometimes lower than 21."
In my code, the hypothesis I did is that this lower number can't be inferior to 3.

The use of StringIO isn't a requirement, it is just to give me the easiness to present a code runalble as is, treating data included in the code.
The only requirement is to pass an iterable to the function slik() : it can be a generator as in the following code, or a file handler, or a list, or anything else that can be iterated.

I split the code in order to make the executing instructions appear without having to scroll down, but the two parts must be in the same coe to run it.

text = '''
2014-01-20 00:00:00    197021
2014-01-20 00:01:00    197021
2014-01-20 00:02:00    197021
2014-01-20 00:03:00    197021
2014-01-20 00:04:00    197021
2014-01-20 00:05:00    197021
2014-01-20 00:06:00    197021
2014-01-20 00:07:00    197021
2014-01-20 00:08:00    197021
2014-01-20 00:09:00    197021
2014-01-20 00:10:00    197021
2014-01-20 00:11:00    197021
2014-01-20 00:12:00    197021
2014-01-20 00:13:00    197021
2014-01-20 00:14:00    197021
2014-01-20 00:15:00    197021
2014-01-20 00:16:00    196836
2014-01-20 00:17:00    196865
2014-01-20 00:18:00    196787
2014-01-20 00:19:00    196915
2014-01-20 00:20:00    196902
2014-01-20 00:21:00    196902
2014-01-20 00:22:00    196902
2014-01-20 00:23:00    196902
2014-01-20 00:24:00    196902
2014-01-20 00:25:00    196902
2014-01-20 00:26:00    196902
2014-01-20 00:27:00    196902
2014-01-20 00:28:00    196902
2014-01-20 00:29:00    196902
'''

.

from StringIO import StringIO
rid = (x.strip() for x in StringIO(text))

def slik(INPUT):
    from collections import deque
    rows = deque( ((x for x in INPUT if x).next(),)  , maxlen=10 )
    rows.extend ( INPUT.next() for i in xrange(8) )
    val  = deque( (x.rsplit(None,1)[1] for x in rows), maxlen=10)
    for x in INPUT:
        rows.append(x)
        val.append(x.rsplit(None,1)[1])
        if val[0]==val[1]==val[2] != val[7]==val[8]==val[9]:
            for i in (3,4,5,6,7):  yield rows[i]

print '\n'.join(slik(rid))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM