
What's the best way to return a sample of data over a period?

Let's say I have a collection with 3600 documents (one per second for the last hour), and each document has two fields: timestamp and value.

What is the best (read: most performant) method to select a sample of this data, say, 12 documents spaced five minutes apart? Or 60 documents, one per minute?

In reality, this collection will have tens of millions of records, and the query will be run quite often, so performance really is key. With an index on the two fields, a query filtering by timestamp > {one hour ago} is relatively quick on a collection with 200,000 records.

This post has been superseded by Aggregating averages from large datasets for number of steps over period of time in ArangoDB.

I would go about it like this:

FOR doc IN Samples
FILTER doc.timestamp > @start AND doc.timestamp < @end
FILTER FLOOR(doc.timestamp/1000) % 300 == 0
RETURN doc

The timestamp is assumed to be a millisecond-based Unix timestamp, like the value returned by the DATE_NOW() function.

Here @start is the start timestamp of the period and @end is the end of the period.

The above returns the first document of each 5-minute time slice in the period, assuming one document per second. The slices are aligned to multiples of 300 seconds since the Unix epoch, not to @start. If you want one document per minute, change the 300 to 60 in the formula. You can also change the 0 to some other value X if you want the document that is X seconds from the beginning of each time slice rather than the first one.
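To make the arithmetic concrete, here is a minimal Python sketch that mimics the second FILTER line on made-up data. It is only an illustration, not ArangoDB code: the start timestamp, the `docs` list and the `sample` helper are all assumptions for this example.

```python
# Hypothetical data: one document per second for one hour,
# with millisecond-based Unix timestamps (the start value is arbitrary).
START_MS = 1_700_000_123_000
docs = [{"timestamp": START_MS + i * 1000, "value": i} for i in range(3600)]


def sample(docs, slice_seconds, offset_seconds=0):
    """Mimic: FILTER FLOOR(doc.timestamp / 1000) % slice_seconds == offset_seconds."""
    return [
        d for d in docs
        # Integer division by 1000 plays the role of FLOOR(timestamp / 1000).
        if (d["timestamp"] // 1000) % slice_seconds == offset_seconds
    ]


five_minute = sample(docs, 300)  # one document per 5-minute slice
one_minute = sample(docs, 60)    # one document per minute

print(len(five_minute), len(one_minute))
```

With one document per second, an hour of data yields 12 documents for the 300-second slice and 60 for the 60-second slice, and consecutive results are exactly 300,000 ms (or 60,000 ms) apart. If a second has no document, that slice is simply missing from the result; if a second has several documents, all of them are returned.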

One thing that can help improve the speed is storing the timestamp as a second-based Unix timestamp, because then the formula becomes simpler: doc.timestamp % 300 == 0, which requires fewer calculations per document.

And as mentioned in the comments, use a persistent index on the timestamp, which will significantly speed up the first filter line.

The short answer to this is:

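LET steps = 24
LET stepsRange = 0..23 // one entry per step: 0 to steps - 1
LET diff = @end - @start
LET interval = diff / steps

// filteredObservations is not defined in this snippet; see the linked answer below.
FOR step IN stepsRange
RETURN FIRST(
    // Bounds of the current step; @start is the same bind parameter as above
    LET stepStart = @start + (interval * step)
    LET stepEnd = stepStart + interval

    RETURN FIRST(
        FOR f IN filteredObservations
        FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
        COLLECT AGGREGATE temperature = AVG(f.temperature)
        RETURN temperature
    )
)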

For more details, see the answer to my question which superseded this one: https://stackoverflow.com/a/72886996/1138620
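As a rough illustration of what the query above computes, here is a small Python sketch of the same step-averaging logic. It is not the linked answer's implementation: the `step_averages` function and the sample observations are hypothetical, and empty steps are assumed to yield None, in the way AVG over no rows yields null.

```python
def step_averages(observations, start, end, steps):
    """Average `temperature` per step between start and end (both in ms)."""
    interval = (end - start) / steps
    averages = []
    for step in range(steps):
        step_start = start + interval * step
        step_end = step_start + interval
        # Both bounds are inclusive, as in the AQL FILTER, so an observation
        # that falls exactly on a boundary is counted in two adjacent steps.
        temps = [
            o["temperature"] for o in observations
            if step_start <= o["timestamp"] <= step_end
        ]
        averages.append(sum(temps) / len(temps) if temps else None)
    return averages


# Hypothetical data: one observation per second for an hour, temperature == index.
observations = [{"timestamp": i * 1000, "temperature": i} for i in range(3600)]
averages = step_averages(observations, start=0, end=3_600_000, steps=12)
print(averages[:2])
```

Each of the 12 steps covers 300,000 ms here, so the first step averages the observations at seconds 0 through 300 and the second those at seconds 300 through 600.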
