
Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.

I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.

To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.

If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries, but that means more time to pull from the server and more time and memory to plot. When looking at larger trends, the small movements are noise; I would be better off retrieving the same number of documents (720) but only every 3rd, still covering 3 hours of results with the same resources, for a minor sacrifice in detail.

This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).

My current code collects every single document (n = 1):

db.collection(collection).find().sort({$natural:-1}).limit(limit)

I could then loop through the returned array and remove every document when:

index % n != 0

This at least saves the client from dealing with all the data, but it seems extremely inefficient and I would rather the database handle this part.

Does anyone know a method to accomplish this?

Apparently, there is no built-in solution in MongoDB for this.

The way forward would be to archive your data smartly, in fragments.

You can store your data in a collection that holds no more than a week's or a month's worth of data. A new month/week means storing your data in a different collection. That way you won't be doing a full collection scan and won't be pulling every single document, as you mentioned in your question. Your application code decides which collection to query.

If I were in your shoes, I would use a different tool, as MongoDB is more suited to general-purpose workloads. Time-series data (storing something every 5 seconds) can be handled well by a database like Cassandra, which handles frequent writes with ease, just as in your case.

Alternate fragmentation (update): always write your current data to a collection "week0", and in the background run a weekly scheduler that moves the data from "week0" into history collections "week1", "week2", and so on. The fragmentation logic depends on your requirements.

I think the $bucketAuto stage might help you with this. You can do something like:

db.collection.aggregate([
  {
    $bucketAuto: {
      groupBy: "$_id", // put the field you need here; in your example, 'temperature'
      buckets: 5 // the number of buckets to return; for a sample of 500 documents, put 500 here
    }
  }
])

Each document in the result of the above query would look something like this:

  {
    "_id": {
      "max": 3,
      "min": 1
    },
    "count": 2
  }

If you group by temperature, each document will contain the minimum and maximum temperature found in that bucket.

You might have another problem. Docs state not to rely on natural ordering:

This ordering is an internal implementation feature, and you should not rely on any particular structure within it.

You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
