5-Sep-2018

Like this? Dislike this? Let me know

I have come across several unanswered questions on stackoverflow regarding how
to do moving averages in MongoDB. In the process of answering, I discovered
some irritating consequences of using very large arrays with the `$bucket`
stage. I decided to capture the alternative, successful approach in more
detail here.

Consider a collection of docs, one per day:

```
{"n":1,  "val": 4.23, "date": new ISODate("2017-12-31") }
{"n":2,  "val": 4.23, "date": new ISODate("2018-01-01") }
{"n":3,  "val": 8.91, "date": new ISODate("2018-01-02") }
{"n":4,  "val": 0.03, "date": new ISODate("2018-01-03") }
// ... assume n and date keep increasing by 1 until March 1:
{"n":61, "val": 6.44, "date": new ISODate("2018-03-01") }
```

```
// Make a bucket array of dates:
var arr = [];
["2018-01-01","2018-01-15","2018-02-15","2018-02-28"].forEach(function(d) {
    arr.push(new ISODate(d));
});
db.collection.aggregate([
    {$bucket: {
        groupBy: "$date",
        boundaries: arr,
        default: "Other",
        output: {
            "count": { $sum: 1 },
            "avg": {$avg: "$val"}
        }
    }}
]);
```

```
{ "_id" : "Other", "count" : 3, "avg" : 63.435 }
{ "_id" : ISODate("2018-01-01T00:00:00Z"), "count" : 14, "avg" : 11.435 }
{ "_id" : ISODate("2018-01-15T00:00:00Z"), "count" : 31, "avg" : 33.95322580645161 }
{ "_id" : ISODate("2018-02-15T00:00:00Z"), "count" : 13, "avg" : 55.89153846153847 }
```

`$bucket` uses a lower-bound-inclusive, upper-bound-exclusive policy, so in the
example above, 2018-01-01 through 2018-01-14 fall into the first bucket,
2018-01-15 through 2018-02-14 into the next, and so on.

- The lower bound value becomes the `_id` of the group.
- Because we are going out 60 days, this overruns the end of February and the
  last bucket boundary of 2018-02-28; thus, the entries for 2018-02-28 and
  2018-03-01 end up in the `Other` bucket.
- 2017-12-31 is *before* the starting boundary of 2018-01-01, so it too ends up
  in the `Other` bucket, for a total of 3 items.
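The policy is easy to mimic in plain JavaScript if you want to reason about it
outside the database. Here is a minimal sketch; `bucketOf` is a made-up helper,
not a MongoDB function, and it assumes the boundaries are sorted ascending:

```javascript
// A sketch of $bucket's assignment policy: lower bound inclusive, upper bound
// exclusive, with anything outside [first, last) falling into the default bucket.
// Works for anything comparable with < / >=, so numbers and Date objects both work.
function bucketOf(boundaries, val) {
    for (var k = 0; k < boundaries.length - 1; k++) {
        if (val >= boundaries[k] && val < boundaries[k + 1]) {
            return boundaries[k];   // the lower bound becomes the group _id
        }
    }
    return "Other";
}
```

For example, with boundaries `[10, 20, 30]`, the value 10 lands in bucket 10,
25 lands in bucket 20, and both 9 and 30 fall to `"Other"` because the last
boundary is exclusive.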

To catch the outliers in real buckets instead of `Other`, pad the boundary
array with sentinel dates on each end:

```
// Let 1970-01-01 be the low date and 9999-01-01 be the high date.  We are not
// dealing with archaeological date manipulation (e.g. 100s of millions of years
// or more) here, and such use cases likely do not benefit from the datetime
// type anyway.
var arr = [];
["1970-01-01","2018-01-01","2018-01-15","2018-02-15","2018-02-28","9999-01-01"].forEach(function(d) {
    arr.push(new ISODate(d));
});
```

```
{ "_id" : ISODate("1970-01-01T00:00:00Z"), "count" : 1, "avg" : 5.5 }
{ "_id" : ISODate("2018-01-01T00:00:00Z"), "count" : 14, "avg" : 12.435 }
{ "_id" : ISODate("2018-01-15T00:00:00Z"), "count" : 31, "avg" : 34.91677419354839 }
{ "_id" : ISODate("2018-02-15T00:00:00Z"), "count" : 13, "avg" : 56.97846153846154 }
{ "_id" : ISODate("2018-02-28T00:00:00Z"), "count" : 2, "avg" : 64.435 }
```

The same approach can be used for numbers:

```
var arr = [10, 20, 30, 40];
db.foo.aggregate([
    {$bucket: {
        groupBy: "$n",   // grouping by n now, not date!
        boundaries: arr,
        default: "Other",
        output: {
            "count": { $sum: 1 },
            "avg": {$avg: "$val"}
        }
    }}
]);
```
```
{ "_id" : 10, "count" : 10, "avg" : 18.435 }
{ "_id" : 20, "count" : 10, "avg" : 28.435000000000002 }
{ "_id" : 30, "count" : 10, "avg" : 38.434999999999995 }
{ "_id" : "Other", "count" : 31, "avg" : 41.24354838709677 }
```

For numbers, `-Infinity` and `Infinity` make natural sentinel boundaries:

```
var arr = [-Infinity, 10, 20, 30, 40, Infinity];
```
```
{ "_id" : -Infinity, "count" : 9, "avg" : 8.997777777777777 }
{ "_id" : 10, "count" : 10, "avg" : 18.435 }
{ "_id" : 20, "count" : 10, "avg" : 28.435000000000002 }
{ "_id" : 30, "count" : 10, "avg" : 38.434999999999995 }
{ "_id" : 40, "count" : 22, "avg" : 54.434999999999995 }
```

Things get irritating at finer granularity, however. Suppose the data arrives
once per second and we want 5-second buckets over a whole day:

```
{ "n" : 0, "val" : 107.05756543841737, "date" : ISODate("2018-01-01T00:00:00Z") }
{ "n" : 1, "val" : 112.15474746119355, "date" : ISODate("2018-01-01T00:00:01Z") }
{ "n" : 2, "val" : 114.9389559965861, "date" : ISODate("2018-01-01T00:00:02Z") }
{ "n" : 3, "val" : 100.96869281305015, "date" : ISODate("2018-01-01T00:00:03Z") }
{ "n" : 4, "val" : 119.4575361398694, "date" : ISODate("2018-01-01T00:00:04Z") }
{ "n" : 5, "val" : 117.81224531915713, "date" : ISODate("2018-01-01T00:00:05Z") }
{ "n" : 6, "val" : 101.8992107912693, "date" : ISODate("2018-01-01T00:00:06Z") }
{ "n" : 7, "val" : 118.00930576370166, "date" : ISODate("2018-01-01T00:00:07Z") }
{ "n" : 8, "val" : 111.10144964537338, "date" : ISODate("2018-01-01T00:00:08Z") }
{ "n" : 9, "val" : 109.83909498986932, "date" : ISODate("2018-01-01T00:00:09Z") }
{ "n" : 10, "val" : 108.16311721339869, "date" : ISODate("2018-01-01T00:00:10Z") }
{ "n" : 11, "val" : 109.8721409530211, "date" : ISODate("2018-01-01T00:00:11Z") }
...
```

```
var arr = [];
var sv   = new ISODate("2018-01-01").getTime();
var endv = new ISODate("2018-01-02").getTime();
var incr = (1000 * 5);  // 5000 ms increment
while(endv > sv) {
    arr.push(new Date(sv));
    sv += incr;
}
```

A day of 5-second buckets means a 17280-element boundary array, and this is
where `$bucket` gets irritating: performance degrades badly with arrays that
large. One workaround is to skip `$bucket` and bucket client-side:

```
var arr = make array of 17280 dates;   // as above
var buckets = {};
c = db.collection.aggregate([ /* no $bucket! */ ]);
c.forEach(function(r) {
    var x = findSlot(arr, r['date']);
    if(buckets[x] == undefined) {
        buckets[x] = {lb: arr[x], ub: arr[x+1], n: 0, v: 0, a: 0};
    }
    var zz = buckets[x];
    zz['n']++;
    zz['v'] += r['val'];
    zz['a'] = zz['v'] / zz['n'];
});
```

where `findSlot` can be as simple as a linear scan:

```
function findSlotLinear(arr, val) {
    var max = arr.length - 1;
    for(var k = 0; k < max; k++) {
        if(val < arr[k+1]) {
            return k;
        }
    }
}
```
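With 17280 boundaries, a linear scan does up to ~17000 comparisons per doc.
Since the boundary array is sorted, a binary search brings that down to about
14 per doc. Here is a sketch (`findSlotBinary` is my name for it, not something
from the original post):

```javascript
// Binary-search replacement for findSlotLinear.  Returns k such that
// arr[k] <= val < arr[k+1].  Relies only on < comparisons, so it works
// for numbers and Date objects alike; assumes arr is sorted ascending
// and val is within [arr[0], arr[arr.length-1]).
function findSlotBinary(arr, val) {
    var lo = 0, hi = arr.length - 1;   // valid slots are 0 .. length-2
    while (lo < hi - 1) {
        var mid = (lo + hi) >> 1;
        if (val < arr[mid]) {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    return lo;
}
```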

But clearly, we'd rather aggregate in the database than write client-side code.

The better approach does the slotting arithmetic inside the pipeline and needs
no boundary array at all. A date is just milliseconds since the epoch, so
subtracting the start date from each doc's date and dividing by 1000 yields an
offset in seconds:

```
doc date in ms    start in ms      diff   diff/1000
1514764800000     1514764800000       0       0
1514764801000     1514764800000    1000       1
1514764802000     1514764800000    2000       2
1514764803000     1514764800000    3000       3
1514764804000     1514764800000    4000       4
1514764805000     1514764800000    5000       5
1514764806000     1514764800000    6000       6
...
```

Dividing that offset by the bucket size (5 seconds) and taking the floor turns
it into a slot number:

```
doc date in ms    start in ms      diff   diff/1000   /5    floor
1514764800000     1514764800000       0       0       0       0
1514764801000     1514764800000    1000       1       0.2     0
1514764802000     1514764800000    2000       2       0.4     0
1514764803000     1514764800000    3000       3       0.6     0
1514764804000     1514764800000    4000       4       0.8     0
1514764805000     1514764800000    5000       5       1       1
1514764806000     1514764800000    6000       6       1.2     1
...
```
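The slot arithmetic is trivial to sanity-check client-side before wiring it
into a pipeline. A minimal sketch in plain JavaScript (`slotOf` is a
hypothetical helper name, not part of any API):

```javascript
// Same arithmetic as the table above: ms diff -> seconds -> / bucketSize -> floor.
var secondsBucket = 5;
var startMs = Date.UTC(2018, 0, 1);   // 2018-01-01T00:00:00Z, in ms since epoch

function slotOf(date, startMs, secondsBucket) {
    var diffSeconds = (date.getTime() - startMs) / 1000;
    return Math.floor(diffSeconds / secondsBucket);
}
```

For example, a doc stamped 00:00:06 on 2018-01-01 lands in slot 1, matching the
floor column in the table.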

Putting the slot arithmetic into an aggregation:

```
var secondsBucket = 5;
var startDate = new ISODate("2018-01-01");
db.collection.aggregate([
    // Perform any or no $matching here to constrain the bucketing, e.g.
    // {$match: {"sym": sym, "date": {$gte: startDate, $lt: new Date(endv)}}}

    // The Juice!  Get the floor of the diff / seconds to yield a "slot".
    // Also note the use of $subtract on dates:
    {$addFields: {"slot": {$floor: {$divide: [
        {$divide: [ {$subtract: [ "$date", startDate ]}, 1000.0 ]},
        secondsBucket ]}}
    }}

    // Now just group on the slot.  We throw in n for the count but
    // it is not really necessary...
    ,{$group: {_id: "$slot", n: {$sum: 1}, avg: {$avg: "$val"}} }

    // Get it in 0-n order.  Not vital but certainly useful:
    ,{$sort: {_id: 1}}
]);
```
```
{ "_id" : 0, "n" : 5, "avg" : 110.91549956982333 }
{ "_id" : 1, "n" : 5, "avg" : 111.72435237482561 }
{ "_id" : 2, "n" : 5, "avg" : 109.16745278376398 }
...
```

The array operators in MongoDB allow for simple, powerful manipulation of subsets of arrays, which is what we want from a moving average. Consider the following daily close data:

```
{ "val" : 10, "date" : ISODate("2018-01-01T00:00:00Z") }
{ "val" : 11.43, "date" : ISODate("2018-01-02T00:00:00Z") }
{ "val" : 12.52, "date" : ISODate("2018-01-03T00:00:00Z") }
{ "val" : 12.99, "date" : ISODate("2018-01-04T00:00:00Z") }
{ "val" : 12.72, "date" : ISODate("2018-01-05T00:00:00Z") }
{ "val" : 11.79, "date" : ISODate("2018-01-06T00:00:00Z") }
{ "val" : 10.42, "date" : ISODate("2018-01-07T00:00:00Z") }
{ "val" : 8.94, "date" : ISODate("2018-01-08T00:00:00Z") }
{ "val" : 7.72, "date" : ISODate("2018-01-09T00:00:00Z") }
{ "val" : 7.06, "date" : ISODate("2018-01-10T00:00:00Z") }
{ "val" : 7.12, "date" : ISODate("2018-01-11T00:00:00Z") }
{ "val" : 7.88, "date" : ISODate("2018-01-12T00:00:00Z") }
```

Below is the solution, but **note:** our data is set up so that each doc is one
actual day apart; each doc as an *observation* (or data point) therefore
corresponds to exactly one day. The example is really a general solution for
calculating moving averages over a given number of observations in an input
set, not over days specifically. If your data has multiple datapoints per day,
you will first have to `$group` the data into the per-day "atom" that feeds the
moving-average technique shown below.

```
// Control the size of the moving average frame:
var datapts = 4;

db.collection.aggregate([

    // Filter down to what you want.  This can be anything or nothing at all:
    {$match: {"sym": "S1"}}

    // Ensure dates are going earliest to latest:
    ,{$sort: {date: 1}}

    // Turn docs into a single doc with a big vector of observations, e.g.
    //     {sym: "A", d: d1, val: 10}
    //     {sym: "A", d: d2, val: 11}
    //     {sym: "A", d: d3, val: 13}
    // becomes
    //     {_id: "A", prx: [ {v:10,d:d1}, {v:11,d:d2}, {v:13,d:d3} ] }
    //
    // This will set us up to take advantage of array processing functions!
    ,{$group: {_id: "$sym", prx: {$push: {v: "$val", d: "$date"}} }}

    // Nice additional info.  Note use of dot notation on the array to get
    // just the scalar date at elem 0, not the object {v:val, d:date}:
    ,{$addFields: {frameSize: datapts, startDate: {$arrayElemAt: [ "$prx.d", 0 ]}} }

    // The Juice!  Basically, use the map function to start at index 0 and keep
    // slicing out subsets, calcing the average, and emitting that number.
    //
    // Note that we only run the vector to (len(vector) - (datapts-1)).
    // Also, for extra info, we add the as-of date, which is the tail date
    // of the segment.
    //
    // Again we take advantage of dot notation to turn the vector of
    // objects {v:val, d:date} into two vectors of simple scalars [v1,v2,...]
    // and [d1,d2,...] with $prx.v and $prx.d.  Also, rather than create another
    // vector next to prx in the doc, we overwrite the existing one (which
    // we don't need at the end anyway) by using $addFields with the same name (prx).
    ,{$addFields: {"prx": {$map: {
        input: {$range: [0, {$subtract: [{$size: "$prx"}, (datapts-1)]}]},
        as: "z",
        in: {
            avg: {$avg: {$slice: [ "$prx.v", "$$z", datapts ]}},
            d: {$arrayElemAt: [ "$prx.d", {$add: ["$$z", (datapts-1)]} ]}
        }
    }} }}
]);
```

```
{ "_id" : "S1", "prx" : [
    { "avg" : 11.738793632512115, "d" : ISODate("2018-01-04T00:00:00Z") },
    { "avg" : 12.420766702631376, "d" : ISODate("2018-01-05T00:00:00Z") },
    { "avg" : 12.510051656756191, "d" : ISODate("2018-01-06T00:00:00Z") },
    ...
    { "avg" : 12.534681008446219, "d" : ISODate("2018-01-31T00:00:00Z") },
    { "avg" : 12.08669329998585, "d" : ISODate("2018-02-01T00:00:00Z") }
], "frameSize" : 4, "startDate" : ISODate("2018-01-01T00:00:00Z") }
```
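The `$map`/`$slice` logic above can be mirrored in a few lines of plain
JavaScript, which is a handy way to sanity-check the pipeline's output. A
sketch, assuming observations already sorted by date (`movingAvg` is my name
for the hypothetical helper):

```javascript
// For each start index z, average the window of `datapts` values and tag the
// result with the window's tail (as-of) date -- the same thing the pipeline's
// $map over $range does with $slice and $arrayElemAt.
function movingAvg(obs, datapts) {   // obs: [{v: number, d: date}, ...]
    var out = [];
    for (var z = 0; z <= obs.length - datapts; z++) {
        var sum = 0;
        for (var k = 0; k < datapts; k++) {
            sum += obs[z + k].v;
        }
        out.push({avg: sum / datapts, d: obs[z + datapts - 1].d});
    }
    return out;
}
```

As with the pipeline, an input of m observations yields m - (datapts - 1)
averages, the first one dated at observation number datapts.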
