Working with arrays in MongoDB

Updated 10-Jan-2024
Original post 15-Dec-2017
Like this? Dislike this? Let me know

A collection (pun!) of snippets that I find useful. Some of these are "easier" to deal with if you simply $unwind the arrays but unwinding big arrays in big objects can become very expensive. In general, be kind: don't unwind. Use $map,$reduce, and $filter whenever you can!

  1. Concatenating two or more arrays to produce one larger array
  2. Getting a set of unique entries in one or more arrays
  3. Finding where in an array of objects a field has a target value
  4. Deep filtering arrays of arrays of arrays of....

Concatenating two or more arrays to produce one larger array

Very often you find yourself with an array of arrays:
{"items": [ [0, 1, 2], [0], [2,5] ]}
Here is a means to concatenate them
db.foo.aggregate([
// This is unfortunately not as straightforward as simply doing
//  {newList: {$concatArrays: "$items"}}
// because $items is a single array of arrays, NOT a vararg list of
// arrays that is expected in $concatArrays.  In other words,
//   {$concatArrays: "$items"}}
// is not the same thing as
//   {$concatArrays: [ arr1, arr2, arr3 ]}
// from a call signature perspective.
// 
// Solution:  We will "walk" $items with $reduce and call $concatArrays over and
// over to build the list.  We use $items as the input and overwrite it for output.
// This is a clever use of $reduce by treating an array as the output value instead of a simple scalar:
{$project: {items: {$reduce: {
		     input: "$items",
		     initialValue:[],
                     in:{$concatArrays: [ "$$value", "$$this"]}
                    }}
            }}
]);

{ "_id" : 0, "items" : [ 0, 1, 2, 0, 2, 5 ] }

Getting a set of unique entries in one or more arrays

Essentially, the same as above but taking the resulting array and passing it to $setUnion:
db.foo.aggregate([
    {$project: { items: {"$setUnion": [
		     {$reduce: {
			 input: "$items",
			 initialValue:[],
			 in:{$concatArrays: [ "$$value", "$$this"]}
		     }} ] }
		}}

]);

{ "_id" : 0, "items" : [ 0, 1, 2, 5 ] }

Finding where in an array of objects a field has a target value

Given a doc like this:
{_id:0, items: [ {a:0}, {a:0}, {b:7}, {a:1},{b:0},{b:7} ] }
At what index offsets does b = 7 (for example?). Challenges:
  1. Can't use $filter. You will lose the positional info.
  2. Can't use the dot operator $items.b to turn the array of objects into an array of just element b because those objects that do not contain b are not pushed to the output array which again messes up proper positioning and indexing. Don't understand? Try this:
    db.foo.aggregate([{$project: {WW: "$items.b"}}]);
    and look at the output array:
    {_id: 0, "WW" : [ 7, 0, 7 ] }
The solution is effectively this:
    matches = [];
    for(i = 0; i < items.size(); i++) {
	   if(items[i]['b'] == 7) {
	      matches.push(i);
           }
    }		   
db.foo.aggregate([
	{$project: {
	    XX: {$reduce: {
	    // for $$this in range(0, {$size:"$items"})
	    input: {$range:[0,{$size:"$items"}]},
	    initialValue: [],  // Clever use of array as value.
	    in: {$cond: {
		    // Use getField as of v5.0:
		    if: {$eq:[7, {$getField: {input: {$arrayElemAt:['$items','$$this']}, field: 'b'}} ]},
		    then: {$concatArrays: [ '$$value', ['$$this'] ]},
		    else: '$$value' // nothing to do; return $$value unchanged
		}}
	    }}
	}}
]);

{ XX: [ 2, 5 ] }

Deep filtering arrays of arrays of arrays of....

Given a doc like this:
    {_id:0,
	 lvl1: [
             {a:1, lvl2: [
		 {b:1, lvl3: [
		     {c:1, z:'A'},
		     {c:2, z:'B'},
		     {c:3, z:'C'},
		     {c:4, z:'D'}
		 ]
		 },
		 {b:2, lvl3: [
		     {c:1, z:'W'},
		     {c:2, z:'X'},
		     {c:3, z:'Y'},
		     {c:4, z:'Z'}
		 ]
		 }
	     ]},
             {a:2, lvl2: [
		 {b:3, lvl3: [
		     {c:0, z:'A'},
		     {c:0, z:'B'},
		     {c:0, z:'C'},
		     {c:4, z:'D'}
		 ]
		 },
		 {b:4, lvl3: [
		     {c:4, z:'W'},
		     {c:3, z:'X'},
		     {c:2, z:'Y'},
		     {c:1, z:'Z'}
		 ]
		 }
	     ]}
	 ]
	}
How might we filter this to keep only lvl3.c > 2?

There is of course the initial urge to flatten everything out with `$unwind`:

db.foo.aggregate([
    {$unwind: '$lvl1'},
    {$unwind: '$lvl1.lvl2'},
    {$unwind: '$lvl1.lvl2.lvl3'},
    {$match: {'lvl1.lvl2.lvl3.c':{$gt:2}}},
    // ... now put it all back together with $group....
    ]);
but except for very small collections with small arrays at all 3 levels, this approach can potentially have very poor performance exploding a single doc into len(lvl1) x len(lvl2) x len(lvl3) docs. $unwind has better utility when we want to "expose" fields for regrouping; here, we are trying to filter not regroup.

A better approach is to have a cascade of $reduce operators to "dig through" the hierarchy and perform a $filter at the end. Conceptually, we want to do this (pseudocode):

    for l1 in lvl:
      for l2 in l1.lvl2:
        l2['lvl3'] = filter(l2['lvl3'], "c > 2")
It is understandably more complex in MQL but there a few key concepts that if kept in mind will make some of the instructions come together as a functional unit.
  1. We don't edit arrays "in place" e.g. arr[4] = {...} when processing; instead we always rebuild them, applying logic as we go.
  2. There is no such thing as object['key'] = value in MQL. Instead, we use the $mergeObjects function:
    {$mergeObjects: [ object, {key: value} ]}
    {$mergeObjects: [ object, {key1:v1, key2:v2, ...} ]}
    In other words, we overlay new key-value pairs on top of the old, either overwriting old values or adding new ones.
  3. When $reduce is used with an array input and an initialValue of an empty array ([]) with the appropriate logic, it "emulates" a for loop. $concatArrays plays an important role here and don't forget that $concatArrays requires all args to be arrays; you cannot append an object
  4. Always think of functional composition in MQL i.e. f(g(h())). To best understand the logic, start with the innermost function and work your way back out.
Here is the solution. It is deliberately spread out to add comments. We also add start/close tags to aid in understanding the blocks of logic but most editors will automatically highlight these regions so }} ] }]] } is not as intimidating as it looks.

db.foo.aggregate([
	{$addFields: { // $addFields to overwrite lvl1; use $project if you wishh

	    lvl1: {$reduce: {   // set lvl1...  open ff
		input: '$lvl1', // ...from lvl, i.e. overwrite

    		initialValue: [], // prep for array loop

    /* The basic loop is $concatArrays: [ $$value, [ one object ] ]
    Since $reduce is passing us one object at time as $$this, almost always
    we will see the second arg wrapped with [] to create an array of one. */
		in: {$concatArrays: [   // start ee
    		    '$$value',

    /* Our first key-value overlay stage.  But the overlay itself is a function;
    remember composition! */
		    [ {$mergeObjects: [   // start dd
			'$$this',
			{lvl2: {$reduce: {   // start cc
    			    input: '$$this.lvl2', // setup lvl2 for overwrite
    
    			    initialValue: [],

    // Second loop on lvl2:
			    in: {$concatArrays: [  // start bb
    				'$$value',

    /* Our second key-value overlay stage.  This time, the overlay does not
    set up a $reduce "loop" but instead performs the filter we seek: */
				[ {$mergeObjects: [
				    '$$this',
				    {lvl3: {$filter: {
					input: '$$this.lvl3',
					as: 'qq',
					cond:{$gt:['$$qq.c',2]}
				      }}
				    }
				]}
				] // list of 1

			    ]} // close bb
			}} 
    		      } // close cc
    
		    ]}]  // close dd incl array-of-one wrapper
    		]}  // close ee

	    }}  // close ff
	}}  // close addFields
    ]);
We show the more general approach using $mergeObjects because although it does tend to make things look complicated (esp. the array-of-one business...) it provides a template for adding -- including conditionally -- 2 or more fields. If the logic does not seek to automatically carry over all fields present in $$this and needs to explicitly assign a certain set of fields, this can be done directly without $mergeObjects:
Automatic carryover:
    {$concatArrays: [ '$$value', [ {$mergeObjects: [ '$$this', [ expr resolves to object ] ] }] ]}

Selective carryover:
    {$concatArrays: [ '$$value', [ expr resolves to object ]]}

    e.g. in the following, fields in $$this are dropped and only
    f1 and f2 are carried through; note if f1 and/or
    f2 did exist in $$this, they would be overwritten:

    {$concatArrays: [ '$$value', [ {f1:v1,f2:v2} ] ]}

Here is a more step-by-step way to approach the solution above:

  1. Start with a basic $addFields expression:
    db.foo.aggregate([
        {$addFields: {
            lvl1: '$lvl1' // noop
        }}
    ])
          
  2. Use the following snippet as a template loop. In fact, not a bad idea to just keep it in the code somewhere commented out for quick retrieval.
    {$reduce: {input:'$ARR',initialValue:[],in:{$concatArrays:['$$value',[{$mergeObjects:['$$this', {} ]}]]}}}
          
  3. Replace the value with the loop template, paying attention to properly substituting $ARR for the name of your actual field. If you run this there should be no change; that's good.
    db.foo.aggregate([
        {$addFields: {
            lvl1: {$reduce: {input:'$lvl1',initialValue:[],in:{$concatArrays:['$$value',[{$mergeObjects:['$$this', {} ]}]]}}}
        }}
    ])
          
  4. The {} as second argument to $mergeObjects is also a noop. Let's replace that with our actual array field but make sure you use $$this.lvl2 to use the document coming in from $reduce. Using just $lvl2 will use the value from the top-level document which is incorrect! For formatting and clarity purposes, we are putting the replacement on a new line. Again, running this should produce the same output:
    db.foo.aggregate([
        {$addFields: {
            lvl1: {$reduce: {input:'$lvl1',initialValue:[],in:{$concatArrays:['$$value',[{$mergeObjects:['$$this',
              {lvl2: '$$this.lvl2'}
            ]}]]}}}
        }}
    ])
          
  5. Replace '$$this.lvl2' with the template and replace $ARR. Also be mindful of the closing brace; we are substituting the template for '$$this.lvl2'; we must retain the closing brace:
    db.foo.aggregate([
        {$addFields: {
            lvl1: {$reduce: {input:'$lvl1',initialValue:[],in:{$concatArrays:['$$value',[{$mergeObjects:['$$this',
              {lvl2: {$reduce: {input:'$$this.lvl2',initialValue:[],in:{$concatArrays:['$$value',[{$mergeObjects:['$$this', {} ]}]]}}} }
            ]}]]}}}
        }}
    ])
          
  6. Replace the noop with the target array and verify, again, that output is the same:
    db.foo.aggregate([
        {$addFields: {
            lvl1: {$reduce: {input:'$lvl1',initialValue:[],in:{$concatArrays:['$$value',[{$mergeObjects:['$$this',
              {lvl2: {$reduce: {input:'$$this.lvl2',initialValue:[],in:{$concatArrays:['$$value',[{$mergeObjects:['$$this',
                {lvl3: '$$this.lvl3'}
            ]}]]}}} }
            ]}]]}}}
        }}
    ]);
          
  7. Replace '$$this.lvl3' with a $filter:
    db.foo.aggregate([
        {$addFields: {
            lvl1: {$reduce: {input:'$lvl1',initialValue:[],in:{$concatArrays:['$$value',[{$mergeObjects:['$$this',
              {lvl2: {$reduce: {input:'$$this.lvl2',initialValue:[],in:{$concatArrays:['$$value',[{$mergeObjects:['$$this',
                {lvl3: {$filter: {
    			input: '$$this.lvl3',
    			as: 'qq',
    			cond:{$gt:['$$qq.c',2]}
    		      }}  }
            ]}]]}}} }
            ]}]]}}}
        }}
    ]);
          
Success! With the exception of the closing braces and parens, it becomes a fairly straightforward expression. Let your editor get the closings correct.

Here is a fancy extension of the pipeline above that finds the highest value of c in the lvl3 array for each lvl2:
db.foo.aggregate([
	{$addFields: {
	    _id:false,
	    lvl1: {$reduce:
		   {input:'$lvl1',
		    initialValue:[],
		    in:{$concatArrays:['$$value',[
			{$mergeObjects:[
			    '$$this',

			    // We want to be able to not only get c > 2 but find the
			    // highest value of c at each lvl2.  For this we need to "hang on"
			    // to the filtered list so that we can assign it AND then traverse
			    // it to find the max c.  Remember you cannot use a variable you
			    // set in an object in the same "pass", e.g.
			    //   {
			    //     a: some list,
			    //     b: {$size: '$a'}  // nope
			    //   }
			    //
			    // but you can use $let to set up a value in vars then use it more
			    // than once in 'in'.
			    //
			    {$let: {
				vars: {tmp_l2: {$reduce: {input:'$$this.lvl2',initialValue:[],in:{$concatArrays:['$$value',[{$mergeObjects:['$$this',
				{lvl3: {$filter: {
				    input: '$$this.lvl3',
				    as: 'qq',
				    cond:{$gt:['$$qq.c',2]}
				}}  }
			       ]}]]} }} },
			       in: {
				   lvl2: '$$tmp_l2', // easy
				   // Need to create of max for inner lvl3, then the max of
				   // those for all of lvl2:
				   max_c: {$max: {$reduce: {input: '$$tmp_l2',
					    initialValue: [],
					    in:
					    // $$this is an object so we can dotpath to lvl3
					    // and apply one addition dotpath to
					    // extract an array of ONLY c:
					    {$concatArrays: ['$$value', [{$max: '$$this.lvl3.c'}]]}
					   }}
					  }

				} // close 'in'
			    }}
			]}
		    ]]
	       }
	   }}
	}}
]);

Like this? Dislike this? Let me know


Site copyright © 2013-2024 Buzz Moschetti. All rights reserved