jq vs. python: Which is Better for Scripting With JSON?

2-Feb-2021

jq (available here) is the de facto standard command line / shell script utility for dealing with JSON. In the new JSON data ecosystem, it is as prevalent, important, and useful as sed and awk. Thanks to a compact expression syntax and a good approach to dealing with arrays, jq makes it easy to extract values from complex JSON structures.
But is it really the best multipurpose tool for the job, especially as the data manipulation requirements grow? As your process control, data interrogation, and command input/output needs become more complex, running it all in python becomes clearly more attractive.

Some Quick Examples In Defense of jq

Consider this JSON:

{
  "sector":"ABC",
  "items": [
    {"name":"corn", "id":1, "hist": [
    		    {"d":"2020-01-01","v":100},
		    {"d":"2020-01-02","v":200}
		    ]},
    {"name":"wheat", "id":2, "hist": [
    		     {"d":"2020-01-03","v":300},
		     {"d":"2020-01-04","v":400}
		     ]},
    {"name":"rice", "id":3, "hist": [
    		     {"d":"2021-01-03","v":500},
		     {"d":"2021-01-04","v":600}
		     ]}
   ]
}

How might we find the value of the last entry in the hist array for everything except corn?


cat thatJson | jq -r '.items[] | select(.name != "corn") | "\(.name) was \(.hist[-1].v)" ' 
wheat was 400
rice was 600

There is plenty of material on the web to describe the jq commands and expressions but let us just run through our example:

The dot represents the incoming object. Adding field names after the dot is essentially dot notation, allowing you to drill into a structure e.g. .fld.fld2.fld3
The array operator [] with no arguments (e.g. [] not [0]) "unwinds" the array, turning it into a sequential set of individual objects
... which are then piped with the | operator into a select statement that excludes corn
... which are then piped to a string expression that uses interpolation to pluck the name and the last ([-1]) history array value and print them.

The -r (raw) option tells jq to emit strings without the proper JSON double quote wrappers, which is useful in shell scripting.
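
For comparison, here is a rough Python sketch of the same extraction. The sample JSON is embedded so the snippet runs on its own; the function name last_values and its skip parameter are made up for illustration and are not part of any library.

```python
import json

# The sample document from above, embedded so the sketch is self-contained.
doc = json.loads("""
{
  "sector": "ABC",
  "items": [
    {"name": "corn",  "id": 1, "hist": [{"d": "2020-01-01", "v": 100}, {"d": "2020-01-02", "v": 200}]},
    {"name": "wheat", "id": 2, "hist": [{"d": "2020-01-03", "v": 300}, {"d": "2020-01-04", "v": 400}]},
    {"name": "rice",  "id": 3, "hist": [{"d": "2021-01-03", "v": 500}, {"d": "2021-01-04", "v": 600}]}
  ]
}
""")

def last_values(doc, skip="corn"):
    """Return 'name was value' lines for every item except `skip` (hypothetical helper)."""
    return [
        f"{item['name']} was {item['hist'][-1]['v']}"   # [-1] is the last hist entry, as in jq
        for item in doc["items"]                        # plays the role of .items[]
        if item["name"] != skip                         # plays the role of select(.name != "corn")
    ]

for line in last_values(doc):
    print(line)
```

It is longer than the jq one-liner, which is the point of this section: for pure extraction, jq is hard to beat.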

Many cloud provider CLIs return complex JSON shapes; jq is a superb way to work with this content. For example, launching an AWS VM returns a complex structure -- but your VM is not ready yet. You must poll to find out when it is actually running. This is easily done in a shell script:

aws ec2 run-instances \
	  --count 1 \
	  --instance-type myType \
	  --security-group-ids quicklaunch-1 \
	  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Hello}]' > zzz

IID=$(jq -r '.Instances[].InstanceId' zzz)
echo launched ID $IID

#  The simple way to do this is just use the 'wait' command:
#    aws ec2 wait instance-running --instance-ids $IID
#  But we show the polling solution below because this gives the opportunity
#  to do something while waiting (like printing status or dots for each loop, etc.)
while true
do
    STATE=$(aws ec2 describe-instances --instance-ids $IID | jq -r '.Reservations[].Instances[].State.Name')
    if [ "$STATE" == "running" ]; then
	break
    fi
    sleep 10
done

# A bit of inefficiency calling it again (the PublicIpAddress field was in fact present
# in the payload where the State.Name was "running") but it is harder to deal with
# multiple value assignments in the shell 
IP=$(aws ec2 describe-instances --instance-ids $IID | jq -r '.Reservations[].Instances[].PublicIpAddress')
echo "IP: $IP"

Here is a one-liner complete with VT100 color output that reports on AWS EC2 instances. Be careful to distinguish the shell pipes from the jq pipes! Too bad we don't have printf formatting directly in jq as we could eliminate using awk. Since we are using awk, arguably it would be better to put the color setting conditional logic in there instead but we show it in jq just to show off a bit.


aws ec2 describe-instances \
   | jq -r '.Reservations[].Instances[]
     | if .State.Name == "running" then .COLOR="\u001B[32m"
       elif .State.Name == "stopped" then .COLOR="\u001B[31m"
       else .COLOR="\u001B[34m" end
      | "\(.COLOR)\(.State.Name)\u001b[30m \(.InstanceId) \(.Tags[]
        | select(.Key == "Name") | .Value) 
          \(.InstanceType) \(.LaunchTime) \(.PublicIpAddress)"'

and with some additional empty Tags protection (shout out to David-Z), color assignment and a little
more output control in awk (and some escaped CRs for clarity).  
Note that since Instance Name in particular might have spaces, we use tilde
as a delimiter (too many bars already and colon and comma may pop up in Name
and/or IP as well).

aws ec2 describe-instances \
  | jq -r '.Reservations[].Instances[] 
    | "\(.State.Name)~\(.InstanceId)~\(first(.Tags[]
    | select(.Key == "Name").Value)? // "(none)")~\(.InstanceType)
        ~\(.LaunchTime)~\(.PublicIpAddress)" ' \
    | awk -F '~' '{ ip=""; if($1=="running"){color=32;ip=$6} else
      if($1=="stopped"){color=31} else {color=34} ;
      printf "\033[%dm%4.4s\033[30m %-20.20s %16.16s %12.12s %s %s\n",
      color, $1, $2, $3, $4, $5, ip;}'

Note this snazzy construction. The Tags field is an array of struct of string Key, string Value pairs:

  first(.Tags[] | select(.Key == "Name").Value)? // "(none)"

This selects ONLY the first struct in the tags array where field Key is "Name" and then extracts the Value from that struct. If anything does not resolve -- i.e. Tags is nonexistent, or is length zero, or has no Key == "Name" pair -- the question mark operator will "catch" any error and the alternate operator // will yield the string "(none)".
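
For readers more at home in Python, the same "first match or default" idea can be sketched as below. This is an approximation of the jq expression, not an exact translation; the function name name_tag is hypothetical.

```python
def name_tag(instance, default="(none)"):
    """Return the Value of the first tag whose Key is "Name", else `default`.

    Hypothetical helper approximating:
        first(.Tags[] | select(.Key == "Name").Value)? // "(none)"
    """
    tags = instance.get("Tags") or []   # covers a missing or null Tags field
    value = next((t.get("Value") for t in tags if t.get("Key") == "Name"), None)
    return default if value is None else value


# A few instance shapes, invented for illustration:
print(name_tag({"Tags": [{"Key": "Env", "Value": "dev"}, {"Key": "Name", "Value": "web 1"}]}))
print(name_tag({"Tags": []}))
print(name_tag({}))
```

The `or []` guard does the job of the question mark operator, and the default passed to next() does the job of the // alternate operator.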

Shell scripts, heredocs, backgrounding, and jq team up to make a powerful, compact, and performant ensemble:

for name in A B C D E F
do
    TF=/tmp/$name.$$.cmd
    cat <<EOS > $TF
aws ec2 run-instances ...
(various commands here)
EOS
    bash $TF > /tmp/$name.response.json &   # Background!
done

# Now wait for all those parallel executions to complete.
# This is a very powerful and useful idiom:  Easily launch a bunch of things
# in the background with '&' and then wait for them all to finish:
wait

# When control returns here, A-F.response.json will have all JSON outputs which
# may be accessed via jq

So Where Does It Start To Get Thorny?

At some point you need to start capturing and working with return codes, stdout, and stderr from these tasks. You'll also want the ability to easily examine the entire JSON data structure -- and potentially modify it -- and save it without rereading it over and over. Recall the "inefficiency" above. It is a lot less elegant when the shell has to deal with more than one piece of data coming out of jq:

while true
do
  read -r name ip <<<$(aws ec2 describe-instances --instance-ids $IID \
  | jq -r '.Reservations[].Instances[] | "\(.State.Name) \(.PublicIpAddress)"'
    ...

This also starts to bring whitespace management, quotes, and delimiters back into the picture. The basic problem is that outside of jq (and back in the shell), you lose the ability to fluidly work with rich structures.

In general, commands executed in a shell script need to do this:


    command args 1>theStdout.txt 2>theStderr.txt ; returncode=$?

For small scripts (or those where precise capture of outputs is not important) this is easy to manage, but anything involving loops means that the trio of (returncode, stdout, stderr) must be captured and managed on every iteration, which means a whole mess of /tmp files. It is tempting to use the following:


   MYVAR=$(command args 2>&1) ; RC=$?

but of course in this configuration you cannot distinguish stdout from stderr and likely will rely solely on the return code.

A second point of irritation with shell scripts is arguments and quoting. Simple string and integer arguments work fine but consider trying to pass this to a command:


    command --opt1 val --opt2 "val2 val3" --opt3 " \"val4\" " \
       --opt4 ' "noInterpInsideSingleQuotes" ' ...

Lastly, complex workflows touching different parts of the JSON data lead to lots of individual jq executions, each reading JSON input, modifying it, and writing it back out to a tmp file to protect against clobbering the file if a failure occurs:


  QQ=$(jq -r '.aaa.bbb' $FILE)
  if [ condition ] ; then
      jq '.this.that += {"foo":"bar"}' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
  else
      jq '.other += {"code":401}' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
  fi
  jq '.status = "COMPLETE"' $FILE > $FILE.tmp && mv $FILE.tmp $FILE

python3 to the rescue

python3 brings a few big assets to the table:

The subprocess package of python3 nicely handles running commands both in the foreground (synchronous) and the background (asynchronous)
Support for heredocs via the triple-quote string declaration, including variable substitution
And of course, the json package makes slurping JSON and working with it as a native dict easy as pie.
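
To make the json point concrete, here is a minimal sketch of the earlier multi-step jq edit done with one read and one write. The helper name update_json, the mutate callback, and the test on aaa.bbb (standing in for the unspecified "condition") are all hypothetical, invented for illustration.

```python
import json
import os
import tempfile

def update_json(path, mutate):
    """Load JSON from `path`, apply `mutate(data)` in memory, then replace the file."""
    with open(path) as f:
        data = json.load(f)
    mutate(data)
    # Write to a temp file in the same directory, then rename it over the original,
    # mirroring the "> $FILE.tmp && mv $FILE.tmp $FILE" idiom from the shell version.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as out:
            json.dump(data, out, indent=2)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)      # leave the original untouched if anything failed
        raise
    return data

def mutate(data):
    # Hypothetical condition and field names, echoing the jq example.
    if data["aaa"]["bbb"] == "ok":
        data["this"]["that"]["foo"] = "bar"
    else:
        data["other"]["code"] = 401
    data["status"] = "COMPLETE"

# Demo against a throwaway file so the sketch can be run as-is.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "state.json")
    with open(path, "w") as f:
        json.dump({"aaa": {"bbb": "ok"}, "this": {"that": {}}, "other": {}}, f)
    result = update_json(path, mutate)
    print(result["status"], result["this"]["that"])
```

All three modifications happen on the in-memory dict, so the file is parsed once and written once regardless of how many branches touch it.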

So all we really need is some guidance on simplifying running processes synchronously vs. in the background with wait.

Running a synchronous command from python is easy:

import subprocess
p1 = subprocess.run(['ls', '-l'], capture_output=True)
if 0 == p1.returncode:
    print(p1.stdout)
else:
    print("ERROR: ", p1.stderr)

0
'total 120
-rwxr-xr-x  1 user  staff    57 Feb  1 15:29 args.sh
-rwxr-xr-x  1 user  staff    37 Feb  1 15:27 args.sh~
-rw-r ...

What is very important to appreciate here is that with the capture_output=True option, stdout and stderr are captured as attributes of the process object p1. Combined with the returncode attribute, you now have full programmatic inspection of the inputs and output without resorting to file descriptor redirections and /tmp files.

Let's compare the aws example:

import subprocess
import json
import time

# Every element of the argument list must be a string, including the count.
cmd = ['aws', 'ec2', 'run-instances',
       '--count', '1',
       '--instance-type', 'myType',
       '--tag-specifications', 'ResourceType=instance,Tags=[{Key=Name,Value=Hello}]']

p1 = subprocess.run(cmd, capture_output=True)
if 0 == p1.returncode:
    data = json.loads(p1.stdout)
    iid = data['Instances'][0]['InstanceId']

    cmd2 = ['aws', 'ec2', 'describe-instances', '--instance-ids', iid]
    while True:
        p2 = subprocess.run(cmd2, capture_output=True)
        if 0 == p2.returncode:
            rr = json.loads(p2.stdout)
            inst = rr['Reservations'][0]['Instances'][0]
            if "running" == inst['State']['Name']:
                ip = inst['PublicIpAddress']
                break
        time.sleep(10)  # same polling interval as the shell version

It turns out to be only a few lines longer than the shell / jq version. And it is beginning to become clear that working with the whole data structure and applying if/then/else logic is easier in python. In the example here we dive straight to rr['Reservations'][0]['Instances'][0] but the entire data structure is now easily available to us; jq is not designed to allow easy "random access" to all levels of the structure once you have begun to descend into the nested layers. Although the piece-wise argument array may be a bit off-putting at first compared to a shell command, it facilitates programmatic construction and again it eliminates the headache of escaping quotes.
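
As a small illustration of the quoting point, the sketch below passes the awkward arguments from the earlier shell quoting example as a plain list. The child command is a stand-in chosen for this demo (the current Python interpreter running a one-liner that echoes its arguments back as JSON), not anything from the original workflow.

```python
import json
import subprocess
import sys

# Awkward arguments echoing the shell example: embedded spaces and double quotes.
args = ["--opt1", "val",
        "--opt2", "val2 val3",
        "--opt3", ' "val4" ',
        "--opt4", ' "noInterpInsideSingleQuotes" ']

# Stand-in child process (an assumption for illustration): prints the
# arguments it actually received as a JSON array.
child = [sys.executable, "-c", "import json, sys; print(json.dumps(sys.argv[1:]))"]

p = subprocess.run(child + args, capture_output=True, text=True)
received = json.loads(p.stdout)

# Each list element arrives as exactly one argument, with no escaping needed.
print(received == args)
```

Because no shell is involved, there is no second layer of parsing to defend against; what is in the list is what the child sees.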

It is also possible to run commands "in the background" by using the lower-level Popen class. With a little bit of extra work we can create a background group object upon which a "wait" can be emulated, as follows:

import subprocess
    
class BG:
    def __init__(self):
        self.items = []
    def launch(self, id, args):
        oo = {"id":id}
        oo['p'] = subprocess.Popen([str(x) for x in args],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        self.items.append(oo)
    def wait(self):
        for oo in self.items:
            (oo['stdout'], oo['stderr']) = oo['p'].communicate()
            oo['rc'] = oo['p'].returncode
    def results(self): return self.items

bg = BG()
for n in range(0,3):
    bg.launch(n, ['aws', 'ec2', 'run-instances', ... ])

# Three run-instances launched in background; wait for them:
bg.wait()

# This is the useful part: The bg results easily capture returncode, stdout, and stderr:
for oo in bg.results():
    print(oo['id'], oo['rc'])
    print('STDOUT: ', oo['stdout'])
    print('STDERR: ', oo['stderr'])
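
Once the trio is captured per task, turning the successful outputs into native dicts is a short step. The sketch below is a compact, function-based variant of the same Popen idea so that it runs on its own; the function names launch_all and collect and the stand-in child commands (small Python one-liners, one of which fails on purpose) are assumptions for the demo, not part of the example above.

```python
import json
import subprocess
import sys

def launch_all(commands):
    """Start every command without waiting; return a list of (id, Popen) pairs."""
    return [(cid, subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE, text=True))
            for cid, cmd in commands]

def collect(procs):
    """Wait for each process and return {id: {"rc", "stdout", "stderr"}}."""
    results = {}
    for cid, p in procs:
        out, err = p.communicate()   # blocks until this child has finished
        results[cid] = {"rc": p.returncode, "stdout": out, "stderr": err}
    return results

def ok_cmd(name):
    # Stand-in for a real CLI call: prints a tiny JSON document.
    return [sys.executable, "-c", f"import json; print(json.dumps({{'name': '{name}'}}))"]

# Stand-in for a failing call: writes to stderr and exits non-zero.
fail_cmd = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]

results = collect(launch_all([("A", ok_cmd("A")), ("B", ok_cmd("B")), ("C", fail_cmd)]))

# Successful outputs become dicts; failures keep their stderr for reporting.
parsed = {cid: json.loads(r["stdout"]) for cid, r in results.items() if r["rc"] == 0}
failed = {cid: r["stderr"] for cid, r in results.items() if r["rc"] != 0}

print(parsed)
print(failed)
```

This plays the role of the A-F.response.json files in the shell version, except that the return code and stderr travel with each result instead of living in separate /tmp files.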
