jq vs. python: Which is Better for Scripting With JSON?

2-Feb-2021

jq is the de facto standard command line / shell script utility for dealing with JSON. In the new JSON data ecosystem, it is as prevalent, important, and useful as sed and awk. Thanks to a compact expression syntax and a good approach to dealing with arrays, jq makes it easy to extract values from complex JSON structures.
But is it really the best multipurpose tool for the job? As your process control, data interrogation, and command input/output needs become more complex, running it all in python is clearly more attractive.

Some Quick Examples In Defense of jq

Consider this JSON:

{
  "items": [
    {"name":"corn", "id":1, "hist": [ ... ]},
    {"name":"wheat", "id":2, "hist": [ ..., {"v":400} ]},
    {"name":"rice", "id":3, "hist": [ ..., {"v":600} ]}
  ]
}

How might we find the value of the last entry in the hist array for everything except corn?

cat thatJson | jq -r '.items[] | select(.name != "corn") | "\(.name) was \(.hist[-1].v)" ' 
wheat was 400
rice was 600

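There is plenty of material on the web to describe the jq commands and expressions, but let us just run through our example. .items[] emits each element of the items array in turn; select(.name != "corn") passes along only the elements whose name is not corn; and the final string uses \(...) interpolation to combine the name with the v field of the last hist entry (index -1 counts from the end of the array). The -r (raw) option tells jq to emit strings without the JSON double quote wrappers, which is useful in shell scripting.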
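
For comparison, the same extraction can be sketched in Python. This is only an illustration: the sample document is hypothetical (only the final v values, 400 and 600, come from the output above), and the helper name last_hist_values is made up for this sketch.

```python
import json

# Hypothetical sample shaped like the JSON above; only the last "v" values
# for wheat and rice (400 and 600) are taken from the article's output.
SAMPLE = """
{"items": [
  {"name": "corn",  "id": 1, "hist": [{"v": 100}, {"v": 200}]},
  {"name": "wheat", "id": 2, "hist": [{"v": 300}, {"v": 400}]},
  {"name": "rice",  "id": 3, "hist": [{"v": 500}, {"v": 600}]}
]}
"""

def last_hist_values(doc, skip="corn"):
    # Walk every item, drop the one named `skip`, and report the "v" field
    # of the last hist entry (index -1 counts from the end, as in jq).
    return [
        f"{item['name']} was {item['hist'][-1]['v']}"
        for item in doc["items"]
        if item["name"] != skip
    ]

lines = last_hist_values(json.loads(SAMPLE))
print("\n".join(lines))
```

For a one-off extraction like this the jq version is shorter; the Python version starts to pay off once the parsed document is reused for several lookups.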

Many cloud provider CLIs return complex JSON shapes; jq is a superb way to work with this content. For example, launching an AWS VM returns a complex structure -- but your VM is not ready yet. You must poll to find out when it is actually running. This is easily done in a shell script:

# The tag specification is quoted so the shell does not brace-expand {Key=...,Value=...}
aws ec2 run-instances \
    --count 1 \
    --instance-type myType \
    --security-group-ids quicklaunch-1 \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Hello}]' > zzz

IID=$(jq -r '.Instances[].InstanceId' zzz)
echo launched ID $IID

while true
do
    STATE=$(aws ec2 describe-instances --instance-ids $IID | jq -r '.Reservations[].Instances[].State.Name')
    if [ "$STATE" == "running" ]; then
        break
    fi
    sleep 10
done

# A bit of inefficiency calling it again (the PublicIpAddress field was in fact present
# in the payload where the State.Name was "running") but it is harder to deal with
# multiple value assignments in the shell 
IP=$(aws ec2 describe-instances --instance-ids $IID | jq -r '.Reservations[].Instances[].PublicIpAddress')
echo "IP: $IP"

Here is a one-liner complete with VT100 color output that reports on AWS EC2 instances. Be careful to distinguish the shell pipes from the jq pipes! Too bad we don't have printf formatting directly in jq as we could eliminate using awk. Since we are using awk, arguably it would be better to put the color setting conditional logic in there instead but we show it in jq just to show off a bit.

aws ec2 describe-instances \
   | jq -r '.Reservations[].Instances[]
      | if .State.Name == "running" then .COLOR="\u001B[32m" elif .State.Name == "stopped" then .COLOR="\u001B[31m" else .COLOR="\u001B[34m" end
      | "\(.COLOR)\(.State.Name)\u001b[30m \(.InstanceId) \(.Tags[] | select(.Key == "Name") | .Value) \(.InstanceType) \(.LaunchTime) \(.PublicIpAddress)"'

Here it is again with some additional empty Tags protection (shout out to David-Z), plus color assignment and a little more output control in awk (and some escaped newlines for clarity). Note that since Instance Name in particular might have spaces, we use tilde as a delimiter (too many bars already, and colon and comma may pop up in Name and/or IP as well).

aws ec2 describe-instances \
  | jq -r '.Reservations[].Instances[] | "\(.State.Name)~\(.InstanceId)~\(first(.Tags[] | select(.Key == "Name").Value)? // "(none)")~\(.InstanceType)~\(.LaunchTime)~\(.PublicIpAddress)" ' \
  | awk -F '~' '{ ip=""; if($1=="running"){color=32;ip=$6} else if($1=="stopped"){color=31} else {color=34} ; printf "\033[%dm%4.4s\033[30m %-20.20s %16.16s %12.12s %s %s\n", color, $1, $2, $3, $4, $5, ip;}'
Note this snazzy construction. The Tags field is an array of struct of string Key, string Value pairs:
  first(.Tags[] | select(.Key == "Name").Value)? // "(none)"
This selects ONLY the first struct in the tags array where field Key is "Name" and then extracts the Value from that struct. If anything does not resolve -- i.e. Tags is nonexistent, or is length zero, or has no Key == "Name" pair -- the question mark operator will "catch" any error and the alternate operator // will be used to yield the string "(none)".
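
The same "first match or default" lookup can be sketched in Python. The function name name_tag and the sample instances are hypothetical; only the Tags / Key / Value shape comes from the description above.

```python
def name_tag(instance, default="(none)"):
    # Tags may be absent or None; "or []" treats both like an empty list.
    tags = instance.get("Tags") or []
    # next() returns the first matching Value, or None if nothing matches.
    value = next((t.get("Value") for t in tags if t.get("Key") == "Name"), None)
    return default if value is None else value

# Hypothetical instances covering the cases discussed above
with_name = {"Tags": [{"Key": "env", "Value": "dev"}, {"Key": "Name", "Value": "web 1"}]}
no_tags = {}
empty_tags = {"Tags": []}
other_tags = {"Tags": [{"Key": "env", "Value": "dev"}]}

for inst in (with_name, no_tags, empty_tags, other_tags):
    print(name_tag(inst))
```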

Shell scripts, heredocs, backgrounding, and jq team up to make a powerful, compact, and performant ensemble:

for name in A B C D E F
do
    cat <<EOS > $TF
aws ec2 run-instances ...
(various commands here)
EOS
    bash $TF > /tmp/$name.response.json &   # Background!
done

# Now wait for all those parallel executions to complete.
# This is a very powerful and useful idiom:  Easily launch a bunch of things
# in the background with '&' and then wait for them all to finish:
wait

# When control returns here, A-F.response.json will have all JSON outputs which
# may be accessed via jq

So Where Does It Start To Get Thorny?

At some point you need to start capturing and working with return codes, stdout, and stderr from these tasks. You'll also want the ability to easily examine the entire JSON data structure -- and potentially modify it -- and save it without rereading it over and over. Recall the "inefficiency" above. It is a lot less elegant when the shell has to deal with more than one piece of data coming out of jq:

while true; do
    read -r name ip <<<$(aws ec2 describe-instances --instance-ids $IID | jq -r '.Reservations[].Instances[] | "\(.State.Name) \(.PublicIpAddress)"')
    [ "$name" == "running" ] && break || sleep 10
done
This also starts to bring whitespace management, quotes, and delimiters back into the picture. The basic problem is that outside of jq (and back in the shell), you lose the ability to fluidly work with rich structures.

In general, commands executed in a shell script need to do this:

    command args 1>theStdout.txt 2>theStderr.txt ; returncode=$?

For small scripts (or those where precise capture of outputs is not important) this is easy to manage but anything involving loops means that the trio of (returncode, stdout, stderr) needs to be captured and managed which means a whole mess of /tmp files. It is tempting to use the following:

   MYVAR=$(command args 2>&1) ; RC=$?

but of course in this configuration you cannot distinguish stdout from stderr and likely will rely solely on the return code.

A second point of irritation with shell scripts is arguments and quoting. Simple string and integer arguments work fine but consider trying to pass this to a command

    command --opt1 val --opt2 "val2 val3" --opt3 " \"val4\" " --opt4 ' "noInterpInsideSingleQuotes" ' ...
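
To see exactly what the command receives, Python's standard shlex module can tokenize that line the way a POSIX shell would. This is a small sketch; the command and option names are just the placeholders from the line above.

```python
import shlex

# The shell line from above, kept verbatim in a raw string
line = r"""command --opt1 val --opt2 "val2 val3" --opt3 " \"val4\" " --opt4 ' "noInterpInsideSingleQuotes" '"""

# shlex.split applies POSIX shell quoting rules and returns the argument list
argv = shlex.split(line)
for arg in argv:
    print(repr(arg))

# The same arguments written directly as a Python list: no escaping layer at all
direct = ["command", "--opt1", "val", "--opt2", "val2 val3",
          "--opt3", ' "val4" ', "--opt4", ' "noInterpInsideSingleQuotes" ']
print(argv == direct)
```

Writing the arguments as a list means each value is stated once, exactly as the command will see it.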

Lastly, complex workflows touching different parts of the JSON data lead to lots of individual jq executions, each reading JSON input, modifying it, and writing it back out to a tmp file to protect against clobbering the file if a failure occurs:

  QQ=$(jq -r '.aaa.bbb' $FILE)
  if [ condition ] ; then
      # += updates in place and emits the whole document, not just the sub-object
      jq -r '.this.that += {"foo":"bar"}' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
      jq -r '.other += {"code":401}' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
  fi
  jq -r '.status = "COMPLETE"' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
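
As a rough sketch of the alternative, the same workflow in Python reads the file once, makes every change in memory, and writes once. The function name update_document, the condition callback, and the sample content are all hypothetical; the field names mirror the placeholders above.

```python
import json
import os
import tempfile

def update_document(path, condition):
    # Read the JSON file once and keep the whole structure in memory.
    with open(path) as fh:
        doc = json.load(fh)

    qq = doc["aaa"]["bbb"]
    if condition(qq):
        doc["this"]["that"]["foo"] = "bar"
        doc["other"]["code"] = 401
    doc["status"] = "COMPLETE"

    # Write to a sibling temp file, then rename it over the original so a
    # failure in the middle of writing does not clobber the file.
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(doc, fh, indent=2)
    os.replace(tmp, path)
    return doc

# Demonstration on a throwaway file with made-up content
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "data.json")
    with open(path, "w") as fh:
        json.dump({"aaa": {"bbb": 7}, "this": {"that": {}}, "other": {}}, fh)
    update_document(path, condition=lambda qq: qq > 5)
    with open(path) as fh:
        saved = json.load(fh)
print(saved)
```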

python3 to the rescue

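python3 brings a couple of big assets to the table: the json module, which turns JSON into native dicts and lists that can be inspected and modified in place, and the subprocess module, which runs a command from an argument list and captures its return code, stdout, and stderr.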

So all we really need is some guidance on simplifying running processes synchronously vs. in the background with wait.

Running a synchronous command from python is easy:

import subprocess
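# text=True decodes stdout and stderr to str instead of bytes
p1 = subprocess.run(['ls', '-l'], capture_output=True, text=True)
if 0 != p1.returncode:
    print("ERROR: ", p1.stderr)
else:
    print(repr(p1.stdout))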

'total 120\n-rwxr-xr-x  1 user  staff    57 Feb  1 15:29 args.sh\n-rwxr-xr-x  1 user  staff    37 Feb  1 15:27 args.sh~\n-rw-r ...
What is very important to appreciate here is that with the capture_output=True option, stdout and stderr are captured as attributes of the process object p1. Combined with the returncode attribute, you now have full programmatic inspection of the inputs and output without resorting to file descriptor redirections and /tmp files.
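
A small sketch makes the separation concrete. The child process here is just the Python interpreter itself, running a made-up one-liner that writes to both streams and exits with code 3, so the example does not depend on any external command.

```python
import subprocess
import sys

# Made-up child program: one line to stdout, one to stderr, exit code 3
child = "import sys; print('to stdout'); print('to stderr', file=sys.stderr); sys.exit(3)"

# capture_output=True keeps the two streams separate; text=True decodes them to str
p = subprocess.run([sys.executable, "-c", child], capture_output=True, text=True)

print("rc:    ", p.returncode)
print("stdout:", repr(p.stdout))
print("stderr:", repr(p.stderr))
```

All three pieces arrive on one object, with no redirections and no /tmp files to clean up.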

Let's compare the aws example:

import subprocess
import json
import time

# Every argument must be a string, and each list element needs its comma
cmd = ['aws', 'ec2', 'run-instances',
       '--count', '1',
       '--instance-type', 'myType',
       '--tag-specifications', 'ResourceType=instance,Tags=[{Key=Name,Value=Hello}]']

p1 = subprocess.run(cmd, capture_output=True)
if 0 == p1.returncode:
    data = json.loads(p1.stdout)
    iid = data['Instances'][0]['InstanceId']

    cmd2 = ['aws','ec2','describe-instances','--instance-ids', iid]
    while True:
        p2 = subprocess.run(cmd2, capture_output=True)
        if 0 == p2.returncode:
            rr = json.loads(p2.stdout)
            inst = rr['Reservations'][0]['Instances'][0]
            if "running" == inst['State']['Name']:
                # State and IP come from the same payload; no second call needed
                ip = inst['PublicIpAddress']
                break
        time.sleep(10)
    print("IP:", ip)
It turns out to be only a few lines longer than the shell / jq version. And it is beginning to become clear that working with the whole data structure and applying if/then/else logic is easier in python. In the example here we dive straight to rr['Reservations'][0]['Instances'][0] but the entire data structure is now easily available to us; jq is not designed to allow "random access" to the structure. Although the piece-wise argument array may be a bit offputting at first compared to a shell command, it facilitates programmatic construction and again it eliminates the headache of escaping quotes.
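
One way to keep such a polling loop tidy is to separate "fetch the payload" from "decide what it means". The sketch below is an assumption-laden illustration: instance_summary and wait_until_running are hypothetical names, and the payloads are fakes shaped like the describe-instances structure used above, so the logic can be tried without an AWS account.

```python
import time

def instance_summary(payload):
    # Pull both values out of one parsed payload in a single pass.
    inst = payload["Reservations"][0]["Instances"][0]
    return inst["State"]["Name"], inst.get("PublicIpAddress")

def wait_until_running(fetch, interval=10.0, max_tries=30, sleep=time.sleep):
    # fetch is any callable that returns the parsed describe-instances JSON.
    for _ in range(max_tries):
        state, ip = instance_summary(fetch())
        if state == "running":
            return ip
        sleep(interval)
    raise TimeoutError("instance did not reach the running state")

# Fake payloads standing in for two successive describe-instances calls
fake_payloads = iter([
    {"Reservations": [{"Instances": [{"State": {"Name": "pending"}}]}]},
    {"Reservations": [{"Instances": [{"State": {"Name": "running"},
                                      "PublicIpAddress": "203.0.113.10"}]}]},
])

ip = wait_until_running(lambda: next(fake_payloads), interval=0, sleep=lambda s: None)
print("IP:", ip)
```

In real use, fetch would wrap subprocess.run on the describe-instances command and json.loads on its stdout; the max_tries bound is an addition so the loop cannot spin forever.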

It is also possible to run commands "in the background" by using the lower-level Popen class. With a little bit of extra work we can create a background group object upon which a "wait" can be simulated, as follows:

import subprocess
class BG:
    def __init__(self):
        self.items = []
    def launch(self, id, args):
        oo = {"id":id}
        oo['p'] = subprocess.Popen([str(x) for x in args], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        self.items.append(oo)   # remember the process so wait() can find it
    def wait(self):
        for oo in self.items:
            (oo['stdout'], oo['stderr']) = oo['p'].communicate()
            oo['rc'] = oo['p'].returncode
    def results(self): return self.items

bg = BG()
for n in range(0,3):
    bg.launch(n, ['aws', 'ec2', 'run-instances', ... ])

# Three run-instances launched in background; wait for them:
bg.wait()

# This is the useful part: The bg results easily capture returncode, stdout, and stderr:
for oo in bg.results():
    print(oo['id'], oo['rc'])
    print('STDOUT: ', oo['stdout'])
    print('STDERR: ', oo['stderr'])
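
As an alternative sketch (a different technique from the Popen-based class above, not a replacement for it), the standard concurrent.futures module can run several subprocess.run calls in parallel and collect the same trio of values. The jobs here are made-up Python one-liners so the example runs anywhere; run_one is a hypothetical helper name.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_one(job):
    # Each job is (id, argument list); every argument is converted to str.
    job_id, args = job
    p = subprocess.run([str(x) for x in args], capture_output=True, text=True)
    return {"id": job_id, "rc": p.returncode, "stdout": p.stdout, "stderr": p.stderr}

# Made-up jobs: each child prints its number and exits with that number
jobs = [
    (n, [sys.executable, "-c", f"import sys; print('job', {n}); sys.exit({n})"])
    for n in range(3)
]

# Leaving the with-block waits for all workers; map() keeps results in job order
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_one, jobs))

for oo in results:
    print(oo["id"], oo["rc"], repr(oo["stdout"]))
```

The max_workers setting also gives a simple cap on how many commands run at once, which the shell '&' idiom does not provide by itself.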


Site copyright © 2013-2021 Buzz Moschetti. All rights reserved