Rich data on the CLI

07/29/2011

I’ve often wondered how things will change in a world where everything is a REST API and how relevant our Unix CLI tool chain will be in the long run. I’ve known we needed CLI ways to interact with data – like JSON data – and have given this a lot of thought.

MS PowerShell does some pretty impressive object parsing on its CLI but I was never really sure how close we could get to that in Unix. I wanted to start my journey with the grep utility as that seemed a natural starting point and is my most used CLI tool.

I have no idea how to write parsers and matchers but luckily I have a very talented programmer working for me who was able to take my ideas and realize them awesomely. Pieter wrote a JSON grep and I want to show off a few bits of what it can do.

I’ll work with the document below:

[
  {"name":"R.I.Pienaar",
   "contacts": [
                 {"protocol":"twitter", "address":"ripienaar"},
                 {"protocol":"email", "address":"rip@devco.net"},
                 {"protocol":"msisdn", "address":"1234567890"}
               ]
  },
  {"name":"Pieter Loubser",
   "contacts": [
                 {"protocol":"twitter", "address":"pieterloubser"},
                 {"protocol":"email", "address":"foo@example.com"},
                 {"protocol":"msisdn", "address":"1234567890"}
               ]
  }
]

There are a few interesting things to note about this data:

  • The document is an array of hashes. This maps well to the stream of data paradigm we know from lines of text in a file, and it is the basic structure jgrep works on.
  • Each document has another nested set of documents in an array – the contacts array.

Examples


The examples below show a few possible grep use cases:

A simple grep for a single key in the document:

$ cat example.json | jgrep "name='R.I.Pienaar'"
[
  {"name":"R.I.Pienaar",
   "contacts": [
                 {"protocol":"twitter", "address":"ripienaar"},
                 {"protocol":"email", "address":"rip@devco.net"},
                 {"protocol":"msisdn", "address":"1234567890"}
               ]
  }
]

We can extract a single key from the result:

$ cat example.json | jgrep "name='R.I.Pienaar'" -s name
R.I.Pienaar

A simple grep for 2 keys in the document:

$ cat example.json |
    jgrep "name='R.I.Pienaar' and contacts.protocol=twitter" -s name
R.I.Pienaar

The nested documents pose a problem though. If we were to search for contacts.protocol=twitter and contacts.address=1234567890 we would get both documents back rather than none, because each term can be satisfied by a different entry in the contacts array. To search the sub documents effectively we need to ensure that these 2 values exist in the same sub document.

$ cat example.json | 
     jgrep "[contacts.protocol=twitter and contacts.address=1234567890]"

Placing [] around the 2 terms works like () but restricts the search to the specific sub document. In this case there is no sub document in the contacts array that has both twitter and 1234567890.
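
To make the difference concrete, here is a rough Python sketch of the two matching behaviours against the example document. It is not how jgrep is implemented, and the function names flat_match and scoped_match are made up purely for illustration.

```python
import json

# The example document from the top of this post.
DOC = json.loads("""
[
  {"name": "R.I.Pienaar",
   "contacts": [
     {"protocol": "twitter", "address": "ripienaar"},
     {"protocol": "email", "address": "rip@devco.net"},
     {"protocol": "msisdn", "address": "1234567890"}
   ]},
  {"name": "Pieter Loubser",
   "contacts": [
     {"protocol": "twitter", "address": "pieterloubser"},
     {"protocol": "email", "address": "foo@example.com"},
     {"protocol": "msisdn", "address": "1234567890"}
   ]}
]
""")


def flat_match(person, protocol, address):
    """Without []: each term may be satisfied by a different contact."""
    contacts = person["contacts"]
    return (any(c["protocol"] == protocol for c in contacts)
            and any(c["address"] == address for c in contacts))


def scoped_match(person, protocol, address):
    """With []: both terms must hold inside one and the same contact."""
    return any(c["protocol"] == protocol and c["address"] == address
               for c in person["contacts"])


flat = [p["name"] for p in DOC if flat_match(p, "twitter", "1234567890")]
scoped = [p["name"] for p in DOC if scoped_match(p, "twitter", "1234567890")]

print(flat)    # both people match
print(scoped)  # nobody matches
```

The flat version returns both names because every person has a twitter contact and, separately, a contact with that number. The scoped version returns an empty list because no single contact has both.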

Of course you can have many search terms:

$ cat example.json |
     jgrep "[contacts.protocol=twitter and contacts.address=1234567890] or name='R.I.Pienaar'" -s name
R.I.Pienaar

We can also construct entirely new documents:

$ cat example.json | jgrep "name='R.I.Pienaar'" -s "name contacts.address"
[
  {
    "name": "R.I.Pienaar",
    "contacts.address": [
      "ripienaar",
      "rip@devco.net",
      "1234567890"
    ]
  }
]
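
If it helps to picture how a dotted key like contacts.address turns into a flat list, here is a small Python sketch of the idea. This is my own illustration and not jgrep's code; pluck and select are hypothetical helper names.

```python
def pluck(value, path):
    """Follow a dotted path through nested dicts.

    When a list is met along the way, the rest of the path is applied
    to every item, which is what flattens contacts.address into a list.
    """
    if not path:
        return value
    if isinstance(value, list):
        return [pluck(item, path) for item in value]
    key, _, rest = path.partition(".")
    return pluck(value[key], rest)


def select(document, paths):
    """Build a new document keyed by the dotted paths that were asked for."""
    return {path: pluck(document, path) for path in paths}


person = {
    "name": "R.I.Pienaar",
    "contacts": [
        {"protocol": "twitter", "address": "ripienaar"},
        {"protocol": "email", "address": "rip@devco.net"},
        {"protocol": "msisdn", "address": "1234567890"},
    ],
}

result = select(person, ["name", "contacts.address"])
print(result)
```

The resulting dictionary has the same shape as the output above: a name key and a contacts.address key holding the three addresses.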

Real World

I am adding JSON output support to MCollective. Today I was rolling out a new Nagios check script to my nodes and wanted to be sure they all had it. I used the File Manager agent to fetch the stats for my file from all the machines, then printed the ones that didn’t match my expected MD5.

$ mco rpc filemgr status file=/.../check_puppet.rb -j | 
   jgrep 'data.md5!=a4fdf7a8cc756d0455357b37501c24b5' -s sender
box1.example.com

Eventually you will be able to pipe this output to mco again and call another agent. Here I take all the machines that don’t yet have the right file and trigger a Puppet run on them. This is very PowerShell-like and is the eventual use case I am building this for:

$ mco rpc filemgr status file=/.../check_puppet.rb -j | 
   jgrep 'data.md5!=a4fdf7a8cc756d0455357b37501c24b5' |
   mco rpc puppetd runonce

I also wanted to know the total size of a logfile across my web servers to be sure I would have enough space to copy them all:

$ mco rpc filemgr status file=/var/log/httpd/access_log -W /apache/ -j |
    jgrep -s "data.size"|
    awk '{ SUM += $1} END { print SUM/1024/1024 " MB"}'
2757.9093 MB
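
For anyone who doesn’t read awk, the last stage of that pipeline just adds up one number per line and divides by 1024 twice to get from bytes to MB. A Python sketch of the same arithmetic follows; the two sizes are invented sample values, not real output from my servers.

```python
# Hypothetical input: one size in bytes per line, the shape that the
# awk stage above receives from the previous command.
lines = "1048576\n2097152\n"

# Sum the values, then convert bytes -> KB -> MB.
total_bytes = sum(float(line) for line in lines.split())
total_mb = total_bytes / 1024 / 1024

print(f"{total_mb} MB")
```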

Now how about interacting with a webservice like the GitHub API:

$ curl -s http://github.com/api/v2/json/commits/list/puppetlabs/marionette-collective/master|
   jgrep --start commits "author.name='Pieter Loubser'" -s id
52470fee0b9fe14fb63aeb344099d0c74eaf7513

Here I fetched the most recent commits in the marionette-collective GitHub repository, searched for ones by Pieter and returned the IDs of those commits. The --start argument is needed because the top of the JSON returned is not the array we care about. It tells jgrep to take the commits key and grep that.
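
In Python terms the idea behind --start is roughly the following. The response below is a made-up stand-in with invented ids, not real GitHub output, and start_at is a hypothetical helper name.

```python
# Hypothetical response: the interesting array lives under a top-level key.
response = {
    "commits": [
        {"id": "aaa111", "author": {"name": "Pieter Loubser"}},
        {"id": "bbb222", "author": {"name": "Someone Else"}},
    ]
}


def start_at(document, key):
    """Step into the named key so the grep runs over the array found there."""
    return document[key]


# Equivalent in spirit to: --start commits "author.name='Pieter Loubser'" -s id
ids = [commit["id"]
       for commit in start_at(response, "commits")
       if commit["author"]["name"] == "Pieter Loubser"]

print(ids)
```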

Or since it’s Sysadmin Appreciation Day how about tweets about it:

$ curl -s "http://search.twitter.com/search.json?q=sysadminday"|
   jgrep --start results -s "text"
 
RT @RedHat_Training: Did you know that today is Systems Admin Day?  A big THANK YOU to all our system admins!  Here's to you!  http://t.co/ZQk8ifl
RT @SinnerBOFH: #BOFHers RT @linuxfoundation: Happy #SysAdmin Day! You know who you are, rock stars. http://t.co/kR0dhhc #linux
RT @google: Hey, sysadmins - thanks for all you do. May your pagers be silent and your users be clueful today! http://t.co/N2XzFgw
RT @google: Hey, sysadmins - thanks for all you do. May your pagers be silent and your users be clueful today! http://t.co/y9TbCqb #sysadminday
RT @mfujiwara: http://www.sysadminday.com/
RT @mitchjoel: It's SysAdmin Day! Have you hugged your SysAdmin today? Make sure all employees follow the rules: http://bit.ly/17m98z #humor
? @mfujiwara: http://www.sysadminday.com/

As before, we have to use --start to grep the results array that is contained inside the returned document.

I can also find all the restaurants near my village via SimpleGEO:

$ curl -x localhost:8001 -s "http://api.simplegeo.com/1.0/places/51.476959,0.006759.json?category=Restaurant"|
   jgrep --start features "properties.distance<2.0" -s "properties.address \
                                      properties.name \
                                      properties.postcode \
                                      properties.phone \
                                      properties.distance"
[
  {
    "properties.address": "15 Stratheden Road",
    "properties.distance": 0.773576114771768,
    "properties.phone": "+44 20 8858 8008",
    "properties.name": "The Lamplight",
    "properties.postcode": "SE3 7TH"
  },
  {
    "properties.address": "9 Stratheden Parade",
    "properties.distance": 0.870622234751732,
    "properties.phone": "+44 20 8858 0728",
    "properties.name": "Sun Ya",
    "properties.postcode": "SE3 7SX"
  }
]

There’s a lot more I didn’t show: it supports all the usual operators such as <= and a fair few other bits.

You can get this utility by installing the jgrep Ruby Gem or by grabbing the code from GitHub. The Gem is a library, so you can use these abilities in your Ruby programs, but it also includes the CLI tool shown here.

It’s pretty new code and we’d totally love feedback, bugs and ideas! Follow the author on Twitter at @pieterloubser and send him some appreciation too.