Debugging Chef

We have a pretty elaborate Chef setup where I work.  We use the open source Chef Server where the nodes get their configuration and recipes.  Everything is version controlled, automatically tested, and deployed via a controlled process.  But all those controls can slow down development.  Sometimes you just want to hack up an idea up on a real machine before investing the time to code up a solution, write tests, run style checks, version all the things, feed it through the CI pipeline, and ultimately deploy to tons of nodes.  (Since we have an simple re-imaging process, screwing up the node configuration via manual hacking is NBD.)

I have found chef-shell a great tool for doing this sort of thing.  It’s a simple matter to log in to a node and run chef-shell -z to start up an interactive ruby shell that loads (and runs) the recipes that the node would normally run when you do chef-client.  The nice part is you can tweak a recipe and re-run it.  For example, I had a recipe that was failing:

windows_zipfile "#{cache_path}/Downloads/drivers/usb3_#{node['nuc']['usb3_driver_version']}" do
 source node['nuc']['usb3_driver_url']
 action :unzip
 not_if { ::File.exist?("#{cache_path}/Downloads/drivers/usb3_#{node['nuc']['usb3_driver_version']}") }
end

package 'Intel(R) USB 3.0 eXtensible Host Controller Driver' do
 source "#{cache_path}/Downloads/drivers/usb3_#{node['nuc']['usb3_driver_version']}/Intel(R)_USB_3.0_eXtensible_Host_Controller_Driver_3.0.5.69/Setup.exe"
 action :install
 version '3.0.5.69'
end

The package install was returning ‘1’ instead of ‘0’, but providing no stdout/stderr.  I suspected this was a non-standard installation package, so I wanted to try it as a :custom package.  I simply had to enter chef-shell’s “recipe mode”, get a reference to the existing resource, modify it, and run it:

recipe_mode
x = resources("package[Intel(R) USB 3.0 eXtensible Host Controller Driver]")
x.installer_type = :custom
x.options = '-s'
x.run_action(:install)

This worked, although I found the return code was 14 (which means reboot required for this installer) so I went ahead and coded it up as:

package 'Intel(R) USB 3.0 eXtensible Host Controller Driver' do
 source "#{cache_path}/Downloads/drivers/usb3_#{node['nuc']['usb3_driver_version']}/Intel(R)_USB_3.0_eXtensible_Host_Controller_Driver_3.0.5.69/Setup.exe"
 action :install
 options '-s'
 installer_type :custom
 version '3.0.5.69'
 returns [0, 14]
end

References

Timberman, J. (2014, July 21). Load_current_resource and Chef-shell [Web log post]. Retrieved November 30, 2016, from http://jtimberman.housepub.org/blog/2014/07/21/load-current-resource-and-chef-shell/

Advertisements

Test Kitchen Ansible Serverspec Debug

I recently set up test kitchen in our Ansible repo, and saw that is passed the “swarm is running” spec even though swarm wasn’t running. Pretty disturbing behavior from a testing tool. This is the story of how I got to the bottom of it.

We use serverspec, an extension to rspec, which provides matchers for system configuration stuff. The spec in question is just:

describe service('swarm') do
 it { should be_running }
 end

I read through the serverspec source and found the commands it uses to verify services, but running those on the VM gave expected outputs, so it was time for some debugging.

First, I wanted to eliminate the Test Kitchen infrastructure and just run rspec directly on the VM. After converging the VM, I ran verify with debug output to see how it is calling rspec:

% kitchen verify --log-level debug
 ---clipped---
 cd /tmp/kitchen
 RSPEC_CMD=$(which rspec)
 echo $RSPEC_CMD
 sudo -E -H $RSPEC_CMD -c -f documentation --default-path /tmp/kitchen -P ./roles/swarm/spec/swarm_spec.rb
 ---clipped---

Running this while logged in to the VM duplicates the problem, so that’s good. Next, I added byebug to the Gemfile in /tmp/kitchen and ran bundle install.

Then I ran rspec pre-pending “byebug” so it will drop me into the byebug debugger:

$ sudo -E -H byebug $RSPEC_CMD -c -f documentation --default-path /tmp/kitchen -P ./roles/swarm/spec/swarm_spec.rb

From examining the serverspec code, I know that I’m looking for a call to the Service.running? method, so add a breakpoint for it:

(byebug) break Serverspec::Type::Service#running?

From here I can easily step into the code that’s checking the service:

[1, 10] in /Users/build/.rvm/gems/ruby-2.1.3/gems/specinfra-2.61.1/lib/specinfra/processor.rb
 1: module Specinfra
 2: class Processor
 3: def self.check_service_is_running(service)
 4: cmd = Specinfra.command.get(:check_service_is_running, service)
 5: ret = Specinfra.backend.run_command(cmd)
 => 6:

(byebug) cmd
 "launchctl list | grep swarm | grep -E '^[0-9]+'"

(byebug) ret
 #

Exit status 1 looks right: the grep doesn’t match anything. Ah, but there’s more: if the exit status is 1, it does a second check:

[13, 22] in /Users/build/.rvm/gems/ruby-2.1.3/gems/specinfra-2.61.1/lib/specinfra/processor.rb
 13: if ret.exit_status == 1
 14: cmd = Specinfra.command.get(:check_process_is_running, service)
 15: ret = Specinfra.backend.run_command(cmd)
 16: end
 17:
 => 18: ret.success?
 19: end
 20:

(byebug) cmd
 "ps aux | grep -w -- swarm | grep -qv grep"

(byebug) ret
 #

Exit status zero means the grep matched something — that the service is running. That’s not right. Removing the “q” from the last grep shows:

$ ps aux | grep -w -- swarm | grep -v grep
 root 72088 0.0 0.6 2557624 12944 s000 S+ 5:07PM 0:00.06 sudo -E -H byebug /usr/local/bin/rspec -c -f documentation --default-path /tmp/kitchen -P ./roles/swarm/spec/swarm_spec.rb

“swarm” matches the path and name of the spec file being executed. LOL.

So what’s the right fix here? It seems like pgrep might be a better choice than ps | grep in this case because we only want to search on the process name.

To confirm this works, I cloned the specinfra repo in the VM, made the change, and added this to the Gemfile in the VM:

path '/tmp/specinfra' do
 gem 'specinfra'
 end

And then built the gem with gem build specinfra.gemspec and installed it with gem install –local specinfra-2.61.1.gem (I could also use bundle exec in the rspec command-line instead of relying on globally installed gems, but this is a VM I’m going to throw away, so it doesn’t much matter). I then saw rspec fail the test locally (in the VM) as expected:

1) Service "swarm" should be running
 Failure/Error: it { should be_running }
 expected Service "swarm" to be running
 /bin/sh -c pgrep\ swarm

I then ran kitchen verify just for fun, expecting to see the spec fail, but was briefly confused when it passed. I finally realized that the ruby version in the terminal and via ssh were not the same. Ansible installed rvm when I converged the VM, modifying the PATH in ~/.bash_profile, so terminal logins got rvm, whereas ssh-based commands were still getting the default MacOS ruby 1.9 (and gems associated with that version). Once I realized this, I was able to install the patch for 1.9 and see the spec fail with kitchen verify as well.

Fun with Chef and Powershell

When writing configuration code, you only want to take action when necessary. Chef often takes care of this for you. For example, if you use the “package” resource to install a package, it will only install it if it is not already installed. Sometimes, though, you have to handle this logic yourself. This is what it means to make a resource declaration “idempotent”.

To support this, Chef provides guards: only_if and not_if which can be given a bit of ruby (block) or shell script (string). The latter is implemented using the shell_out mixin which knows how to deal with various malfunctioning Unix fork implementations, Ruby garbage collection bugs, reaping dead children, etc. If the shell command returns zero, it is interpreted as a Ruby True value, which controls the guard appropriately. This all works well and predictably in Mac/Linux.

Windows, however, is another beast which tripped me up in a subtle way. Hence this post.

First off, you can try to set guard_interpreter :powershell_script in your resource, and that will execute a guard string in powershell. In my case, I wanted to use select-string to replace grep but select-string does not set exit codes like grep does, so it seemed like a non-starter.

Then I learned about powershell_out, a mixin from the community powershell cookbook which mimics the behavior of shell_out on Windows. That was moved to the windows 3rd party cookbook, parts of which are now part of Chef core. Lots of docs say you have to include 3rd party cookbooks, or add special “Mixin” incantations to your recipe file. In fact, we should be able to simply do this:

package 'USB_3.0_Driver' do
 source "#{driver_exe}"
 action :install
 options '/s' # silent
 installer_type :custom
 only_if { powershell_out('driverquery | select-string "USB 3.0"').stdout.empty? }
 end

But this, in fact, will install the driver every time. The key to debugging this was to realize there is another version of powershell_out: powershell_out! which raises exceptions on error (Chef will silently suppress them otherwise). That clearly showed the problem; lots of these:

Select-String : The input object cannot be bound to any parameters for the command either because the command does not
 take pipeline input or the input and its properties do not match any of the parameters that take pipeline input.
 At line:1 char:28
 + driverquery | select-string <<<< USB 3.0
 + CategoryInfo : InvalidArgument: (WinUSB Wi...010 2:43:56 AM :PSObject) [Select-String], Parameter
 BindingException
 + FullyQualifiedErrorId : InputObjectNotBound,Microsoft.PowerShell.Commands.SelectStringCommand

followed by:

Ran powershell.exe -NoLogo -NonInteractive -NoProfile -ExecutionPolicy Unrestricted -InputFormat None -Command "driverquery | select-string "USB 3.0"" returned 1

Oh, snap! It’s a quoting issue. All I had to do was reverse the polarity of the quotes in the powershell_out command to fix the problem:

package 'USB_3.0_Driver' do
 source "#{driver_exe}"
 action :install
 options '/s' # silent
 installer_type :custom
 only_if { powershell_out("driverquery | select-string 'USB 3.0'").stdout.empty? }
 end

After thinking about this more, I realized that it should be possible to set the exit code using a powershell if statement, and sure enough, this also works:

package 'USB_3.0_Driver' do
 source "#{driver_exe}"
 action :install
 options '/s' # silent
 installer_type :custom
 guard_interpreter :powershell_script
 only_if "if ( $( driverquery ) -Match 'USB 3.0' ) {exit 1} else {exit 0}"
 end

And it also works if I put the quotes the other way around:

only_if 'if ( $( driverquery ) -Match "USB 3.0" ) {exit 1} else {exit 0}'

Given that more stable behavior around quotes, and not having to involve another method, I think I will prefer this to powershell_out.