Running Ansible as remote_user Requires Inventory

Maybe I just missed it. I was running a Jenkins job that triggered an Ansible role that pulled tar.gz files for several versions of my company’s software from a build server and deposited them in an NFS-shared directory on my Jenkins slave. The Jenkins slave was pulling double duty as my local NFS server for nightly builds. Before the nightly builds ran, this Jenkins job would ensure that my NFS server had the proper files staged and ready to go. Sounds easy, right?

Nope. We kept having permissions issues. The Ansible role we created had two tasks (sketched below):

  1. Clean out unnecessary directories (those for versions we were no longer supporting)
  2. Create and populate directories for the versions we were supporting
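
In sketch form, the role looked something like the following; the paths and variable names here are hypothetical, not the real role’s:

- name: Remove directories for versions we no longer support
  file:
    path: "/exports/builds/{{ item }}"
    state: absent
  with_items: "{{ retired_versions }}"

- name: Create directories for the versions we do support
  file:
    path: "/exports/builds/{{ item }}"
    state: directory
    mode: "0755"
  with_items: "{{ supported_versions }}"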

We were experiencing permissions errors doing both tasks. I’ll save you the gory details, but we tried everything. We deleted everything. We used chmod, chown, and chgrp to set directory and file modes and ownership. We changed the Jenkins user. I tried running the playbook with become: yes. I tried sprinkling the tasks with remote_user: root. Nothing worked. I ran the job dozens of times, tweaking one thing at a time. Yuk.

Then I noticed something in the 4x verbose output of the job:

20:09:33  ESTABLISH SSH CONNECTION FOR USER: fred

I had set remote_user: root. Hmm. I checked another job that wasn’t having this problem; sure enough, its connection user was root.

Here’s the difference: Playbook A, the one that was failing, didn’t use an inventory file because it always executed on localhost. Playbook B, by contrast, used an inventory file. When I switched Playbook A to use an inventory file, everything worked. Bottom line: use an inventory file when you need remote_user to take effect.
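
For illustration, here is a minimal sketch of the arrangement that worked. The group, host, and file names are examples, not the real job’s:

# inventory
[nfs_staging]
jenkins-slave.example.com

# site.yml
- hosts: nfs_staging
  remote_user: root
  roles:
    - stage_builds

Run with ansible-playbook -i inventory site.yml, and the ESTABLISH SSH CONNECTION line shows root instead of the Jenkins user.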

I suspect there may be a more elegant way to fix this, but in the fast-paced environment in which I work I am happy to have this solution.


Time-stamped Directory Name

One of my co-workers wrote an Ansible playbook that gathered and processed data from a number of nodes in our lab. There could be as many as 250 nodes in play. Here’s a high-level overview of the steps the playbook took:

  • Created a local temporary directory via local_action
  • Wrote intermediate files for each node to the local temp directory
  • Read collected intermediate files from the local temp directory
  • Deleted the local temp directory

Do you see the mistake? By default, Ansible parallelizes the operation across multiple nodes (up to its configured number of forks). The first node that finishes will, you guessed it, delete the temporary directory out from under the others. Oops.

Initial testing was done against a single node. When I added a second, BOOM. After looking at it, I decided I had the following viable options for a fix:

  1. Use serial: 1 in the playbook to prevent concurrent execution. This is undesirable because it would make a run against 250 nodes take *much* longer.
  2. Restructure the playbooks so that temp directory creation and deletion took place outside the data gathering. This would have been a lot of work *and* introduced dependencies between playbooks that I don’t like; using the same temp directory name in more than one playbook is one example of such coupling.
  3. Use a unique temp directory for each node.

Not very elegant, but the last option was simple and practical. A quick search yielded a code snippet similar to the following:

- name: Create a temporary directory name using timestamp
  set_fact:
    tmp_scripts_dir: "{{ playbook_dir }}/scripts/{{ lookup('pipe', 'date +%Y%m%d%H%M%S.%5N') }}/tmp"

Because set_fact runs once per host, this gives each node a temp directory name with sub-second precision (%5N keeps the first five digits of the nanoseconds field), fine enough to differentiate between nodes kicked off within the same second. I then used tmp_scripts_dir throughout the steps listed above.
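
Putting the fact to use looks something like this; the task names are mine, and the real playbook’s steps were more involved:

- name: Create this node's private temp directory on the control machine
  local_action:
    module: file
    path: "{{ tmp_scripts_dir }}"
    state: directory

# ... write and read intermediate files under tmp_scripts_dir ...

- name: Delete this node's private temp directory
  local_action:
    module: file
    path: "{{ tmp_scripts_dir }}"
    state: absent

Because each node’s tmp_scripts_dir is unique, the first node to finish deletes only its own directory.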


Custom Ansible filters: Easy solution to difficult problems

I have recently been using Ansible to automate health checks for some of our software-defined network (SDN) infrastructure. One of the devices my code must query for health is a soft router running the SROS operating system. Ansible 2.2 recently introduced an sros_command module that simplifies my task somewhat, but I’m still left to do screen-scraping of the command output.

Screen scraping is nasty work: lots of string processing with split(), strip(), and friends. The resulting code is heavily dependent on the exact format of the command output; if the format changes, the code breaks.

I initially implemented the screen-scraping with Jinja2 code in my playbooks. That put some pretty ugly, complex code right in the playbook. I found a better answer: create a custom filter or two. Now the playbooks themselves are *so much cleaner*, the format-dependent code is separated from the main code, and Python makes the string handling far easier to write.

The best part: Ansible filters are very easy to create. The Ansible docs aren’t much help, perhaps because creation is so simple they thought it needed no explanation! The best way to figure out how to create your own filters is to use an existing filter as a pattern; one of the simplest to study is Ansible’s own json_query. Here’s a stripped and simplified version of that code for the purpose of illustration. This code implements two trivial filters, my_to_upper and my_to_lower:

from ansible.errors import AnsibleError


def my_to_upper(string):
    ''' Given a string, return an all-uppercase version of the string.
    '''
    if string is None:
        raise AnsibleError('String not found')
    return string.upper()


def my_to_lower(string):
    ''' Given a string, return an all-lowercase version of the string.
    '''
    if string is None:
        raise AnsibleError('String not found')
    return string.lower()

class FilterModule(object):
    ''' Custom string-case filters '''

    def filters(self):
        return {
            'my_to_upper': my_to_upper,
            'my_to_lower': my_to_lower,
        }

Developing this code is as simple as creating the FilterModule class, returning a dictionary that names each custom filter, and providing a function for each one. Drop the file into a filter_plugins directory next to your playbook (or inside a role) and Ansible picks it up automatically. The example is trivial, but the filter functions can be as complex as your application requires.
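
The on-disk layout is minimal; for example (file names here are mine):

site.yml
filter_plugins/
    my_filters.py      # the FilterModule code shown above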

Note that I have included AnsibleError in the example for illustration because it is an extremely useful way to get errors all the way to the console. If I were *really* implementing these filters, a missing string wouldn’t be an error; I’d just return an empty string.

Here are a couple of simple examples of how to call the filters, along with the resulting output:

- name: Create a mixed-case string
  shell: echo "A Mixed String"
  register: mixed_string
  delegate_to: localhost

- name: Print the UPPERCASE string
  debug: msg="{{ mixed_string.stdout|my_to_upper }}"

- name: Print the LOWERCASE string
  debug: msg="{{ mixed_string.stdout|my_to_lower }}"

<snip...>

TASK [my_task : Create a mixed-case string] *********************************
changed: [host.example.com -> localhost]

TASK [my_task : Print the UPPERCASE string] *********************************
ok: [host.example.com] => {
 "msg": "A MIXED STRING"
}

TASK [my_task : Print the LOWERCASE string] *********************************
ok: [host.example.com] => {
 "msg": "a mixed string"
}

In my case, instead of my_to_upper and my_to_lower, I created *command*_to_json filters that convert SROS command output into JSON that is easily parsed in the playbook. This keeps my playbooks generic and makes the filters the one place where the nasty, format-dependent code lives.
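
As a flavor of what such a filter looks like, here is a hypothetical, cut-down version; the “Key : Value” output format is invented for illustration and is not real SROS output:

def show_system_to_json(output):
    ''' Convert "Key : Value" lines of command output into a dict,
        which Ansible renders as JSON in the playbook.
    '''
    result = {}
    for line in output.splitlines():
        key, sep, value = line.partition(':')
        if sep:
            result[key.strip()] = value.strip()
    return result

Registered in a FilterModule just like my_to_upper above, it lets a playbook write {{ some_command.stdout | show_system_to_json }}.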

Verbose Output from git

Here’s a simple trick that provides more verbose output when using git over HTTP(S):

GIT_CURL_VERBOSE=1 git clone https://github.com/repo/project.git

The GIT_CURL_VERBOSE=1 prefix is the key: it sets an environment variable for that one command, telling git’s HTTP transport to print curl’s debug output. That extra output provided the difference I needed to debug.

Before:

Cloning into 'project'...
fatal: unable to access 'https://github.com/repo/project.git/': server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none

After:

Cloning into 'project'...
* Couldn't find host github.com in the .netrc file; using defaults
* Hostname was NOT found in DNS cache
* Trying 192.30.253.113...
* Connected to github.com (192.30.253.113) port 443 (#0)
* found 173 certificates in /etc/ssl/certs/ca-certificates.crt
* server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
* Closing connection 0
fatal: unable to access 'https://github.com/repo/project.git/': server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none

Interesting!
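
If you want the extra verbosity for more than a single command, you can export the variable for the whole shell session instead of prefixing each command:

export GIT_CURL_VERBOSE=1
git clone https://github.com/repo/project.git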

Public key for epel-release-7-8.noarch.rpm is not installed

I was trying to run some Ansible playbooks on my CentOS 7 Linux machine. I hit a failure because the version of Ansible on the machine (1.9.4.0) was less than the minimum version required by the playbooks (2.1.0.0), and yum install saw 1.9.4.0 as the latest.

It turns out I needed to pull a version of Ansible from the EPEL repo rather than the default repo. yum repolist showed that the EPEL repo was already available on the machine, so I followed instructions I found on the Internet: yum install ansible-2.1.0.0

The package was found and downloaded, but before the installation completed it hit an error:

Public key for epel-release-7-8.noarch.rpm is not installed

There is a very simple fix for this, documented in several places:

rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY*
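
If you want to confirm which GPG keys rpm now knows about, you can list them:

rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n'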

I don’t know how the system got into this state. It is a lab machine that gets used for many different experiments. I was happy to have found a simple fix.

What if ansible_default_ipv4 is Empty?

A colleague was attempting to use Ansible to install Kubernetes, but he hit an error that confused him:

TASK [etcd : Write etcd config file] *******************************************

task path: /root/k8s-20160803-vishpat-contrib-git/contrib/ansible/roles/etcd/tasks/main.yml:23

fatal: [k8s-master.vnslab.net]: FAILED! => {"changed": false, "failed": true,
 "invocation": {"module_args": {"dest": "/etc/etcd/etcd.conf", "src": "etcd.conf.j2"},
 "module_name": "template"}, "msg": "AnsibleUndefinedVariable:
 {{ etcd_peer_url_scheme }}://{{ etcd_machine_address }}:{{ etcd_peer_port }}:
 {{ hostvars[inventory_hostname]['ansible_' + etcd_interface].ipv4.address }}:
 {{ ansible_default_ipv4.interface }}: 'dict object' has no attribute 'interface'"}

I asked him for a copy of the setup module (gather facts) output for the host in question:

ansible all -i 'your_host_name,' -m setup

This portion of the output jumped out at me:

<snip>
       },
        "ansible_default_ipv4": {},
        "ansible_default_ipv6": {},
        "ansible_devices": {
</snip>

ansible_default_ipv4 was empty. This was the root cause of the problem. When Ansible deploys the etcd template from roles/etcd/templates/etcd.conf.j2, it hits the following lines and attempts to substitute values for the variables:

<snip>
{% for host in groups[etcd_peers_group] -%}
{{ hostvars[host]['ansible_hostname'] }}={{ etcd_peer_url_scheme }}://{{ hostvars[host]['ansible_' + etcd_interface].ipv4.address }}:{{ etcd_peer_port }}{%- if not loop.last -%},{%- endif -%}
{%- endfor -%}
</snip>

And the definition of etcd_interface depends on ansible_default_ipv4 being populated. From roles/etcd/defaults/main.yaml: 

<snip>
# Interface on which etcd listens.
# Useful on systems when default interface is not connected to other machines,
# for example as in Vagrant+VirtualBox configuration.
# Note that this variable can't be set in per-host manner with current implementation.
etcd_interface: "{{ ansible_default_ipv4.interface }}"
</snip>

The result: when Ansible tries to deploy the etcd.conf template, it discovers that ansible_default_ipv4.interface doesn’t exist. It throws up its hands.

The fix: set up a default route on the host in question. Instructions can be found here:

http://linux-ip.net/html/basic-changing.html#basic-changing-default
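
For a quick lab fix, the command looks something like this; the gateway address and interface name here are examples, not the values from the host in question:

ip route add default via 192.0.2.1 dev eth0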

Once the change to the host was made, ansible_default_ipv4.interface was populated! Problem solved!

Configuring Go CD for passwordless ssh clone from GitHub

I recently installed Go CD (https://www.go.cd/) to do some CI/CD proof-of-concept work. Go CD is a continuous delivery server based on pipelines of work to be accomplished. It integrates with several SCM systems, including GitHub.

After the install, I was prompted by the Go CD GUI to create my first pipeline. Each pipeline has 3 main parts:

  • Basic Settings
  • Materials
  • Stage/Job

Basic Settings are simple: the name of the pipeline and the pipeline group it belongs to. It was in setting up the Materials that I ran into trouble.

The Materials page requires 3 pieces of information:

  • Material Type (GitHub)
  • URL (SSH clone URL)
  • Branch (master by default)

When I clicked “Check Connection”, I received the following error:

--- ERROR ---
STDERR: Permission denied (publickey).
STDERR: fatal: Could not read from remote repository.
STDERR: 
STDERR: Please make sure you have the correct access rights
STDERR: and the repository exists.
---

The problem is that the Go CD server runs as the “go” user on my Go CD master and slaves. I needed to add the “go” user’s public ssh key from each server (master and all slaves) to GitHub as a Deploy Key.

On each server (master and slaves), I did the following:

  • Log in to the server using ssh
  • Change to the “go” user via ‘sudo su go’
  • Create the “go” user’s key pair via ‘ssh-keygen’. I went with the default path and file name (‘/var/go/.ssh/id_rsa’) and no passphrase.
  • Copy the contents of ‘/var/go/.ssh/id_rsa.pub’ to GitHub as a new Deploy Key.

On each server (master and slaves), I tested the connection by executing a command-line git clone. In each case, I was prompted to permanently add the GitHub server to my list of known hosts.
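
The test amounted to something like the following; the repository URL is a placeholder:

sudo su go
git clone git@github.com:your-org/your-repo.git /tmp/clone-test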

Once this was complete, “Check Connection” passed with “OK”.