Using Cumulus NetQ fabric validation with Ansible

Here is a new post about Cumulus NetQ: I built a small Ansible playbook to validate the state of MLAG within a Cumulus Linux fabric using automation.

In this case I use the command “netq check clag json” to check for nodes in a failed or warning state. This can be used to validate the configuration after automated changes to MLAG, or as a pre-check before executing the main playbook.

---
- hosts: spine leaf
  gather_facts: False
  user: cumulus

  tasks:
     - name: Gather Clag info in JSON
       command: netq check clag json
       register: result
       run_once: true
       failed_when: "'ERROR' in result.stdout"

     - name: stdout string into json
       set_fact: json_output="{{result.stdout | from_json }}"
       run_once: true

     - name: output of json_output variable
       debug:
         var: json_output
       run_once: true

     - name: check failed clag members
       debug: msg="Check failed clag members"
       when: json_output["failedNodes"]|length == 0
       run_once: true

     - name: clag members status failed
       fail: msg="Device {{item['node']}}, Why node is in failed state? {{item['reason']}}"
       with_items:  "{{json_output['failedNodes']}}"
       run_once: true

     - name: clag members status warning
       fail: msg="Device {{item['node']}}, Why node is in warning state? {{item['reason']}}"
       when: json_output["warningNodes"] is defined
       with_items:  "{{json_output['warningNodes']}}"
       run_once: true

Here is the output when MLAG is healthy:

PLAY [spine leaf] *********************************************************************************************************************************************************************************************************************

TASK [Gather Clag info in JSON] *******************************************************************************************************************************************************************************************************
Friday 20 October 2017  17:56:35 +0200 (0:00:00.017)       0:00:00.017 ********
changed: [spine-1]

TASK [stdout string into json] ********************************************************************************************************************************************************************************************************
Friday 20 October 2017  17:56:35 +0200 (0:00:00.325)       0:00:00.343 ********
ok: [spine-1]

TASK [output of json_output variable] *************************************************************************************************************************************************************************************************
Friday 20 October 2017  17:56:35 +0200 (0:00:00.010)       0:00:00.353 ********
ok: [spine-1] => {
    "json_output": {
        "failedNodes": [],
        "summary": {
            "checkedNodeCount": 4,
            "failedNodeCount": 0,
            "warningNodeCount": 0
        }
    }
}

TASK [check failed clag members] ******************************************************************************************************************************************************************************************************
Friday 20 October 2017  17:56:35 +0200 (0:00:00.010)       0:00:00.363 ********
ok: [spine-1] => {
    "msg": "Check failed clag members"
}

TASK [clag members status failed] *****************************************************************************************************************************************************************************************************
Friday 20 October 2017  17:56:35 +0200 (0:00:00.011)       0:00:00.374 ********

TASK [clag members status warning] ****************************************************************************************************************************************************************************************************
Friday 20 October 2017  17:56:35 +0200 (0:00:00.007)       0:00:00.382 ********
skipping: [spine-1]

PLAY RECAP ****************************************************************************************************************************************************************************************************************************
spine-1                    : ok=4    changed=1    unreachable=0    failed=0

Friday 20 October 2017  17:56:35 +0200 (0:00:00.008)       0:00:00.391 ********
===============================================================================
Gather Clag info in JSON ------------------------------------------------ 0.33s
check failed clag members ----------------------------------------------- 0.01s
stdout string into json ------------------------------------------------- 0.01s
output of json_output variable ------------------------------------------ 0.01s
clag members status warning --------------------------------------------- 0.01s
clag members status failed ---------------------------------------------- 0.01s

In the following example the leaf-1 node is in a warning state because of a missing “clagd-backup-ip”; another possible warning is a singly attached bond interface:

PLAY [spine leaf] *********************************************************************************************************************************************************************************************************************

TASK [Gather Clag info in JSON] *******************************************************************************************************************************************************************************************************
Friday 20 October 2017  18:02:05 +0200 (0:00:00.016)       0:00:00.016 ********
changed: [spine-1]

TASK [stdout string into json] ********************************************************************************************************************************************************************************************************
Friday 20 October 2017  18:02:05 +0200 (0:00:00.225)       0:00:00.241 ********
ok: [spine-1]

TASK [output of json_output variable] *************************************************************************************************************************************************************************************************
Friday 20 October 2017  18:02:05 +0200 (0:00:00.010)       0:00:00.251 ********
ok: [spine-1] => {
    "json_output": {
        "failedNodes": [],
        "summary": {
            "checkedNodeCount": 4,
            "failedNodeCount": 0,
            "warningNodeCount": 1
        },
        "warningNodes": [
            {
                "node": "leaf-1",
                "reason": "Backup IP Failed"
            }
        ]
    }
}

TASK [check failed clag members] ******************************************************************************************************************************************************************************************************
Friday 20 October 2017  18:02:05 +0200 (0:00:00.010)       0:00:00.261 ********
ok: [spine-1] => {
    "msg": "Check failed clag members"
}

TASK [clag members status failed] *****************************************************************************************************************************************************************************************************
Friday 20 October 2017  18:02:05 +0200 (0:00:00.011)       0:00:00.273 ********

TASK [clag members status warning] ****************************************************************************************************************************************************************************************************
Friday 20 October 2017  18:02:05 +0200 (0:00:00.007)       0:00:00.281 ********
failed: [spine-1] (item={u'node': u'leaf-1', u'reason': u'Backup IP Failed'}) => {"failed": true, "item": {"node": "leaf-1", "reason": "Backup IP Failed"}, "msg": "Device leaf-1, Why node is in warning state? Backup IP Failed"}

NO MORE HOSTS LEFT ********************************************************************************************************************************************************************************************************************
	to retry, use: --limit @/home/berndonline/cumulus-lab-vagrant/netq_check_clag.retry

PLAY RECAP ****************************************************************************************************************************************************************************************************************************
spine-1                    : ok=4    changed=1    unreachable=0    failed=1

Friday 20 October 2017  18:02:05 +0200 (0:00:00.015)       0:00:00.297 ********
===============================================================================
Gather Clag info in JSON ------------------------------------------------ 0.23s
clag members status warning --------------------------------------------- 0.02s
check failed clag members ----------------------------------------------- 0.01s
output of json_output variable ------------------------------------------ 0.01s
stdout string into json ------------------------------------------------- 0.01s
clag members status failed ---------------------------------------------- 0.01s

In another example, NetQ reports that leaf-1 has no matching clagid on the peer; in this case the interface bond1 is missing from the configuration on leaf-2:

PLAY [spine leaf] ***********************************************************************************************************************************************************************************************************************

TASK [Gather Clag info in JSON] *********************************************************************************************************************************************************************************************************
Monday 23 October 2017  18:49:15 +0200 (0:00:00.016)       0:00:00.016 ********
changed: [spine-1]

TASK [stdout string into json] **********************************************************************************************************************************************************************************************************
Monday 23 October 2017  18:49:15 +0200 (0:00:00.223)       0:00:00.240 ********
ok: [spine-1]

TASK [output of json_output variable] ***************************************************************************************************************************************************************************************************
Monday 23 October 2017  18:49:15 +0200 (0:00:00.010)       0:00:00.250 ********
ok: [spine-1] => {
    "json_output": {
        "failedNodes": [
            {
                "node": "leaf-1",
                "reason": "Conflicted Bonds: bond1:matching clagid not configured on peer"
            }
        ],
        "summary": {
            "checkedNodeCount": 4,
            "failedNodeCount": 1,
            "warningNodeCount": 1
        },
        "warningNodes": [
            {
                "node": "leaf-1",
                "reason": "Singly Attached Bonds: bond1"
            }
        ]
    }
}

TASK [check failed clag members] ********************************************************************************************************************************************************************************************************
Monday 23 October 2017  18:49:15 +0200 (0:00:00.010)       0:00:00.260 ********
skipping: [spine-1]

TASK [clag members status failed] *******************************************************************************************************************************************************************************************************
Monday 23 October 2017  18:49:15 +0200 (0:00:00.009)       0:00:00.269 ********
failed: [spine-1] (item={u'node': u'leaf-1', u'reason': u'Conflicted Bonds: bond1:matching clagid not configured on peer'}) => {"failed": true, "item": {"node": "leaf-1", "reason": "Conflicted Bonds: bond1:matching clagid not configured on peer"}, "msg": "Device leaf-1, Why node is in failed state? Conflicted Bonds: bond1:matching clagid not configured on peer"}

NO MORE HOSTS LEFT **********************************************************************************************************************************************************************************************************************
	to retry, use: --limit @/home/berndonline/cumulus-lab-vagrant/netq_check_clag.retry

PLAY RECAP ******************************************************************************************************************************************************************************************************************************
spine-1                    : ok=3    changed=1    unreachable=0    failed=1

Monday 23 October 2017  18:49:15 +0200 (0:00:00.014)       0:00:00.284 ********
===============================================================================
Gather Clag info in JSON ------------------------------------------------ 0.22s
clag members status failed ---------------------------------------------- 0.02s
stdout string into json ------------------------------------------------- 0.01s
output of json_output variable ------------------------------------------ 0.01s
check failed clag members ----------------------------------------------- 0.01s

This is just an example to show the possibilities Cumulus NetQ opens up when you use automation to validate your changes.
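For instance, the check can gate a change: run the validation playbook first and only continue if it passes. A minimal sketch, assuming the playbook above is saved as netq_check_clag.yml and using a placeholder name for the main change playbook:

ansible-playbook netq_check_clag.yml && ansible-playbook mlag_change.yml

Because the validation playbook fails when NetQ reports failed or warning nodes, the change playbook only runs against a healthy fabric.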

There is more information in the Cumulus NetQ documentation about taking preventative steps with your network: https://docs.cumulusnetworks.com/display/NETQ/Taking+Preventative+Steps+with+Your+Network

Cumulus Linux Ethernet link-state monitoring using ifplugd

This blog post is about link-state monitoring under Cumulus Linux. Cumulus has no built-in tool for this and recommends using ifplugd. The tool has some similarities to Cisco’s IP SLA, which can track the state of interfaces.

The main reason to use ifplugd is for split-brain scenarios, when you lose the peerlink between a Cumulus Linux CLAG pair. If the peerlink goes down, the CLAG primary switch stays the active member and the secondary automatically disables all CLAG bonds, forcing the connected servers to fail over to the CLAG primary switch and keeping the network operational.

Very important: you need to configure clagd-backup-ip, because this is what allows the two Cumulus Linux switches to still communicate with each other if they lose the peerlink.

ifplugd is important for all connected servers which are not using CLAG bonds, basically servers using normal active/standby teaming that doesn’t require a CLAG bond configuration. These ports are configured as normal access ports, so a peerlink failure would normally leave them up if you don’t configure ifplugd.

ifplugd needs to be installed and configured on both switches running CLAG; follow the steps below.

Install ifplugd service:

sudo apt-get update
sudo apt-get install ifplugd

Edit the file /etc/default/ifplugd and add the lines below

The down delay is set to a moderate 10 seconds (-d10) because of the combination with CLAG; I need to observe this and may lower the value over time.

INTERFACES="peerlink"
HOTPLUG_INTERFACES=""
ARGS="-q -f -u0 -d10 -w -I"
SUSPEND_ACTION="stop"

Edit the file /etc/ifplugd/action.d/ifupdown

The variable $SWITCHPORTS defines which ports ifplugd should shut down if the peerlink goes down. We decided to use a custom variable instead of shutting down all ports, because CLAG already takes care of the configured bonds.

#!/bin/sh

# The peerlink bond interface
PEERLINK=peerlink

# The switchports to bring down on peerlink failure
#
# enclosures 01/02: swp5..swp8
SWITCHPORTS=$(seq -f swp%g 5 8)
# storage system 01/02: swp19..swp22
SWITCHPORTS="$SWITCHPORTS $(seq -f swp%g 19 22)"
# server1/server2: swp27..swp28
SWITCHPORTS="$SWITCHPORTS $(seq -f swp%g 27 28)"
# VMware cluster: swp35..swp38
SWITCHPORTS="$SWITCHPORTS $(seq -f swp%g 35 38)"

case "$1" in

    "$PEERLINK")
        # Determine the local CLAG role; only the secondary brings the ports down/up
        clagrole=$(clagctl | grep "Our Priority" | awk '{print $8}')
        case "$2" in
            up | down)
                action=$2
                if [ "$clagrole" = "secondary" ]; then
                    for interface in $SWITCHPORTS; do
                        echo "bringing $action : $interface"
                        ip link set $interface $action
                    done
                fi
                ;;
        esac
        ;;

esac

Start the ifplugd service:

sudo systemctl restart ifplugd.service
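
To confirm the daemon is running with the new configuration, the usual systemd status check can be used:

sudo systemctl status ifplugd.service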

Impact of a simulated peerlink failure from the server perspective:

2017-09-19T11:43:15.665057+00:00 leaf-01-c ifplugd(peerlink)[5292]: Link beat lost.
2017-09-19T11:43:25.775585+00:00 leaf-01-c ifplugd(peerlink)[5292]: Executing '/etc/ifplugd/ifplugd.action peerlink down'.
2017-09-19T11:43:25.902637+00:00 leaf-01-c ifplugd(peerlink)[5292]: Program executed successfully.
root@leaf-01-c:/home/cumulus# 

root@leaf-02-c:/home/cumulus# grep ifplugd /var/log/syslog
[...]
2017-09-19T11:43:15.780727+00:00 leaf-02-c ifplugd(peerlink)[12600]: Link beat lost.
2017-09-19T11:43:25.891584+00:00 leaf-02-c ifplugd(peerlink)[12600]: Executing '/etc/ifplugd/ifplugd.action peerlink down'.
2017-09-19T11:43:26.107140+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp5
2017-09-19T11:43:26.146421+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp6
2017-09-19T11:43:26.171454+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp7
2017-09-19T11:43:26.193387+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp8
64 bytes from 8.8.8.8: icmp_seq=1623 ttl=59 time=0.524 ms
64 bytes from 8.8.8.8: icmp_seq=1624 ttl=59 time=0.782 ms
64 bytes from 8.8.8.8: icmp_seq=1625 ttl=59 time=0.847 ms
Request timeout for icmp_seq 1626
Request timeout for icmp_seq 1627
Request timeout for icmp_seq 1628
Request timeout for icmp_seq 1629
Request timeout for icmp_seq 1630
Request timeout for icmp_seq 1631
Request timeout for icmp_seq 1632
Request timeout for icmp_seq 1633
Request timeout for icmp_seq 1634
Request timeout for icmp_seq 1635
Request timeout for icmp_seq 1636
Request timeout for icmp_seq 1637
Request timeout for icmp_seq 1638
64 bytes from 8.8.8.8: icmp_seq=1639 ttl=59 time=0.701 ms
64 bytes from 8.8.8.8: icmp_seq=1640 ttl=59 time=0.708 ms
64 bytes from 8.8.8.8: icmp_seq=1641 ttl=59 time=0.780 ms
64 bytes from 8.8.8.8: icmp_seq=1642 ttl=59 time=0.781 ms

Impact of reconnecting the peerlink from the server perspective:

root@leaf-01-c:/home/cumulus# grep ifplugd /var/log/syslog
[...]
2017-09-19T11:48:22.190187+00:00 leaf-01-c ifplugd(peerlink)[5292]: Link beat detected.
2017-09-19T11:48:22.290481+00:00 leaf-01-c ifplugd(peerlink)[5292]: Executing '/etc/ifplugd/ifplugd.action peerlink up'.
2017-09-19T11:48:22.524673+00:00 leaf-01-c ifplugd(peerlink)[5292]: Program executed successfully.

root@leafsw-f24-02-c:/home/cumulus# grep ifplugd /var/log/syslog
[...]
2017-09-19T11:48:22.084477+00:00 leaf-02-c ifplugd(peerlink)[12600]: Link beat detected.
2017-09-19T11:48:22.232192+00:00 leaf-02-c ifplugd(peerlink)[12600]: Executing '/etc/ifplugd/ifplugd.action peerlink up'.
2017-09-19T11:48:22.812771+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp5
2017-09-19T11:48:22.816175+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp6
2017-09-19T11:48:22.831487+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp7
2017-09-19T11:48:22.836617+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp8
64 bytes from 8.8.8.8: icmp_seq=24 ttl=59 time=0.614 ms
64 bytes from 8.8.8.8: icmp_seq=25 ttl=59 time=0.680 ms
64 bytes from 8.8.8.8: icmp_seq=26 ttl=59 time=8.932 ms
64 bytes from 8.8.8.8: icmp_seq=27 ttl=59 time=1.126 ms
64 bytes from 8.8.8.8: icmp_seq=28 ttl=59 time=2.424 ms
Request timeout for icmp_seq 29
Request timeout for icmp_seq 30
Request timeout for icmp_seq 31
Request timeout for icmp_seq 32
Request timeout for icmp_seq 33
Request timeout for icmp_seq 34
Request timeout for icmp_seq 35
64 bytes from 8.8.8.8: icmp_seq=36 ttl=59 time=6.491 ms
64 bytes from 8.8.8.8: icmp_seq=37 ttl=59 time=1.045 ms
64 bytes from 8.8.8.8: icmp_seq=38 ttl=59 time=1.244 ms

Yes, it takes a few seconds for your servers to reconnect after a peerlink failure, but this behaviour is very important to keep the datacenter network operational.

For more information have a look at the Cumulus Linux documentation: https://docs.cumulusnetworks.com/display/DOCS/ifplugd

Cumulus Linux non-disruptive upgrade procedure on MLAG pairs

I thought it would be useful to document the exact procedure for a non-disruptive upgrade on Cumulus Linux MLAG (CLAG) pairs. I find the online documentation, Upgrading Cumulus Linux, a bit short when it comes to CLAG and in what order you have to upgrade the switches for minimal disruption of traffic.

The following procedure worked for me on Dell S4048-ON and Dell S3048-ON switches:

  • On both switches, run the following command to refresh the package index of the apt repository:
sudo apt-get update
  • Run the following command to determine which switch is the CLAG primary and which is the CLAG secondary:
sudo net show clag
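
If you want to read the CLAG role in a script, the same clagctl parsing used in the ifplugd action script earlier works as a small sketch:

clagctl | grep "Our Priority" | awk '{print $8}'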

Start upgrading the secondary CLAG member:

  • Shut down all interfaces except the peerlink using the command below. This forces all traffic through the other switch:
echo swp{1..52} | tr ' ' '\n' | sudo xargs -i ip link set {} down
64 bytes from 8.8.8.8: icmp_seq=8903 ttl=59 time=1.106 ms
64 bytes from 8.8.8.8: icmp_seq=8904 ttl=59 time=0.974 ms
64 bytes from 8.8.8.8: icmp_seq=8905 ttl=59 time=1.643 ms
64 bytes from 8.8.8.8: icmp_seq=8906 ttl=59 time=0.869 ms
Request timeout for icmp_seq 8907
64 bytes from 8.8.8.8: icmp_seq=8908 ttl=59 time=1.256 ms
64 bytes from 8.8.8.8: icmp_seq=8909 ttl=59 time=0.769 ms

(Rollback) If problems are seen, revert the change with the command shown:

echo swp{1..52} | tr ' ' '\n' | sudo xargs -i ip link set {} up

Wait one minute for CLAG to stabilise and verify network communication with the remaining switch.

  • Perform a clean shutdown of clagd on this switch:
sudo systemctl stop clagd

(Rollback) If you see problems start clagd again:

sudo systemctl start clagd

Wait one minute for CLAG to cleanly shut down.

  • Shut down the peerlink bond:
sudo ip link set peerlink down

(Rollback) If you see problems enable peerlink again:

sudo ip link set peerlink up
  • Perform the upgrade using the command:
sudo apt-get upgrade

The reason it is important to do a clean shutdown of all the ports is that the bridge and peerlink bounce during the package upgrade, which could affect network communication if it happens uncontrolled.

  • Reboot the switch using the command:
sudo reboot

Wait for the upgraded switch to come up. This will cause a short outage in traffic.

64 bytes from 8.8.8.8: icmp_seq=9443 ttl=59 time=1.069 ms
64 bytes from 8.8.8.8: icmp_seq=9444 ttl=59 time=1.150 ms
64 bytes from 8.8.8.8: icmp_seq=9445 ttl=59 time=0.993 ms
64 bytes from 8.8.8.8: icmp_seq=9446 ttl=59 time=1.331 ms
Request timeout for icmp_seq 9447
Request timeout for icmp_seq 9448
Request timeout for icmp_seq 9449
64 bytes from 8.8.8.8: icmp_seq=9450 ttl=59 time=1.539 ms
64 bytes from 8.8.8.8: icmp_seq=9451 ttl=59 time=0.908 ms
64 bytes from 8.8.8.8: icmp_seq=9452 ttl=59 time=1.166 ms
64 bytes from 8.8.8.8: icmp_seq=9453 ttl=59 time=1.261 ms

Wait until the network is functioning normally again.

On the secondary, run the following command to take over the primary role:

sudo clagctl priority 0

Wait one minute for CLAG to fail over.

Verify that the CLAG handover has occurred:

sudo net show clag

Repeat the steps on the new secondary (the old primary), starting with shutting down all interfaces.

Once finished, you need to reset the CLAG priority on the primary back to its configured default value.
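
As a sketch, assuming the pair was left on the clagd default priority of 32768 (use whatever value you actually had configured):

sudo clagctl priority 32768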

Read my next post on how to roll back if an upgrade fails: Cumulus Linux Snapshot Rollback

Continuous Integration and Delivery for Networking with Cumulus Linux

Continuous Integration – Continuous Delivery (CI/CD) is becoming more and more popular for network automation, but the problem is how to validate your scripts and stage the configuration, because you don’t want to deploy untested code to a production system. Especially in networking, a mistake could be pretty destructive and cause a loss of connectivity.

I spent some days working on a Cumulus Linux lab using Vagrant which I use to stage configuration. You can find the basic Ansible playbook and the gitlab-ci configuration for the Cumulus lab in my Github repo: cumulus-lab-provision

For the continuous integration and delivery (CI/CD) pipeline I am using Gitlab.com and their Gitlab-runner which is running on my server. I will not go into too much detail about what is needed on the server; basically it runs Vagrant, libvirt (KVM), VirtualBox, Ansible and the Gitlab-runner.

  • You need to register your Gitlab-runner with the Gitlab repository.
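
Registration is done with the gitlab-runner CLI, roughly like the sketch below; the registration token and description are placeholders for your own values:

sudo gitlab-runner register --url https://gitlab.com/ --registration-token <TOKEN> --executor shell --description cumulus-lab-runner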

  • The next step is to create your .gitlab-ci.yml which defines your CI-pipeline.
---
stages:
    - validate ansible
    - staging
    - production
validate:
    stage: validate ansible
    script:
        - bash ./linter.sh
staging:
    before_script:
        - git clone https://github.com/berndonline/cumulus-lab-vagrant.git
        - cd cumulus-lab-vagrant/
        - python ./topology_converter.py ./topology-staging.dot
          -p libvirt --ansible-hostfile
    stage: staging
    script:
        - bash ../staging.sh
production:
    before_script:
        - git clone https://github.com/berndonline/cumulus-lab-vagrant.git
        - cd cumulus-lab-vagrant/
        - python ./topology_converter.py ./topology-production.dot
          -p libvirt --ansible-hostfile
    stage: production
    when: manual
    script:
        - bash ../production.sh
    only:
        - master

In the gitlab-ci you can see that I clone the Cumulus Vagrant lab, which I use to spin up a virtual staging environment and run the Ansible playbook against the virtual lab. In my example the production stage is also a Vagrant environment, because I had no physical switches for testing.
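To give an idea of what such a stage script does, here is a minimal sketch of a staging.sh (the inventory and playbook names are placeholders; the real script in my repo may differ): bring up the virtual lab, run the playbook against it, and destroy the lab again afterwards.

#!/bin/bash
set -e
vagrant up                                    # boot the virtual staging lab
ansible-playbook -i ./hosts ../playbook.yml   # placeholder inventory and playbook names
vagrant destroy -f                            # tear the lab down again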

  • Basically any commit or merge in the Gitlab repo triggers the pipeline which I define in the gitlab-ci.

  • You can see the details in the running job. The first stage only validates that the YAML files have the correct syntax, as sketched below.
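
A sketch of what such a check could look like (the real linter.sh in the repo may differ), running Ansible’s built-in syntax check over the playbooks in the repo root:

#!/bin/bash
set -e
for playbook in ./*.yml; do
    ansible-playbook --syntax-check "$playbook"
done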

  • Here are the details of the running staging job; when everything goes well, the job succeeds.

  • The last stage is production which needs to be triggered manually.

  • After the changes run through all defined stages, you can see that you successfully validated, staged and deployed your configuration to a Cumulus production system.

This is a completely different way of working for a network engineer, but it is the way things are going in fully automated datacenter network environments. It gets very powerful when you combine this with the Cumulus NetQ server to validate the state of your switch fabric after you run changes in production.

The next topic I am working on is using Cumulus NetQ to validate configuration changes.

Here again are the two repositories I use:

https://github.com/berndonline/cumulus-lab-vagrant

https://github.com/berndonline/cumulus-lab-provision

Read my new posts about an Ansible Playbook for Cumulus Linux BGP IP-Fabric and Cumulus NetQ Validation and BGP EVPN and VXLAN with Cumulus Linux.

Cumulus Linux network simulation using Vagrant

I was using GNS3 for quite some time, but it was not very flexible if you quickly wanted to test something, and it got even more complicated if you used a different computer or shared your projects.

I spent some time with Vagrant to build a virtual Cumulus Linux lab environment which can run on basically any computer. Simulating network environments is the future when you want to test and validate your automation scripts.

My lab diagram:

I created different topology.dot files and used the Cumulus topology converter from Github to create my lab with VirtualBox or libvirt (KVM). I made some modifications to the initialise scripts for the switches and the management server. You can find everything in my Github repo https://github.com/berndonline/cumulus-lab-vagrant.

The topology file basically defines your network and the converter creates the Vagrantfile.

In the management topology file you have all servers (incl. management) as in the network diagram above. The Cumulus switches can only be accessed via the management server.

This is very similar to the topology-mgmt.dot, but in this one the management server is running Cumulus NetQ, which you first need to import into Vagrant. Here is the link to the Cumulus NetQ demo on Github.

In this topology file you find a basic staging lab without servers where you can access the Cumulus switches directly via their Vagrant IP. I mainly use this to quickly test something like updating Cumulus switches or validating Ansible playbooks.

In this topology file you find a basic production lab where you can access the Cumulus switches directly via their Vagrant IP and have Cumulus NetQ as management server.

Basically to convert a topology into a Vagrantfile you just need to run the following command:

python topology_converter.py topology-staging.dot -p libvirt --ansible-hostfile

I use KVM in my example and want Vagrant to create an Ansible inventory file so I can run playbooks directly against the switches.
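
Running a playbook against the lab could then look like the sketch below; treat the inventory and playbook names as placeholders, since the exact hostfile location depends on how the topology converter generated it:

ansible-playbook -i ./ansible_hostfile ./playbook.yml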

Check the status of the vagrant environment:

berndonline@lab:~/cumulus-lab-vagrant$ vagrant status
Current machine states:

spine-1                   not created (libvirt)
spine-2                   not created (libvirt)
leaf-1                    not created (libvirt)
leaf-3                    not created (libvirt)
leaf-2                    not created (libvirt)
leaf-4                    not created (libvirt)
mgmt-1                    not created (libvirt)
edge-2                    not created (libvirt)
edge-1                    not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
berndonline@lab:~/cumulus-lab-vagrant$

To start the devices run:

vagrant up
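
You can also bring up individual machines by naming them, using the machine names shown by vagrant status, for example:

vagrant up spine-1 spine-2

This selective boot is also how the management topologies below are handled.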

If you use the topology files with a management server, you need to start the management server first and then the management switch before you boot the rest of the switches:

vagrant up mgmt-server
vagrant up mgmt-1
vagrant up

The switches will pull some part of their configuration from the management server.

Output when you start the environment:

berndonline@lab:~/cumulus-lab-vagrant$ vagrant up spine-1
Bringing machine 'spine-1' up with 'libvirt' provider...
==> spine-1: Creating image (snapshot of base box volume).
==> spine-1: Creating domain with the following settings...
==> spine-1:  -- Name:              cumulus-lab-vagrant_spine-1
==> spine-1:  -- Domain type:       kvm
==> spine-1:  -- Cpus:              1
==> spine-1:  -- Feature:           acpi
==> spine-1:  -- Feature:           apic
==> spine-1:  -- Feature:           pae
==> spine-1:  -- Memory:            512M
==> spine-1:  -- Management MAC:
==> spine-1:  -- Loader:
==> spine-1:  -- Base box:          CumulusCommunity/cumulus-vx
==> spine-1:  -- Storage pool:      default
==> spine-1:  -- Image:             /var/lib/libvirt/images/cumulus-lab-vagrant_spine-1.img (4G)
==> spine-1:  -- Volume Cache:      default
==> spine-1:  -- Kernel:
==> spine-1:  -- Initrd:
==> spine-1:  -- Graphics Type:     vnc
==> spine-1:  -- Graphics Port:     5900
==> spine-1:  -- Graphics IP:       127.0.0.1
==> spine-1:  -- Graphics Password: Not defined
==> spine-1:  -- Video Type:        cirrus
==> spine-1:  -- Video VRAM:        9216
==> spine-1:  -- Sound Type:
==> spine-1:  -- Keymap:            en-us
==> spine-1:  -- TPM Path:
==> spine-1:  -- INPUT:             type=mouse, bus=ps2
==> spine-1: Creating shared folders metadata...
==> spine-1: Starting domain.
==> spine-1: Waiting for domain to get an IP address...
==> spine-1: Waiting for SSH to become available...
    spine-1:
    spine-1: Vagrant insecure key detected. Vagrant will automatically replace
    spine-1: this with a newly generated keypair for better security.
    spine-1:
    spine-1: Inserting generated public key within guest...
    spine-1: Removing insecure key from the guest if it's present...
    spine-1: Key inserted! Disconnecting and reconnecting using new SSH key...
==> spine-1: Setting hostname...
==> spine-1: Configuring and enabling network interfaces...
....
==> spine-1: #################################
==> spine-1:   Running Switch Post Config (config_vagrant_switch.sh)
==> spine-1: #################################
==> spine-1:  ###Creating SSH keys for cumulus user ###
==> spine-1: #################################
==> spine-1:    Finished
==> spine-1: #################################
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1:   INFO: Adding UDEV Rule: a0:00:00:00:00:21 --> eth0
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1:   INFO: Adding UDEV Rule: 44:38:39:00:00:30 --> swp1
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1:   INFO: Adding UDEV Rule: 44:38:39:00:00:04 --> swp2
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1:   INFO: Adding UDEV Rule: 44:38:39:00:00:26 --> swp3
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1:   INFO: Adding UDEV Rule: 44:38:39:00:00:0a --> swp4
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1:   INFO: Adding UDEV Rule: 44:38:39:00:00:22 --> swp51
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1:   INFO: Adding UDEV Rule: 44:38:39:00:00:0d --> swp52
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1:   INFO: Adding UDEV Rule: 44:38:39:00:00:10 --> swp53
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1:   INFO: Adding UDEV Rule: 44:38:39:00:00:23 --> swp54
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1:   INFO: Adding UDEV Rule: Vagrant interface = eth1
==> spine-1: #### UDEV Rules (/etc/udev/rules.d/70-persistent-net.rules) ####
==> spine-1: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="a0:00:00:00:00:21", NAME="eth0", SUBSYSTEMS=="pci"
==> spine-1: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:30", NAME="swp1", SUBSYSTEMS=="pci"
==> spine-1: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:04", NAME="swp2", SUBSYSTEMS=="pci"
==> spine-1: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:26", NAME="swp3", SUBSYSTEMS=="pci"
==> spine-1: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:0a", NAME="swp4", SUBSYSTEMS=="pci"
==> spine-1: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:22", NAME="swp51", SUBSYSTEMS=="pci"
==> spine-1: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:0d", NAME="swp52", SUBSYSTEMS=="pci"
==> spine-1: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:10", NAME="swp53", SUBSYSTEMS=="pci"
==> spine-1: ACTION=="add", SUBSYSTEM=="net", ATTR{address}=="44:38:39:00:00:23", NAME="swp54", SUBSYSTEMS=="pci"
==> spine-1: ACTION=="add", SUBSYSTEM=="net", ATTR{ifindex}=="2", NAME="eth1", SUBSYSTEMS=="pci"
==> spine-1: Running provisioner: shell...
    spine-1: Running: inline script
==> spine-1: ### RUNNING CUMULUS EXTRA CONFIG ###
==> spine-1:   INFO: Detected a 3.x Based Release
==> spine-1: ### Disabling default remap on Cumulus VX...
==> spine-1: ### Disabling ZTP service...
==> spine-1: Removed symlink /etc/systemd/system/multi-user.target.wants/ztp.service.
==> spine-1: ### Resetting ZTP to work next boot...
==> spine-1: Created symlink from /etc/systemd/system/multi-user.target.wants/ztp.service to /lib/systemd/system/ztp.service.
==> spine-1:   INFO: Detected Cumulus Linux v3.3.2 Release
==> spine-1: ### Fixing ONIE DHCP to avoid Vagrant Interface ###
==> spine-1:      Note: Installing from ONIE will undo these changes.
==> spine-1: ### Giving Vagrant User Ability to Run NCLU Commands ###
==> spine-1: ### DONE ###
==> spine-1: ### Rebooting Device to Apply Remap...

At the end you are able to connect to the Cumulus switch:

berndonline@lab:~/cumulus-lab-vagrant$ vagrant ssh spine-1

Welcome to Cumulus VX (TM)

Cumulus VX (TM) is a community supported virtual appliance designed for
experiencing, testing and prototyping Cumulus Networks' latest technology.
For any questions or technical support, visit our community site at:
http://community.cumulusnetworks.com

The registered trademark Linux (R) is used pursuant to a sublicense from LMI,
the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide
basis.
vagrant@cumulus:~$

To destroy the Vagrant environment:

berndonline@lab:~/cumulus-lab-vagrant$ vagrant destroy spine-1
==> spine-2: Remove stale volume...
==> spine-2: Domain is not created. Please run `vagrant up` first.
==> spine-1: Removing domain...

My goal is to adopt some NetDevOps practices and apply them to networking (NetOps); I am currently working on a Continuous Integration and Delivery (CI/CD) pipeline for Cumulus Linux network environments. The Vagrant lab was one of the prerequisites to simulate the changes before deploying them to production, but more will follow in my next blog post.

Read my new post about an Ansible Playbook for Cumulus Linux BGP IP-Fabric and Cumulus NetQ Validation.