Cumulus Networks Case Study

Cumulus Networks published a new case study about my work with them on my recent datacenter network rebuild using Cumulus Linux on Dell Open Networking switches and about how we have used Cumulus NetQ as fabric validation system.

Have a look and read the case study here:

Link: https://cumulusnetworks.com/customers/smartgames-technologies/

Cumulus also published a press release a few weeks ago with one of my quotes about NetQ I made when we were working on the case study.

“With NetQ, we can run small check commands and see what really is going on in our network,” said Bernd Malmqvist, Tech Lead Systems Operations at SmartGames Technologies. “The benefits to us are early alerting and validating the entire state of the fabric. Monitoring is one thing, but with NetQ, the knowledge is instant. NetQ is really unique; it’s a tool that tells us exactly what is wrong in our environment, and the insight to know where an issue is stemming from.”

You can read the full press release read here:

https://cumulusnetworks.com/about/press-releases/cumulus-networks-bolsters-cumulus-netq-kubernetes-integration-provide-network-operators-actionable-insight-container-networking/

Internet Edge and WAN Routing with Cumulus Linux

With this article I wanted to focus on something different than the usual spine and leaf topology and talk about datacenter edge routing.

I was using Cisco routers for many years for Internet Edge and WAN connectivity. The problem with using a vendor like Cisco is the price tag you have to pay and there still might a reason for it to spend the money. But nowadays you get leased-lines handed over as normal Ethernet connection and using a dedicated routers maybe not always necessary if you are not getting too crazy with BGP routing or quality of service.

I was experimenting over the last weeks if I could use a Cumulus Linux switch as an Internet Edge and Wide Area Network router with running different VRFs for internet and WAN connectivity. I came up with the following edge network layout you see below:

For this network, I build an Vagrant topology with Cumulus VX to simulate the edge routing and being able to test the connectivity. Below you see a more detailed view of the Vagrant topology:

Everything is running on Cumulus VX even the firewalls because I just wanted to simulate the traffic flow and see if the network communication is functioning. Also having separate WAN switches might be useful because 1Gbit/s switches are cheaper then 40Gbit/s switches and you need additional SFP for 1Gbit/s connections, another point is to separate your layer 2 WAN connectivity from your internal datacenter network.

Here the assigned IP addresses for this lab:

wan-1 VLAN801 PIP: 217.0.1.2/29 VIP: 217.0.1.1/29
wan-2 VLAN801 PIP: 217.0.1.3/29 VIP: 217.0.1.1/29
wan-1 VLAN802 PIP: 10.100.0.1/29 
wan-2 VLAN802 PIP: 10.100.0.2/29
wan-1 VLAN904 PIP: 217.0.0.2/28 VIP: 217.0.0.1/28
wan-2 VLAN904 PIP: 217.0.0.3/28 VIP: 217.0.0.1/28
fw-1 VLAN904 PIP: 217.0.0.14/28
wan-1 VLAN903 PIP: 10.0.255.34/28 VIP: 10.0.255.33/28
wan-2 VLAN903 PIP: 10.0.255.35/28 VIP: 10.0.255.33/28
fw-2 VLAN903 PIP: 10.0.255.46/28
edge-1 VLAN901 PIP: 10.0.255.2/28 VIP: 10.0.255.1/28
edge-2 VLAN901 PIP: 10.0.255.3/28 VIP: 10.0.255.1/28
fw-1 VLAN901 PIP: 10.0.255.14/28
fw-2 VLAN901 PIP: 10.0.255.12/28
edge-1 VLAN902 PIP: 10.0.255.18/28 VIP: 10.0.255.17/28
edge-2 VLAN902 PIP: 10.0.255.19/28 VIP: 10.0.255.17/28
fw-1 VLAN902 PIP: 10.0.255.30/28

You can find the Github repository for the Vagrant topology here: https://github.com/berndonline/cumulus-edge-vagrant

[email protected]:~/cumulus-edge-vagrant$ vagrant status
Current machine states:

fw-2                      running (libvirt)
fw-1                      running (libvirt)
mgmt-1                    running (libvirt)
edge-2                    running (libvirt)
edge-1                    running (libvirt)
wan-1                     running (libvirt)
wan-2                     running (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
[email protected]:~/cumulus-edge-vagrant$

I wrote as well an Ansible Playbook to deploy the initial configuration which you can find here: https://github.com/berndonline/cumulus-edge-provision

Let’s execute the playbook:

[email protected]:~/cumulus-edge-vagrant$ ansible-playbook ../cumulus-edge-provision/site.yml

PLAY [edge] ********************************************************************************************************************************************************

TASK [switchgroups : create switch groups based on clag_pairs] *****************************************************************************************************
skipping: [edge-2] => (item=(u'wan', [u'wan-1', u'wan-2']))
skipping: [edge-1] => (item=(u'wan', [u'wan-1', u'wan-2']))
ok: [edge-2] => (item=(u'edge', [u'edge-1', u'edge-2']))
ok: [wan-1] => (item=(u'wan', [u'wan-1', u'wan-2']))
skipping: [wan-1] => (item=(u'edge', [u'edge-1', u'edge-2']))
ok: [edge-1] => (item=(u'edge', [u'edge-1', u'edge-2']))
ok: [wan-2] => (item=(u'wan', [u'wan-1', u'wan-2']))
skipping: [wan-2] => (item=(u'edge', [u'edge-1', u'edge-2']))

TASK [switchgroups : include switch group variables] ***************************************************************************************************************
skipping: [edge-2] => (item=(u'wan', [u'wan-1', u'wan-2']))
skipping: [edge-1] => (item=(u'wan', [u'wan-1', u'wan-2']))
ok: [wan-1] => (item=(u'wan', [u'wan-1', u'wan-2']))
skipping: [wan-1] => (item=(u'edge', [u'edge-1', u'edge-2']))
ok: [wan-2] => (item=(u'wan', [u'wan-1', u'wan-2']))
skipping: [wan-2] => (item=(u'edge', [u'edge-1', u'edge-2']))
ok: [edge-2] => (item=(u'edge', [u'edge-1', u'edge-2']))
ok: [edge-1] => (item=(u'edge', [u'edge-1', u'edge-2']))

...

RUNNING HANDLER [interfaces : reload networking] *******************************************************************************************************************
changed: [edge-2] => (item=ifreload -a)
changed: [edge-1] => (item=ifreload -a)
changed: [wan-1] => (item=ifreload -a)
changed: [wan-2] => (item=ifreload -a)
changed: [edge-2] => (item=sleep 10)
changed: [edge-1] => (item=sleep 10)
changed: [wan-2] => (item=sleep 10)
changed: [wan-1] => (item=sleep 10)

RUNNING HANDLER [routing : reload frr] *****************************************************************************************************************************
changed: [edge-2]
changed: [wan-1]
changed: [wan-2]
changed: [edge-1]

RUNNING HANDLER [ptm : restart ptmd] *******************************************************************************************************************************
changed: [edge-2]
changed: [edge-1]
changed: [wan-2]
changed: [wan-1]

RUNNING HANDLER [ntp : restart ntp] ********************************************************************************************************************************
changed: [wan-1]
changed: [edge-1]
changed: [wan-2]
changed: [edge-2]

RUNNING HANDLER [ifplugd : restart ifplugd] ************************************************************************************************************************
changed: [edge-1]
changed: [wan-1]
changed: [edge-2]
changed: [wan-2]

PLAY RECAP *********************************************************************************************************************************************************
edge-1                     : ok=21   changed=17   unreachable=0    failed=0
edge-2                     : ok=21   changed=17   unreachable=0    failed=0
wan-1                      : ok=21   changed=17   unreachable=0    failed=0
wan-2                      : ok=21   changed=17   unreachable=0    failed=0

[email protected]:~/cumulus-edge-vagrant$

At last but not least I wrote a simple Ansible Playbook for connectivity testing using ping what you can find here: https://github.com/berndonline/cumulus-edge-provision/blob/master/icmp_check.yml

[email protected]:~/cumulus-edge-vagrant$ ansible-playbook ../cumulus-edge-provision/check_icmp.yml

PLAY [exit edge] *********************************************************************************************************************************************************************************************************************

TASK [connectivity check from frontend firewall] *************************************************************************************************************************************************************************************
skipping: [fw-2] => (item=10.0.255.33)
skipping: [fw-2] => (item=10.0.255.17)
skipping: [fw-2] => (item=10.0.255.1)
skipping: [fw-2] => (item=217.0.0.1)
skipping: [edge-2] => (item=10.0.255.33)
skipping: [edge-2] => (item=10.0.255.17)
skipping: [edge-2] => (item=10.0.255.1)
skipping: [edge-1] => (item=10.0.255.33)
skipping: [edge-2] => (item=217.0.0.1)
skipping: [edge-1] => (item=10.0.255.17)
skipping: [edge-1] => (item=10.0.255.1)
skipping: [wan-1] => (item=10.0.255.33)
skipping: [edge-1] => (item=217.0.0.1)
skipping: [wan-1] => (item=10.0.255.17)
skipping: [wan-1] => (item=10.0.255.1)
skipping: [wan-1] => (item=217.0.0.1)
skipping: [wan-2] => (item=10.0.255.33)
skipping: [wan-2] => (item=10.0.255.17)
skipping: [wan-2] => (item=10.0.255.1)
skipping: [wan-2] => (item=217.0.0.1)
changed: [fw-1] => (item=10.0.255.33)
changed: [fw-1] => (item=10.0.255.17)
changed: [fw-1] => (item=10.0.255.1)
changed: [fw-1] => (item=217.0.0.1)
...
PLAY RECAP ***************************************************************************************************************************************************************************************************************************
edge-1                     : ok=2    changed=2    unreachable=0    failed=0
edge-2                     : ok=2    changed=2    unreachable=0    failed=0
fw-1                       : ok=1    changed=1    unreachable=0    failed=0
fw-2                       : ok=1    changed=1    unreachable=0    failed=0
wan-1                      : ok=2    changed=2    unreachable=0    failed=0
wan-2                      : ok=2    changed=2    unreachable=0    failed=0

[email protected]:~/cumulus-edge-vagrant$

The icmp check shows that in general the edge routing is working but I need to do some further testing with this if this can be used in a production environment.

If using switch hardware is not the right fit you can still install and use Free Range Routing (FRR) from Cumulus Networks on other Linux distributions and pick server hardware for your own custom edge router. I would only recommend checking Linux kernel support for VRF when choosing another Linux OS. Also have a look at my article about Open Source Routing GRE over IPSec with StrongSwan and Cisco IOS-XE where I build a Debian software router.

Please share your feedback and leave a comment.

Network Monitoring with Prometheus and Cumulus Linux

As promised in my previous article Install Prometheus and Grafana, this post is about how to monitor Cumulus Linux switches with Prometheus.

Let’s start directly by installing the Prometheus Node_Exporter:

sudo useradd --no-create-home --shell /bin/false node_exporter

tar xvf node_exporter-0.15.1.linux-amd64.tar.gz
sudo cp node_exporter-0.15.1.linux-amd64/node_exporter /usr/local/bin
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

sudo bash -c 'cat << EOF > /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF'

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl status node_exporter

Check that the Node_Exporter service is correctly running and listing on tcp 9100 for the Prometheus server to collect the metrics from the switches:

[email protected]:~$ sudo systemctl status node_exporter
● node_exporter.service - Node Exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; disabled)
   Active: active (running) since Thu 2018-03-22 13:41:26 UTC; 958ms ago
 Main PID: 5620 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─5620 /usr/local/bin/node_exporter

Mar 22 13:41:26 spine-2 node_exporter[5620]: time="2018-03-22T13:41:26Z" level=info msg=" - sockstat" source="node_exporter.go:52"
Mar 22 13:41:26 spine-2 node_exporter[5620]: time="2018-03-22T13:41:26Z" level=info msg=" - bcache" source="node_exporter.go:52"
Mar 22 13:41:26 spine-2 node_exporter[5620]: time="2018-03-22T13:41:26Z" level=info msg=" - hwmon" source="node_exporter.go:52"
Mar 22 13:41:26 spine-2 node_exporter[5620]: time="2018-03-22T13:41:26Z" level=info msg=" - cpu" source="node_exporter.go:52"
Mar 22 13:41:26 spine-2 node_exporter[5620]: time="2018-03-22T13:41:26Z" level=info msg=" - stat" source="node_exporter.go:52"
Mar 22 13:41:26 spine-2 node_exporter[5620]: time="2018-03-22T13:41:26Z" level=info msg=" - timex" source="node_exporter.go:52"
Mar 22 13:41:26 spine-2 node_exporter[5620]: time="2018-03-22T13:41:26Z" level=info msg=" - textfile" source="node_exporter.go:52"
Mar 22 13:41:26 spine-2 node_exporter[5620]: time="2018-03-22T13:41:26Z" level=info msg=" - conntrack" source="node_exporter.go:52"
Mar 22 13:41:26 spine-2 node_exporter[5620]: time="2018-03-22T13:41:26Z" level=info msg=" - edac" source="node_exporter.go:52"
Mar 22 13:41:26 spine-2 node_exporter[5620]: time="2018-03-22T13:41:26Z" level=info msg="Listening on :9100" source="node_exporter.go:76"
[email protected]:~$

I created a simple dashboard in Grafana for the switches running Cumulus Linux, where you can find important metrics like throughput of the network interfaces, CPU load, Memory and disk related information:

On the top right corner you can select the switch where you want to see metrics from:

You can also have a central monitoring dashboard where all performance metrics are shown:

Here are detailed views with information about all interfaces from the different switch groups:

This is a very simple solution to monitor your Cumulus Linux switches and in combination with Cumulus NetQ enough to monitor your switch fabric.

FYI, I have used the following virtual topology BGP EVPN and VXLAN with Cumulus Linux.

Please share your feedback and leave a comment.

Install Prometheus and Grafana

Moving away from Cisco and using Open Networking whitebox switches with Cumulus Linux made me think about performance monitoring. In the past I was a fan of Solarwinds NPM but the traditional SNMP based monitoring is pretty outdated and not standard anymore when using Linux based operating systems. I was exploring different other options and came across Prometheus and Grafana.

This is post about how to install Prometheus and Grafana on a central monitoring server, the next post will be about how to integrate Cumulus Linux switches and report metrics to Prometheus and then visualise them with Grafana.

Let’s start installing Prometheus base packages:

sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

cd ~
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.0.0/prometheus-2.0.0.linux-amd64.tar.gz
tar xvf prometheus-2.0.0.linux-amd64.tar.gz
sudo cp prometheus-2.0.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.0.0.linux-amd64/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
sudo cp -r prometheus-2.0.0.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-2.0.0.linux-amd64/console_libraries /etc/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus/consoles
sudo chown -R prometheus:prometheus /etc/prometheus/console_libraries
rm -rf prometheus-2.0.0.linux-amd64.tar.gz prometheus-2.0.0.linux-amd64

sudo touch /etc/prometheus/prometheus.yml 
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml

sudo bash -c 'cat << EOF > /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
EOF'

We have now installed the Prometheus base package but to collect metrics you also need to install the Prometheus Node Exporter:

sudo useradd --no-create-home --shell /bin/false node_exporter

cd ~
curl -LO https://github.com/prometheus/node_exporter/releases/download/v0.15.1/node_exporter-0.15.1.linux-amd64.tar.gz
tar xvf node_exporter-0.15.1.linux-amd64.tar.gz
sudo cp node_exporter-0.15.1.linux-amd64/node_exporter /usr/local/bin
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
rm -rf node_exporter-0.15.1.linux-amd64.tar.gz node_exporter-0.15.1.linux-amd64

sudo bash -c 'cat << EOF > /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF'

Configure Prometheus and define node_exporter targets:

sudo bash -c 'cat << EOF > /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9100']  
EOF'

Start services and access the web console:

sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl start node_exporter

Access the Prometheus web console via http://localhost:9090:

Under “Status -> Targets” you can check if the services state is up:

Let’s continue by installing Grafana:

curl https://packagecloud.io/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packagecloud.io/grafana/stable/debian/ stretch main"
sudo apt-get update
sudo apt-get install grafana
sudo systemctl start grafana-server
sudo systemctl status grafana-server
sudo systemctl enable grafana-server

Now you can access Grafana via http://localhost:3000/. I would recommend putting a Ngnix reverse proxy in-front for SSL encryption.

In the web console we need to configure the data source and point it to Prometheus. To do that go to “settings” and select “data source”:

You should import the following Prometheus dashboard for Grafana otherwise you need to manually configure your dashboard:

For the install of Prometheus and the Node_Exporter I will write two Ansible roles which I will share later. Read my new post about Network Monitoring with Prometheus and Cumulus Linux!

Please share your feedback and leave a comment.

Getting started with Jenkins for Network Automation

As I have mentioned my previous post about Getting started with Gitlab-CI for Network Automation, Jenkins is another continuous integration pipelining tool you can use for network automation. Have a look about how to install Jenkins: https://wiki.jenkins.io/display/JENKINS/Installing+Jenkins+on+Ubuntu

To use the Jenkins with Vagrant and KVM (libvirt) there are a few changes needed on the linux server similar with the Gitlab-Runner. The Jenkins user account needs to be able to control KVM and you need to install the vagrant-libvirt plugin:

usermod -aG libvirtd jenkins
sudo su jenkins
vagrant plugin install vagrant-libvirt

Optional: you may need to copy custom Vagrant boxes into the users vagrant folder ‘/var/lib/jenkins/.vagrant.d/boxes/*’. Note that the Jenkins home directory is not located under /home.

Now lets start configuring a Jenkins CI-pipeline, click on ‘New item’:

This creates an empty pipeline where you need to add the different stages  of what needs to be executed:

Below is an example Jenkins pipeline script which is very similar to the Gitlab-CI pipeline I have used with my Cumulus Linux Lab in the past.

pipeline {
    agent any
    stages {
        stage('Clean and prep workspace') {
            steps {
                sh 'rm -r *'
                git 'https://github.com/berndonline/cumulus-lab-provision'
                sh 'git clone --origin master https://github.com/berndonline/cumulus-lab-vagrant'
            }
        }
        stage('Validate Ansible') {
            steps {
                sh 'bash ./linter.sh'
            }
        }
        stage('Staging') {
            steps {
                sh 'cd ./cumulus-lab-vagrant/ && ./vagrant_create.sh'
                sh 'cd ./cumulus-lab-vagrant/ && bash ../staging.sh'
            }
        }
        stage('Deploy production approval') {
            steps {
                input 'Deploy to prod?'
            }
        }
        stage('Production') {
            steps {
                sh 'cd ./cumulus-lab-vagrant/ && ./vagrant_create.sh'
                sh 'cd ./cumulus-lab-vagrant/ && bash ../production.sh'
            }
        }
    }
}

Let’s run the build pipeline:

The stages get executed one by one and, as you can see below, the production stage has an manual approval build-in that nothing gets deployed to production without someone to approve before, for a controlled production deployment:

Finished pipeline:

This is just a simple example of a network automation pipeline, this can of course be more complex if needed. It should just help you a bit on how to start using Jenkins for network automation.

Please share your feedback and leave a comment.