Cumulus Linux Ethernet link-state monitoring using ifplugd

This blog post is about link-state monitoring under Cumulus Linux. Cumulus has no built-in tool of its own for this and recommends using ifplugd. The tool has some similarities to Cisco’s IP SLA, which can track the state of interfaces.

The main reason to use ifplugd is to handle split-brain scenarios, where you lose the peerlink between a Cumulus Linux CLAG pair. If the peerlink goes down, the CLAG primary switch stays the active member and the secondary automatically disables all of its CLAG bonds. This forces the connected servers to fail over to the CLAG primary switch and keeps the network operational.

It is very important to configure clagd-backup-ip, because Cumulus Linux needs it to keep communicating with its neighbour when the peerlink is lost.
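
A quick way to confirm that the backup IP is actually set is to search the interface configuration for it. This is a minimal sketch: the helper name backup_ip_configured is made up for this post, and it assumes the CLAG settings live in /etc/network/interfaces (they may also sit in a file under /etc/network/interfaces.d/ on your switch).

```shell
#!/bin/sh
# Print every clagd-backup-ip line found in the given ifupdown2 config file.
# Returns non-zero when no such line exists.
# backup_ip_configured is a hypothetical helper name, not a Cumulus tool.
backup_ip_configured() {
    grep -E '^[[:space:]]*clagd-backup-ip[[:space:]]+' "$1"
}

# Example on a switch (path is an assumption, adjust to your setup):
#   backup_ip_configured /etc/network/interfaces \
#       || echo "clagd-backup-ip is missing"
```

Run it on both switches of the pair, since each one needs its own backup IP entry.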

ifplugd matters for all connected servers that are not using CLAG bonds, basically servers with normal active/standby teaming, which doesn’t require a CLAG bond configuration on the switch. These ports are configured as normal access ports, so without ifplugd a peerlink failure would leave them up.

ifplugd needs to be installed and configured on both switches running CLAG. Follow the steps below.

Install ifplugd service:

sudo apt-get update
sudo apt-get install ifplugd

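Edit the file /etc/default/ifplugd and add the lines below.

The down delay is set to a moderate 10 seconds (-d10) because ifplugd runs in combination with CLAG, while -u0 applies no delay when the link comes back up. We need to observe the behaviour and lower the value over time.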

INTERFACES="peerlink"
HOTPLUG_INTERFACES=""
ARGS="-q -f -u0 -d10 -w -I"
SUSPEND_ACTION="stop"
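
When it is time to lower the down delay, the -d value inside ARGS can be changed with a small helper instead of editing the file by hand. This is only a sketch: set_down_delay is a made-up name, and it assumes GNU sed (as shipped with Debian-based systems such as Cumulus Linux) and an ARGS line that already contains a -d option.

```shell
#!/bin/sh
# Rewrite the -d<seconds> option in the ARGS line of an ifplugd defaults file.
# $1 = path to the file, $2 = new down delay in whole seconds.
# A backup copy is kept as <file>.bak.
# set_down_delay is a hypothetical helper, not part of ifplugd.
set_down_delay() {
    case "$2" in
        '' | *[!0-9]*) echo "delay must be a whole number" >&2; return 1 ;;
    esac
    sed -i.bak -E "s/^(ARGS=\".*)-d[0-9]+/\1-d${2}/" "$1"
}

# Example (needs root on the switch):
#   set_down_delay /etc/default/ifplugd 5
```

The daemon only picks up the new value after a restart with sudo systemctl restart ifplugd.service.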

Edit the file /etc/ifplugd/action.d/ifupdown.

The variable $SWITCHPORTS defines which ports ifplugd should shut down if the peerlink goes down. We decided to use a custom variable instead of shutting down all ports, because CLAG already takes care of the configured bonds.

#!/bin/sh

# The peerlink bond interface
PEERLINK=peerlink

# The switchports to bring down on peerlink failure
#
# enclosures 01/02: swp5..swp8
SWITCHPORTS=$(seq -f swp%g 5 8)
# storage system 01/02 : swp19..swp22
SWITCHPORTS="$SWITCHPORTS $(seq -f swp%g 19 22)"
# server1/server2: swp27..swp28
SWITCHPORTS="$SWITCHPORTS $(seq -f swp%g 27 28)"
# VMware cluster: swp35..swp38
SWITCHPORTS="$SWITCHPORTS $(seq -f swp%g 35 38)"

case "$1" in

    "$PEERLINK")
        # Role of this switch (primary or secondary) as reported by clagctl
        clagrole=$(clagctl | grep "Our Priority" | awk '{print $8}')
        case "$2" in
            up | down)
                action=$2
                # Only the CLAG secondary toggles the listed switchports
                if [ "$clagrole" = "secondary" ]; then
                    for interface in $SWITCHPORTS; do
                        echo "bringing $action : $interface"
                        ip link set "$interface" "$action"
                    done
                fi
                ;;
        esac
        ;;

esac
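
Before trusting the action script on a production pair, it can help to dry-run its logic. The sketch below rebuilds the three pieces (port list, role lookup, port loop) as functions that only print the ip commands instead of running them. All function names are made up for this post, and the sample clagctl line is purely illustrative, shaped so that the role is the eighth field as the awk call in the script expects.

```shell
#!/bin/sh
# Dry-run sketch of the action script logic: prints commands, changes nothing.
# port_range, clag_role and plan_actions are hypothetical helper names.

# Expand an inclusive range into port names, e.g. "swp5 swp6 swp7 swp8 ".
port_range() {
    seq -f 'swp%g' "$1" "$2" | tr '\n' ' '
}

# Read clagctl output on stdin and print the 8th field of the
# "Our Priority" line, the same field the action script uses.
clag_role() {
    awk '/Our Priority/ {print $8}'
}

# Print the ip commands the action script would run.
# $1 = role, $2 = up|down, remaining arguments = port names.
plan_actions() {
    role=$1
    action=$2
    shift 2
    # The primary switch leaves its ports untouched
    [ "$role" = "secondary" ] || return 0
    for interface in "$@"; do
        echo "ip link set $interface $action"
    done
}

# Illustrative input only; on a switch use: role=$(clagctl | clag_role)
sample='Our Priority, ID, and Role: 32768 44:38:39:00:00:11 secondary'
role=$(printf '%s\n' "$sample" | clag_role)
plan_actions "$role" down $(port_range 27 28)
```

With the sample line this prints one ip link set command each for swp27 and swp28. With a role of primary it prints nothing, which matches the behaviour seen on leaf-01-c in the logs further down.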

Start the ifplugd service:

sudo systemctl restart ifplugd.service

Impact of a simulated peerlink failure from the server perspective:

2017-09-19T11:43:15.665057+00:00 leaf-01-c ifplugd(peerlink)[5292]: Link beat lost.
2017-09-19T11:43:25.775585+00:00 leaf-01-c ifplugd(peerlink)[5292]: Executing '/etc/ifplugd/ifplugd.action peerlink down'.
2017-09-19T11:43:25.902637+00:00 leaf-01-c ifplugd(peerlink)[5292]: Program executed successfully.
root@leaf-01-c:/home/cumulus# 

root@leaf-02-c:/home/cumulus# grep ifplugd /var/log/syslog
[...]
2017-09-19T11:43:15.780727+00:00 leaf-02-c ifplugd(peerlink)[12600]: Link beat lost.
2017-09-19T11:43:25.891584+00:00 leaf-02-c ifplugd(peerlink)[12600]: Executing '/etc/ifplugd/ifplugd.action peerlink down'.
2017-09-19T11:43:26.107140+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp5
2017-09-19T11:43:26.146421+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp6
2017-09-19T11:43:26.171454+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp7
2017-09-19T11:43:26.193387+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp8

Ping output on the server during the failure:

64 bytes from 8.8.8.8: icmp_seq=1623 ttl=59 time=0.524 ms
64 bytes from 8.8.8.8: icmp_seq=1624 ttl=59 time=0.782 ms
64 bytes from 8.8.8.8: icmp_seq=1625 ttl=59 time=0.847 ms
Request timeout for icmp_seq 1626
Request timeout for icmp_seq 1627
Request timeout for icmp_seq 1628
Request timeout for icmp_seq 1629
Request timeout for icmp_seq 1630
Request timeout for icmp_seq 1631
Request timeout for icmp_seq 1632
Request timeout for icmp_seq 1633
Request timeout for icmp_seq 1634
Request timeout for icmp_seq 1635
Request timeout for icmp_seq 1636
Request timeout for icmp_seq 1637
Request timeout for icmp_seq 1638
64 bytes from 8.8.8.8: icmp_seq=1639 ttl=59 time=0.701 ms
64 bytes from 8.8.8.8: icmp_seq=1640 ttl=59 time=0.708 ms
64 bytes from 8.8.8.8: icmp_seq=1641 ttl=59 time=0.780 ms
64 bytes from 8.8.8.8: icmp_seq=1642 ttl=59 time=0.781 ms

Impact of reconnecting the peerlink from the server perspective:

root@leaf-01-c:/home/cumulus# grep ifplugd /var/log/syslog
[...]
2017-09-19T11:48:22.190187+00:00 leaf-01-c ifplugd(peerlink)[5292]: Link beat detected.
2017-09-19T11:48:22.290481+00:00 leaf-01-c ifplugd(peerlink)[5292]: Executing '/etc/ifplugd/ifplugd.action peerlink up'.
2017-09-19T11:48:22.524673+00:00 leaf-01-c ifplugd(peerlink)[5292]: Program executed successfully.

root@leaf-02-c:/home/cumulus# grep ifplugd /var/log/syslog
[...]
2017-09-19T11:48:22.084477+00:00 leaf-02-c ifplugd(peerlink)[12600]: Link beat detected.
2017-09-19T11:48:22.232192+00:00 leaf-02-c ifplugd(peerlink)[12600]: Executing '/etc/ifplugd/ifplugd.action peerlink up'.
2017-09-19T11:48:22.812771+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp5
2017-09-19T11:48:22.816175+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp6
2017-09-19T11:48:22.831487+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp7
2017-09-19T11:48:22.836617+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp8

Ping output on the server during the reconnect:

64 bytes from 8.8.8.8: icmp_seq=24 ttl=59 time=0.614 ms
64 bytes from 8.8.8.8: icmp_seq=25 ttl=59 time=0.680 ms
64 bytes from 8.8.8.8: icmp_seq=26 ttl=59 time=8.932 ms
64 bytes from 8.8.8.8: icmp_seq=27 ttl=59 time=1.126 ms
64 bytes from 8.8.8.8: icmp_seq=28 ttl=59 time=2.424 ms
Request timeout for icmp_seq 29
Request timeout for icmp_seq 30
Request timeout for icmp_seq 31
Request timeout for icmp_seq 32
Request timeout for icmp_seq 33
Request timeout for icmp_seq 34
Request timeout for icmp_seq 35
64 bytes from 8.8.8.8: icmp_seq=36 ttl=59 time=6.491 ms
64 bytes from 8.8.8.8: icmp_seq=37 ttl=59 time=1.045 ms
64 bytes from 8.8.8.8: icmp_seq=38 ttl=59 time=1.244 ms

Yes, it takes a few seconds for your server to reconnect after a peerlink failure (13 pings were lost during the failure above and 7 during the reconnect), but it is very important to keep the datacenter network operational.
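
To compare outage lengths while tuning the delay, the timeout lines in saved ping output can simply be counted. A minimal sketch, assuming the output was saved to a file and uses the same "Request timeout for icmp_seq" wording as the captures in this post. The name count_lost_pings and the file ping.log are made up.

```shell
#!/bin/sh
# Count lost echo requests in ping output read from stdin.
# Matches the "Request timeout for icmp_seq N" lines shown in this post;
# other ping implementations word their timeouts differently.
# count_lost_pings is a hypothetical helper name.
count_lost_pings() {
    awk '/^Request timeout for icmp_seq/ {n++} END {print n + 0}'
}

# Example on the server:
#   ping 8.8.8.8 | tee ping.log     (stop with Ctrl-C after the test)
#   count_lost_pings < ping.log
```

If ping sends one request per second, the count is roughly the outage in seconds.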

For more information have a look at the Cumulus Linux documentation: https://docs.cumulusnetworks.com/display/DOCS/ifplugd
