This blog post is about link-state monitoring under Cumulus Linux. Cumulus Linux has no built-in tool of its own for this and recommends using ifplugd. The tool is somewhat similar to Cisco’s IP SLA, which can track the state of interfaces.
The main reason to use ifplugd is to handle split-brain scenarios, when you lose the peerlink between a pair of Cumulus Linux CLAG switches. If the peerlink goes down, the CLAG primary switch stays the active member, and the secondary automatically disables all of its CLAG bonds. This forces the connected servers to fail over to the CLAG primary switch and keeps the network operational.
It is very important to configure clagd-backup-ip, because the switches need it to keep communicating with their neighbour when they lose the peerlink.
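As a rough illustration, the backup IP is usually set on the peerlink subinterface in /etc/network/interfaces. The fragment below is only a sketch: the subinterface name peerlink.4094 and every address and MAC in it are placeholder assumptions, not values from this setup, so adapt them to your own switches.

```
# /etc/network/interfaces (fragment, placeholder values)
auto peerlink.4094
iface peerlink.4094
    address 169.254.1.1/30
    clagd-peer-ip 169.254.1.2
    # Reachable without the peerlink, e.g. the peer's management address
    clagd-backup-ip 192.0.2.50
    clagd-sys-mac 44:38:39:FF:40:94
```

The backup IP should be reachable over a path that does not use the peerlink, otherwise it goes away together with the link it is meant to back up.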
This is where ifplugd becomes important: it covers all connected servers that do not use CLAG bonds, that is, servers using ordinary active/standby teaming, which requires no CLAG bond configuration on the switches. Their ports are configured as normal access ports, so without ifplugd a peerlink failure would leave these ports up on the secondary switch.
ifplugd needs to be installed and configured on both switches running CLAG. Follow the steps below.
Install the ifplugd service:
sudo apt-get update
sudo apt-get install ifplugd
Edit the file /etc/default/ifplugd and add the lines below.
The down delay is set to a moderate 10 seconds (-d10) because ifplugd runs in combination with CLAG. We plan to observe the behaviour and lower the value over time.
INTERFACES="peerlink"
HOTPLUG_INTERFACES=""
ARGS="-q -f -u0 -d10 -w -I"
SUSPEND_ACTION="stop"
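For reference, here is the same configuration with one comment per option. The option meanings are taken from the ifplugd man page as I understand it, and the sketch assumes the file is sourced as shell variable assignments (as the Debian packaging does), so the comments are harmless. It is an annotated copy, not a different configuration.

```shell
# /etc/default/ifplugd (annotated copy of the settings above)

# Interfaces to monitor: only the peerlink bond.
INTERFACES="peerlink"
# No hotplugged interfaces are monitored.
HOTPLUG_INTERFACES=""
# -q    do not run the action script when the daemon quits
# -f    ignore link detection failures and retry
# -u0   run the "up" action immediately after link beat is detected
# -d10  wait 10 seconds after link beat is lost before the "down" action
# -w    wait until the daemon has finished forking
# -I    do not exit if the action script returns a nonzero value
ARGS="-q -f -u0 -d10 -w -I"
# Stop the daemon on suspend.
SUSPEND_ACTION="stop"
```

The 10-second down delay is visible in the failure logs further down: "Link beat lost" is logged at 11:43:15 and the action script is executed at 11:43:25.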
Edit the file /etc/ifplugd/action.d/ifupdown:
The variable $SWITCHPORTS defines which ports ifplugd should shut down if the peerlink goes down. We decided to use a custom variable instead of shutting down all ports, because CLAG already takes care of the configured bonds.
#!/bin/sh

# The peerlink bond interface
PEERLINK=peerlink

# The switchports to bring down on peerlink failure
#
# enclosures 01/02: swp5..swp8
SWITCHPORTS=$(seq -f swp%g 5 8)
# storage system 01/02: swp19..swp22
SWITCHPORTS="$SWITCHPORTS $(seq -f swp%g 19 22)"
# server1/server2: swp27..swp28
SWITCHPORTS="$SWITCHPORTS $(seq -f swp%g 27 28)"
# VMware cluster: swp35..swp38
SWITCHPORTS="$SWITCHPORTS $(seq -f swp%g 35 38)"

# $1 is the interface name, $2 is the new link state (up or down)
case "$1" in
    "$PEERLINK")
        # The local CLAG role is the 8th field of the "Our Priority" line
        clagrole=$(clagctl | grep "Our Priority" | awk '{print $8}')
        case "$2" in
            up | down)
                action=$2
                # Only the secondary switch touches its ports
                if [ "$clagrole" = "secondary" ]; then
                    for interface in $SWITCHPORTS; do
                        echo "bringing $action : $interface"
                        ip link set "$interface" "$action"
                    done
                fi
                ;;
        esac
        ;;
esac
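Before relying on the action script, it can help to check the port list and the role parsing without touching any interface. The following is a minimal dry-run sketch: the function names build_switchports and parse_clag_role are made up for this example, and the sample line only imitates the format of clagctl output with invented priority and MAC values.

```shell
#!/bin/sh
# Dry-run sketch: nothing here changes interface state.

# Build the same port list as the action script.
build_switchports() {
    ports=$(seq -f swp%g 5 8)
    ports="$ports $(seq -f swp%g 19 22)"
    ports="$ports $(seq -f swp%g 27 28)"
    ports="$ports $(seq -f swp%g 35 38)"
    # Unquoted on purpose: collapses the newlines from seq into spaces.
    echo $ports
}

# Extract the local CLAG role from clagctl-style output on stdin,
# using the same grep/awk pipeline as the action script.
parse_clag_role() {
    grep "Our Priority" | awk '{print $8}'
}

# Sample line in clagctl format; the values are invented.
sample='     Our Priority, ID, and Role: 8192 44:38:39:FF:00:02 secondary'

role=$(printf '%s\n' "$sample" | parse_clag_role)
if [ "$role" = "secondary" ]; then
    for interface in $(build_switchports); do
        echo "would run: ip link set $interface down"
    done
fi
```

On a real switch, replacing the sample line with the output of clagctl shows which ports the secondary would actually shut down.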
Start the ifplugd service:
sudo systemctl restart ifplugd.service
Impact of a simulated peerlink failure from the server perspective:
2017-09-19T11:43:15.665057+00:00 leaf-01-c ifplugd(peerlink)[5292]: Link beat lost.
2017-09-19T11:43:25.775585+00:00 leaf-01-c ifplugd(peerlink)[5292]: Executing '/etc/ifplugd/ifplugd.action peerlink down'.
2017-09-19T11:43:25.902637+00:00 leaf-01-c ifplugd(peerlink)[5292]: Program executed successfully.
root@leaf-01-c:/home/cumulus#

root@leaf-02-c:/home/cumulus# grep ifplugd /var/log/syslog
[...]
2017-09-19T11:43:15.780727+00:00 leaf-02-c ifplugd(peerlink)[12600]: Link beat lost.
2017-09-19T11:43:25.891584+00:00 leaf-02-c ifplugd(peerlink)[12600]: Executing '/etc/ifplugd/ifplugd.action peerlink down'.
2017-09-19T11:43:26.107140+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp5
2017-09-19T11:43:26.146421+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp6
2017-09-19T11:43:26.171454+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp7
2017-09-19T11:43:26.193387+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing down : swp8

64 bytes from 8.8.8.8: icmp_seq=1623 ttl=59 time=0.524 ms
64 bytes from 8.8.8.8: icmp_seq=1624 ttl=59 time=0.782 ms
64 bytes from 8.8.8.8: icmp_seq=1625 ttl=59 time=0.847 ms
Request timeout for icmp_seq 1626
Request timeout for icmp_seq 1627
Request timeout for icmp_seq 1628
Request timeout for icmp_seq 1629
Request timeout for icmp_seq 1630
Request timeout for icmp_seq 1631
Request timeout for icmp_seq 1632
Request timeout for icmp_seq 1633
Request timeout for icmp_seq 1634
Request timeout for icmp_seq 1635
Request timeout for icmp_seq 1636
Request timeout for icmp_seq 1637
Request timeout for icmp_seq 1638
64 bytes from 8.8.8.8: icmp_seq=1639 ttl=59 time=0.701 ms
64 bytes from 8.8.8.8: icmp_seq=1640 ttl=59 time=0.708 ms
64 bytes from 8.8.8.8: icmp_seq=1641 ttl=59 time=0.780 ms
64 bytes from 8.8.8.8: icmp_seq=1642 ttl=59 time=0.781 ms
Impact of reconnecting the peerlink from the server perspective:
root@leaf-01-c:/home/cumulus# grep ifplugd /var/log/syslog
[...]
2017-09-19T11:48:22.190187+00:00 leaf-01-c ifplugd(peerlink)[5292]: Link beat detected.
2017-09-19T11:48:22.290481+00:00 leaf-01-c ifplugd(peerlink)[5292]: Executing '/etc/ifplugd/ifplugd.action peerlink up'.
2017-09-19T11:48:22.524673+00:00 leaf-01-c ifplugd(peerlink)[5292]: Program executed successfully.

root@leafsw-f24-02-c:/home/cumulus# grep ifplugd /var/log/syslog
[...]
2017-09-19T11:48:22.084477+00:00 leaf-02-c ifplugd(peerlink)[12600]: Link beat detected.
2017-09-19T11:48:22.232192+00:00 leaf-02-c ifplugd(peerlink)[12600]: Executing '/etc/ifplugd/ifplugd.action peerlink up'.
2017-09-19T11:48:22.812771+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp5
2017-09-19T11:48:22.816175+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp6
2017-09-19T11:48:22.831487+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp7
2017-09-19T11:48:22.836617+00:00 leaf-02-c ifplugd(peerlink)[12600]: client: bringing up : swp8

64 bytes from 8.8.8.8: icmp_seq=24 ttl=59 time=0.614 ms
64 bytes from 8.8.8.8: icmp_seq=25 ttl=59 time=0.680 ms
64 bytes from 8.8.8.8: icmp_seq=26 ttl=59 time=8.932 ms
64 bytes from 8.8.8.8: icmp_seq=27 ttl=59 time=1.126 ms
64 bytes from 8.8.8.8: icmp_seq=28 ttl=59 time=2.424 ms
Request timeout for icmp_seq 29
Request timeout for icmp_seq 30
Request timeout for icmp_seq 31
Request timeout for icmp_seq 32
Request timeout for icmp_seq 33
Request timeout for icmp_seq 34
Request timeout for icmp_seq 35
64 bytes from 8.8.8.8: icmp_seq=36 ttl=59 time=6.491 ms
64 bytes from 8.8.8.8: icmp_seq=37 ttl=59 time=1.045 ms
64 bytes from 8.8.8.8: icmp_seq=38 ttl=59 time=1.244 ms
Yes, it takes a few seconds for your servers to reconnect when the peerlink fails or comes back (13 echo requests were lost in the failure test above and 7 on reconnect), but this is very important for keeping the datacenter network operational.
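If you repeat the test while tuning the delay, counting the lost probes by hand gets tedious. Here is a small sketch that counts the "Request timeout" lines in ping output read from stdin. The function name count_lost_pings is made up for this example, and it assumes your ping prints timeouts in the same format as the output above.

```shell
#!/bin/sh
# Count lost probes in ping output read from stdin.
# Matches lines of the form "Request timeout for icmp_seq N".
count_lost_pings() {
    grep -c "Request timeout for icmp_seq"
}

# Short excerpt in the format of the ping output above.
printf '%s\n' \
    '64 bytes from 8.8.8.8: icmp_seq=28 ttl=59 time=2.424 ms' \
    'Request timeout for icmp_seq 29' \
    'Request timeout for icmp_seq 30' \
    '64 bytes from 8.8.8.8: icmp_seq=36 ttl=59 time=6.491 ms' |
    count_lost_pings
```

With ping output saved to a file, `count_lost_pings < ping.log` gives the number of lost probes for that run (ping.log being whatever file name you chose).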
For more information have a look at the Cumulus Linux documentation: https://docs.cumulusnetworks.com/display/DOCS/ifplugd