Cumulus Linux Snapshot Rollback

Following my first post about the Cumulus Linux non-disruptive upgrade procedure on MLAG pairs, here is the rollback procedure. It is good to know in case you need to revert to a snapshot after an unsuccessful software upgrade or a larger configuration change that went wrong.

To back out from an upgrade, you can roll back the state of your switch (software and configuration) to an earlier snapshot.

The rollback is disruptive, because the switch has to be rebooted!

The rollback reverts the entire system except for logs and home directories, so any configuration changes made after the upgrade are also reverted and lost.

To create a custom snapshot, use the following command. Otherwise, a snapshot is created automatically during the software upgrade.

cumulus@switch:~$ sudo snapper create -d SNAPSHOT_NAME

To roll back to a snapshot, perform the following steps on each switch. View the list of snapshots on the switch using the following command:

sudo snapper list

You will see output similar to the following:

cumulus@switch:~$ sudo snapper list
Type   | #  | Pre # | Date                            | User | Cleanup | Description                            | Userdata    
-------+----+-------+---------------------------------+------+---------+----------------------------------------+--------------
single | 0  |       |                                 | root |         | current                                |             
single | 1  |       | Sat 24 Sep 2016 01:45:36 AM UTC | root |         | first root filesystem                  |             
pre    | 20 |       | Thu 01 Dec 2016 01:43:29 AM UTC | root | number  | nclu pre  'net commit' (user cumulus)  |             
post   | 21 | 20    | Thu 01 Dec 2016 01:43:31 AM UTC | root | number  | nclu post 'net commit' (user cumulus)  |             
pre    | 22 |       | Thu 01 Dec 2016 01:44:18 AM UTC | root | number  | nclu pre  '20 rollback' (user cumulus) |             
post   | 23 | 22    | Thu 01 Dec 2016 01:44:18 AM UTC | root | number  | nclu post '20 rollback' (user cumulus) |             
single | 26 |       | Thu 01 Dec 2016 11:23:06 PM UTC | root |         | test_snapshot                          |             
pre    | 29 |       | Thu 01 Dec 2016 11:55:16 PM UTC | root | number  | pre-apt                                | important=yes
post   | 30 | 29    | Thu 01 Dec 2016 11:55:21 PM UTC | root | number  | post-apt                               | important=yes

Locate the “pre-apt” snapshot that corresponds to the date and time of the upgrade. Once you have identified the snapshot, note its number from the second column of the table (#).

Determine the number on each switch separately. If you need to roll back both switches, don’t assume the number is the same on both.
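
Picking the number by eye is easy to get wrong on a long list, so here is a minimal sketch that pulls the most recent "pre-apt" snapshot number out of the table. It assumes the column layout shown above (number in the second column, description in the seventh). The function name latest_pre_apt is my own and not part of snapper.

```shell
# Print the number of the most recent "pre-apt" snapshot found in
# `snapper list` output read from stdin. Exits non-zero if none is found.
# ASSUMPTION: columns are laid out as in the sample output above.
latest_pre_apt() {
    awk -F'|' '
        {
            num = $2; desc = $7
            gsub(/[ \t]/, "", num)              # strip padding around the number
            gsub(/^[ \t]+|[ \t]+$/, "", desc)   # trim the description column
            if (desc == "pre-apt") last = num   # later rows overwrite earlier ones
        }
        END { if (last == "") exit 1; print last }
    '
}

# Usage on a switch (not executed here):
#   sudo snapper list | latest_pre_apt
```

Still compare the date of the reported snapshot with the time of your upgrade before you roll back, since the last "pre-apt" entry is not necessarily the one you want.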

To perform the rollback, use the following command:

sudo snapper rollback NUMBER#

You will see output similar to the following:

cumulus@switch:~$ sudo snapper rollback 29
Creating read-only snapshot of current system. (Snapshot 31.)
Creating read-write snapshot of snapshot 29. (Snapshot 32.)
Setting default subvolume to snapshot 32.
cumulus@switch:~$

Note the snapshot number (31 in this example) reported for the “read-only snapshot of current system”. You can use it to revert the rollback if needed.
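
If you script the rollback, you can capture that number instead of copying it by hand. The following is only a sketch: it assumes the message wording shown in the output above and that snapper writes it to standard output. The function name readonly_snapshot_id is hypothetical.

```shell
# Read `snapper rollback` output from stdin and print the number of the
# read-only snapshot that was taken of the current system.
# ASSUMPTION: the message text matches the sample output above.
readonly_snapshot_id() {
    sed -n 's/^Creating read-only snapshot of current system\. (Snapshot \([0-9][0-9]*\)\.).*/\1/p'
}

# Usage on a switch (not executed here):
#   undo_id=$(sudo snapper rollback 29 | readonly_snapshot_id)
#   echo "To undo this rollback: sudo snapper rollback $undo_id"
```

Keep in mind that the pipe hides the normal snapper messages, so print or log the captured number right away.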

Reboot the system with the following command:

sudo reboot

You can find more information in the Cumulus Linux documentation:

https://docs.cumulusnetworks.com/display/DOCS/Using+Snapshots

Cumulus Linux non-disruptive upgrade procedure on MLAG pairs

I thought it would be useful to document the exact procedure for a non-disruptive upgrade of Cumulus Linux MLAG (CLAG) pairs. I find the online documentation, Upgrading Cumulus Linux, a bit short on the order in which you have to upgrade the switches when running CLAG to keep traffic disruption minimal.

The following procedure worked for me on Dell S4048-ON and Dell S3048-ON switches.

  • On both switches, run the following command to refresh the package index of the apt repository:
sudo apt-get update
  • Run the following command to determine which switch is the CLAG primary and which is the CLAG secondary:
sudo net show clag
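
To read the role in a script rather than by eye, a small filter like the one below could be used. This is a sketch under an assumption: that the status output contains a line of the form "Our Priority, ID, and Role: 32768 44:38:39:00:00:11 primary". Check this against the output on your release. The function name clag_role and the MAC address in the comment are illustrative only.

```shell
# Print this switch's CLAG role ("primary" or "secondary") from status
# text on stdin. Exits non-zero if no role line is found.
# ASSUMPTION: the output contains a line such as
#   Our Priority, ID, and Role: 32768 44:38:39:00:00:11 primary
clag_role() {
    awk '/Our Priority, ID, and Role:/ { print $NF; found = 1 }
         END { exit found ? 0 : 1 }'
}

# Usage on a switch (not executed here):
#   sudo net show clag | clag_role
```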

Start upgrading the secondary CLAG member:

  • Shut down all interfaces except the peerlink using the command below. This forces all traffic through the other switch:
echo swp{1..52} | tr ' ' '\n' | sudo xargs -i ip link set {} down

A continuous ping through the switch pair showed a single lost packet during this step:
64 bytes from 8.8.8.8: icmp_seq=8903 ttl=59 time=1.106 ms
64 bytes from 8.8.8.8: icmp_seq=8904 ttl=59 time=0.974 ms
64 bytes from 8.8.8.8: icmp_seq=8905 ttl=59 time=1.643 ms
64 bytes from 8.8.8.8: icmp_seq=8906 ttl=59 time=0.869 ms
Request timeout for icmp_seq 8907
64 bytes from 8.8.8.8: icmp_seq=8908 ttl=59 time=1.256 ms
64 bytes from 8.8.8.8: icmp_seq=8909 ttl=59 time=0.769 ms

(Rollback) If you see problems, revert the change with the following command:

echo swp{1..52} | tr ' ' '\n' | sudo xargs -i ip link set {} up

Wait one minute for CLAG to stabilise and verify network communication with the remaining switch.
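
The shutdown command and its rollback differ only in the last word, so it can help to wrap them in one helper that prints the commands first and only runs them when asked. This is a rough sketch: front_ports and the APPLY variable are names I made up, and the default of 52 ports is taken from the command above. Adjust the count so that the peerlink member ports stay outside the range.

```shell
# Print "ip link set swpN STATE" for ports 1..COUNT (default 52).
# With APPLY=1 the commands are executed via sudo instead of printed.
# front_ports and APPLY are hypothetical names, not Cumulus tooling.
front_ports() {
    state=$1
    count=${2:-52}
    case "$state" in
        up|down) ;;
        *) echo "usage: front_ports up|down [count]" >&2; return 2 ;;
    esac
    i=1
    while [ "$i" -le "$count" ]; do
        if [ "${APPLY:-0}" = "1" ]; then
            sudo ip link set "swp$i" "$state"
        else
            echo "ip link set swp$i $state"
        fi
        i=$((i + 1))
    done
}

# Preview:   front_ports down
# Execute:   APPLY=1 front_ports down
# Roll back: APPLY=1 front_ports up
```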

  • Perform a clean shutdown of clagd on this switch:
sudo systemctl stop clagd

(Rollback) If you see problems, start clagd again:

sudo systemctl start clagd

Wait one minute for CLAG to shut down cleanly.

  • Shut down the peerlink bond:
sudo ip link set peerlink down

(Rollback) If you see problems, enable the peerlink again:

sudo ip link set peerlink up

  • Perform the upgrade using the command:
sudo apt-get upgrade

It is important to cleanly shut down all the ports first because the bridge and the peerlink bounce during the package upgrade, which could affect network communication if it happens in an uncontrolled way.

  • Reboot the switch using the command:
sudo reboot

Wait for the upgraded switch to come back up. This causes a short outage in traffic:

64 bytes from 8.8.8.8: icmp_seq=9443 ttl=59 time=1.069 ms
64 bytes from 8.8.8.8: icmp_seq=9444 ttl=59 time=1.150 ms
64 bytes from 8.8.8.8: icmp_seq=9445 ttl=59 time=0.993 ms
64 bytes from 8.8.8.8: icmp_seq=9446 ttl=59 time=1.331 ms
Request timeout for icmp_seq 9447
Request timeout for icmp_seq 9448
Request timeout for icmp_seq 9449
64 bytes from 8.8.8.8: icmp_seq=9450 ttl=59 time=1.539 ms
64 bytes from 8.8.8.8: icmp_seq=9451 ttl=59 time=0.908 ms
64 bytes from 8.8.8.8: icmp_seq=9452 ttl=59 time=1.166 ms
64 bytes from 8.8.8.8: icmp_seq=9453 ttl=59 time=1.261 ms

Wait until the network is functioning normally again.
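
To put a number on the outage instead of scrolling through ping output, you can count the gaps in the icmp_seq values. The sketch below assumes reply lines in the format shown above and only counts losses between the first and the last reply. The function name count_lost_pings and the log file name are my own.

```shell
# Count missing icmp_seq numbers between the first and last reply in
# ping output read from stdin. Only "bytes from" reply lines are used,
# so explicit timeout lines are not required in the input.
count_lost_pings() {
    awk '/bytes from/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^icmp_seq=/) {
                split($i, kv, "=")
                seq = kv[2] + 0
                if (seen && seq > prev + 1) lost += seq - prev - 1
                prev = seq
                seen = 1
            }
        }
    }
    END { print lost + 0 }'
}

# Usage (not executed here), with ping.log as a hypothetical file name:
#   ping 8.8.8.8 | tee ping.log      # in one terminal during the upgrade
#   count_lost_pings < ping.log      # afterwards
```

With one ping per second, the result is roughly the outage in seconds.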

On the upgraded secondary, run the following command to make it take over the primary role:

sudo clagctl priority 0

Wait one minute for CLAG to fail over.

Verify that the CLAG handover has occurred:

sudo net show clag
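
Instead of waiting a fixed minute and then checking once, the check can be repeated until it passes or a timeout is reached. Below is a generic retry helper as a sketch. wait_until is a hypothetical name, and the grep pattern in the usage comment assumes that the status output contains a line starting with "Our Priority, ID, and Role:" that ends with the role.

```shell
# Run a command repeatedly until it succeeds.
#   wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 once TIMEOUT seconds of
# waiting have passed without success.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Usage on the switch that should become primary (not executed here).
# ASSUMPTION: the role line looks like "Our Priority, ID, and Role: ... primary".
#   wait_until 60 5 sh -c 'sudo net show clag | grep -q "Our Priority.* primary"'
```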

Repeat the steps above on the new secondary (the old primary), starting with shutting down all interfaces.

Once finished, you need to reset the CLAG priority on the primary to its configured default value.

Read my next post on how to roll back if an upgrade fails: Cumulus Linux Snapshot Rollback