OpenShift Networking and Network Policies

This article is about OpenShift networking in general, but I also want to look at the Kubernetes NetworkPolicy feature in a bit more detail. The latest OpenShift version, 3.11, ships with three SDN deployment models, plus a next-generation one that is still in development (the SDN plugin is selected in the cluster network configuration; see the sketch after the list):

  • ovs-subnet – This creates a single, flat VXLAN network across all namespaces, and every pod is able to talk to every other pod.
  • ovs-multitenant – As the name suggests, this separates the namespaces into their own VXLAN networks, and only resources within the same namespace are able to talk to each other. You have the option to join namespaces together or to make a namespace global.
  • ovs-networkpolicy – The newest SDN deployment method for OpenShift, enabling micro-segmentation to control the communication between pods and namespaces.
  • ovs-ovn – The next-generation SDN for OpenShift, not yet officially released. For more information, visit the ovn-kubernetes repository on the Open vSwitch GitHub.
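
Which plugin is used is defined in the cluster network configuration at install time. As a rough sketch (the file is abbreviated and the surrounding fields may differ between installations), the networkConfig section of the master configuration selects the plugin by name, here ovs-networkpolicy:

# /etc/origin/master/master-config.yaml (abbreviated excerpt)
networkConfig:
  networkPluginName: redhat/openshift-ovs-networkpolicy
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostSubnetLength: 9
  serviceNetworkCIDR: 172.30.0.0/16

The cluster and service network ranges shown here are the defaults and match the 10.128.0.0/14 pod network and the 172.30.x.x service IPs you will see in the outputs below.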

Here is an overview of the common ovs-multitenant software-defined network:

On an OpenShift node, the tun0 interface owns the default gateway and either forwards traffic to external endpoints outside the OpenShift platform or routes internal traffic into the Open vSwitch overlay. Both Open vSwitch and iptables are central components which are very important for the networking on the platform.

Read the official OpenShift documentation on managing networking or configuring the SDN for more information.

NetworkPolicy in Action

Let me first explain the example I use to test NetworkPolicy. We will have one hello-openshift pod behind a service, and a busybox pod for testing the internal communication. I will create a default ingress deny policy and specifically allow TCP port 8080 to my hello-openshift pod. I am not planning to restrict the busybox pod with an egress policy, so all egress traffic is allowed.

Here you find the example yaml files to replicate the layout: busybox.yml and hello-openshift.yml
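
Those linked files are the ones to use; purely as a sketch of the layout (labels, the image reference and exact fields here are assumptions and not copied from those files), hello-openshift.yml contains roughly a deployment config plus a service along these lines:

apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  name: hello-app-http
spec:
  replicas: 1
  selector:
    app: hello-app-http
  template:
    metadata:
      labels:
        app: hello-app-http
    spec:
      containers:
      - name: hello-app-http
        image: openshift/hello-openshift   # listens on tcp/8080 and tcp/8888
        ports:
        - containerPort: 8080
        - containerPort: 8888
---
apiVersion: v1
kind: Service
metadata:
  name: hello-app-http
spec:
  selector:
    app: hello-app-http
  ports:
  - name: web
    port: 80
    targetPort: 8080
    protocol: TCP

The busybox.yml follows the same pattern with a plain busybox image running a sleep command, so the pod stays up for the test commands.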

A short recap on Kubernetes services: they are just simple iptables entries, and for this reason you cannot restrict them with NetworkPolicy.

[root@master1 ~]# iptables-save | grep 172.30.231.77
-A KUBE-SERVICES ! -s 10.128.0.0/14 -d 172.30.231.77/32 -p tcp -m comment --comment "myproject/hello-app-http:web cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 172.30.231.77/32 -p tcp -m comment --comment "myproject/hello-app-http:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-LFWXBQW674LJXLPD
[root@master1 ~]#

When you install OpenShift with ovs-networkpolicy, the default policy allows all traffic within a namespace. Let’s do a first test without a custom NetworkPolicy rule to see if I am able to connect to my hello-app-http service.

[root@master1 ~]# oc exec busybox-1-wn592 -- wget -S --spider http://hello-app-http
Connecting to hello-app-http (172.30.231.77:80)
  HTTP/1.1 200 OK
  Date: Tue, 19 Feb 2019 13:59:04 GMT
  Content-Length: 17
  Content-Type: text/plain; charset=utf-8
  Connection: close

[root@master1 ~]#

Now we add a default ingress deny policy to the namespace:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  ingress: []

After applying the default deny policy, you are no longer able to connect to the hello-app-http service. The connection times out because no flow entries allowing this traffic exist yet in the OpenFlow table:

[root@master1 ~]# oc exec busybox-1-wn592 -- wget -S --spider http://hello-app-http
Connecting to hello-app-http (172.30.231.77:80)
wget: can't connect to remote host (172.30.231.77): Connection timed out
command terminated with exit code 1
[root@master1 ~]#

Let’s add a new policy that allows TCP port 8080, specifying a podSelector to match all pods with the label “role: web”.

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-tcp8080
spec:
  podSelector:
    matchLabels:
      role: web
  ingress:
  - ports:
    - protocol: TCP
      port: 8080

This alone doesn’t do anything; you still need to patch the deployment config and add the label “role: web” to the pod template metadata of your deployment config.

oc patch dc/hello-app-http --patch '{"spec":{"template":{"metadata":{"labels":{"role":"web"}}}}}'
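
Alternatively, instead of patching afterwards, the label can be set directly in the pod template of the deployment config; a minimal sketch of the relevant snippet (all other fields omitted):

spec:
  template:
    metadata:
      labels:
        role: web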

To roll back the previous change, simply use the ‘oc rollback dc/hello-app-http’ command.

Now let’s check the Open vSwitch flow table, and you will see that a new flow has been added with the destination of my hello-openshift pod 10.128.0.103 on port 8080.

Afterwards we try again to connect to my hello-app-http service, and you can see that we get a successful connection:

[root@master1 ~]# oc exec ovs-q4p8m -n openshift-sdn -- ovs-ofctl -O OpenFlow13 dump-flows br0 | grep '10.128.0.103.*8080'
 cookie=0x0, duration=221.251s, table=80, n_packets=15, n_bytes=1245, priority=150,tcp,reg1=0x2dfc74,nw_dst=10.128.0.103,tp_dst=8080 actions=output:NXM_NX_REG2[]
[root@master1 ~]#
[root@master1 ~]# oc exec busybox-1-wn592 -- wget -S --spider http://hello-app-http
Connecting to hello-app-http (172.30.231.77:80)
  HTTP/1.1 200 OK
  Date: Tue, 19 Feb 2019 14:21:57 GMT
  Content-Length: 17
  Content-Type: text/plain; charset=utf-8
  Connection: close

[root@master1 ~]#

The hello-openshift container publishes two TCP ports, 8080 and 8888, so finally let’s try to connect to the pod IP address on port 8888. We will find that the connection fails, because only port 8080 is allowed in the policy.

[root@master1 ~]# oc exec busybox-1-wn592 -- wget -S --spider http://10.128.0.103:8888
Connecting to 10.128.0.103:8888 (10.128.0.103:8888)
wget: can't connect to remote host (10.128.0.103): Connection timed out
command terminated with exit code 1
[root@master1 ~]#

There are great posts on the Red Hat OpenShift blog which you should check out: networkpolicies-and-microsegmentation and openshift-and-network-security-zones-coexistence-approaches. Otherwise I can recommend having a look at Ahmet Alp Balkan’s GitHub repository with Kubernetes network policy recipes, where you can find some good examples.

BGP EVPN and VXLAN with Cumulus Linux

I did some updates on my Cumulus Linux Vagrant topology and added new functions to my post about an Ansible Playbook for the Cumulus Linux BGP IP-Fabric.

To the Vagrant topology I added six servers; per clag pair, the first server is connected to a VLAN and the second server is connected to a VXLAN.

Here are the links to the repositories where you find the Ansible Playbook https://github.com/berndonline/cumulus-lab-provision and the Vagrantfile https://github.com/berndonline/cumulus-lab-vagrant

In the Ansible Playbook, I added BGP EVPN and one VXLAN which spans all Leaf and Edge switches. VXLAN routing happens on the Edge switches into the rest of the virtual data centre network.

Here is an example of the additional variables I added to edge-1 for BGP EVPN and VXLAN:

group_vars/edge.yml:

clagd_vxlan_anycast_ip: 10.255.100.1

The VXLAN anycast IP is needed by BGP for EVPN, and the same IP is shared between edge-1 and edge-2. The same applies to the other leaf switches: per clag pair, they share the same anycast IP address.
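
Purely to illustrate the pattern (the group name and IP value are made up and not taken from the repository), each clag pair carries its own shared value in its group variables, for example:

# group_vars/leaf-pair-1.yml (hypothetical)
clagd_vxlan_anycast_ip: 10.255.100.2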

host_vars/edge-1.yml:

---

loopback: 10.255.0.3/32

bgp_fabric:
  asn: 65001
  router_id: 10.255.0.3
  neighbor:
    - swp51
    - swp52
  networks:
    - 10.0.4.0/24
    - 10.255.0.3/32
    - 10.255.100.1/32
    - 10.0.255.0/28
  evpn: true
  advertise_vni: true

peerlink:
  bond_slaves: swp53 swp54
  mtu: 9216
  vlan: 4094
  address: 169.254.1.1/30
  clagd_peer_ip: 169.254.1.2
  clagd_backup_ip: 192.168.100.4
  clagd_sys_mac: 44:38:39:FF:40:94
  clagd_priority: 4096

bridge:
  ports: peerlink vxlan10201
  vids: 901 201

vlans:
  901:
    alias: edge-transit-901
    vipv4: 10.0.255.14/28
    vmac: 00:00:5e:00:09:01
    pipv4: 10.0.255.12/28
  201:
    alias: prod-server-10201
    vipv4: 10.0.4.254/24
    vmac: 00:00:00:00:02:01
    pipv4: 10.0.4.252/24
    vlan_id: 201
    vlan_raw_device: bridge

vxlans:
  10201:
    alias: prod-server-10201
    vxlan_local_tunnelip: 10.255.0.3
    bridge_access: 201
    bridge_learning: 'off'
    bridge_arp_nd_suppress: 'on'

On the Edge switches, because of VXLAN routing, you find a mapping between VXLAN 10201 and VLAN 201, which has VRR running.

I needed to make some modifications to the interfaces template interfaces_config.j2:

{% if loopback is defined %}
auto lo
iface lo inet loopback
    address {{ loopback }}
{% if clagd_vxlan_anycast_ip is defined %}
    clagd-vxlan-anycast-ip {{ clagd_vxlan_anycast_ip }}
{% endif %}
{% endif %}
...
{% if bridge is defined %}
{% for vxlan_id, value in vxlans.items() %}
auto vxlan{{ vxlan_id }}
iface vxlan{{ vxlan_id }}
    alias {{ value.alias }}
    vxlan-id {{ vxlan_id }}
    vxlan-local-tunnelip {{ value.vxlan_local_tunnelip }}
    bridge-access {{ value.bridge_access }}
    bridge-learning {{ value.bridge_learning }}
    bridge-arp-nd-suppress {{ value.bridge_arp_nd_suppress }}
    mstpctl-bpduguard yes
    mstpctl-portbpdufilter yes

{% endfor %}
{% endif %}
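
For context, the rendered template ends up as the interfaces configuration on each switch through an Ansible template task; a minimal sketch of what such a task could look like (task name, destination path and handler name are assumptions, not copied from the playbook):

- name: render interfaces configuration
  template:
    src: interfaces_config.j2
    dest: /etc/network/interfaces
  notify: reload interfaces

with a handler that runs ifreload -a to apply the change.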

There were also some modifications needed to the FRR template frr.j2 to add EVPN to the BGP configuration:

...
{% if bgp_fabric.evpn is defined %}
 address-family ipv6 unicast
  neighbor fabric activate
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor fabric activate
{% if bgp_fabric.advertise_vni is defined %}
  advertise-all-vni
{% endif %}
 exit-address-family
{% endif %}
{% endif %}
...

For more detailed information about EVPN and VXLAN routing on Cumulus Linux, I recommend reading the documentation Ethernet Virtual Private Network – EVPN and VXLAN Routing.

Have fun testing the new features in my Ansible Playbook and please share your feedback.
