OpenShift Container Platform Troubleshooting Guide

On the first look OpenShift/Kubernetes seems like a very complex platform but once you start to get to know the different components and what they are doing, you will see it gets easier and easier. The purpose of this article to give you an every day guide based on my experience on how to successfully troubleshoot issues on OpenShift.

  • OpenShift service logging
# OpenShift 3.1 to OpenShift 3.9:
/etc/sysconfig/atomic-openshift-master-controllers
/etc/sysconfig/atomic-openshift-master-api
/etc/sysconfig/atomic-openshift-node

# OpenShift 3.10 and later versions:
/etc/origin/master/master.env # for API and Controllers
/etc/sysconfig/atomic-openshift-node

The log levels for the OpenShift services can be controlled via the –loglevel parameter in the service options.

0 – Errors and warnings only
2 – Normal information
4 – Debugging information
6 – API- debugging information (request / response)
8 – Body API debugging information

For example add or edit the line in /etc/sysconfig/atomic-openshift-node to OPTIONS=’–loglevel=4′ and afterward restart the service with systemctl restart atomic-openshift-node.

Viewing OpenShift service logs:

# OpenShift 3.1 to OpenShift 3.9:
journalctl -u atomic-openshift-master-api
journalctl -u atomic-openshift-master-controllers
journalctl -u atomic-openshift-node
journalctl -u etcd # or 'etcd_container' for containerized install

# OpenShift 3.10 and later versions:
/usr/local/bin/master-logs api api
/usr/local/bin/master-logs controllers controllers
/usr/local/bin/master-logs etcd etcd
journalctl -u atomic-openshift-node
  • Docker service logging

Change the docker daemon log level and add the parameter –log-level for the OPTIONS variable in dockers service file located in /etc/sysconfig/docker.

The available log levels are: ( debug, info, warn, error, fatal )

See the example below; to enable debug logging in /etc/sysconfig/docker to set log level equal to debug (After making the changes on the docker service you need to will restart with systemctl restart docker.):

OPTIONS='--insecure-registry=172.30.0.0/16 --selinux-enabled --log-level=debug'
  • OC command logging

The oc and oadm command also accept a loglevel option that can help get additional information. Value between 6 and 8 will provide extensive logging, API requests (loglevel 6), API headers (loglevel 7) and API responses received (loglevel 8):

oc whoami --loglevel=8
  • OpenShift SkyDNS

SkyDNS is the internal service discovery for OpenShift and DNS is important for OpenShift to function:

# Test full qualified cluster domain name
nslookup docker-registry.default.svc.cluster.local
# OR
dig +short docker-registry.default.svc.cluster.local

# Check if clusterip match the previous result
oc get svc/docker-registry -n default

# Test short name
nslookup docker-registry.default.svc
nslookup <endpoint-name>.<project-name>.svc

If short name doesn’t work look out if cluster.local is missing in dns search suffix. If resolution doesn’t work at all before enable debug logging, check if Dnsmasq service running and correctly configured. OpenShift uses a dispatcher script to maintain the DNS configuration of a node.

Add the options “–logspec ‘dns=10’” to the /etc/sysconfig/atomic-openshift-node service configuration on a node running skydns and restart the atomic-openshift-node service afterwards. There will then be skydns debug information in the journalctl logs.

OPTIONS="--loglevel=2 --logspec dns*=10"
  • OpenShift Master API and Web Console

In the following example, the internal-master.domain.com is used by the internal cluster, and the master.domain.com is used by external clients

# Run the following commands on any node host
curl https://internal-master.domain.com:443/version
curl -k https://master.domain.com:443/healthz

# The OpenShift API service runs on all master instances. To see the status of the service, view the master-api pods in the kube-system project:
oc get pod -n kube-system -l openshift.io/component=api
oc get pod -n kube-system -o wide
curl -k --insecure https://$HOSTNAME:8443/healthz
  • OpenShift Controller role

The OpenShift Container Platform controller service is available on all master nodes. The service runs in active/passive mode, which means it should only be running on one master.

# Verify the master host running the controller service
oc get -n kube-system cm openshift-master-controllers -o yaml
    • OpenShift Certificates

During the installation of OpenShift the playbooks generates a CA to sign every certificate in the cluster. One of the most common issues are expired node certificates. Below are a list of important certificate files:

# Is the OpenShift Certificate Authority, and it signs every other certificate unless specified otherwise.
/etc/origin/master/ca.crt

# Contains a bundle with the current and the old CA's (if exists) to trust them all. If there has been only one ca.crt, then this file is the same as ca.crt.
/etc/origin/master/ca-bundle.crt

# The internal API, also known as cluster internal address or the variable masterURL here all the internal components authenticates to access the API, such as nodes, routers and other services.
/etc/origin/master/master.server.crt

# Master controller certificate authenticates to kubernetes as a client using the admin.kubeconfig
/etc/origin/master/admin.crt

# Node certificates
/etc/origin/node/ca.crt                   # to be able to trust the API, a copy of masters CA bundle is placed in:
/etc/origin/node/server.crt               # to secure this communication
/etc/origin/node/system:node:{fqdn}.crt   # Nodes needs to authenticate to the Kubernetes API as a client. 

# Etcd certificates
/etc/etcd/ca.crt                          # is the etcd CA, it is used to sign every certificate.
/etc/etcd/server.crt                      # is used by the etcd to listen to clients.
/etc/etcd/peer.crt                        # is used by etcd to authenticate as a client.

# Master certificates to auth to etcd
/etc/origin/master/master.etcd-ca.crt     # is a copy of /etc/etcd/ca.crt. Used to trust the etcd cluster.
/etc/origin/master/master.etcd-client.crt # is used to authenticate as a client of the etcd cluster.

# Services ca certificate. All self-signed internal certificates are signed by this CA.
/etc/origin/master/service-signer.crt

Here’s an example to check the validity of the master server certificate:

cat /etc/origin/master/master.server.crt | openssl x509 -text | grep -i Validity -A2
# OR
openssl x509 -enddate -noout -in /etc/origin/master/master.server.crt

It’s worth checking the documentation about how to re-deploy certificates on OpenShift.

  • OpenShift etcd

On the etcd node (master) set source to etcd.conf file for most of the needed variables.

source /etc/etcd/etcd.conf
export ETCDCTL_API=3

# Set endpoint variable to include all etcd endpoints
ETCD_ALL_ENDPOINTS=` etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`

# Cluster status and health checks
etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table  member list
etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_ALL_ENDPOINTS  --write-out=table endpoint status
etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_ALL_ENDPOINTS endpoint health

Check etcd database key entries:

etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints="https://$(hostname):2379" get /openshift.io --prefix --keys-only
  • OpenShift Registry

To get detailed information about the pods running the internal registry run the following command:

oc get pods -n default | grep registry | awk '{ print $1 }' | xargs -i oc describe pod {}

For a basic health check that the internal registry is running and responding you need to “curl” the /healthz path. Normally this should return a 200 HTTP response:

Registry=$(oc get svc docker-registry -n default -o 'jsonpath={.spec.clusterIP}:{.spec.ports[0].port}')

curl -vk $Registry/healthz
# OR
curl -vk https://$Registry/healthz

If a persistent volume is attached to the registry make sure that the registry can write to the volume.

oc project default 
oc rsh `oc get pods -o name -l docker-registry`

$ touch /registry/test-file
$ ls -la /registry/ 
$ rm /registry/test-file
$ exit

If the registry is insecure make sure you have edited the /etc/sysconfig/docker file and add –insecure-registry 172.30.0.0/16 to the OPTIONS parameter on the nodes.

For more information about testing the internal registry please have a look at the documentation about Accessing the Registry.

  • OpenShift Router 

To increase the log level for OpenShift router pod, set loglevel=4 in the container args:

# Increase logging level
oc patch dc -n default router -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value":["--loglevel=4"]}]' --type=json 

# View logs
oc logs <router-pod-name> -n default

# Remove logging change 
oc patch dc -n default router -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/args", "value":["--loglevel=4"]}]' --type=json

OpenShift router image version 3.3 and later, the logging for http requests can be forwarded to an external syslog server:

oc set env dc/router ROUTER_SYSLOG_ADDRESS=<syslog-server-ip> ROUTER_LOG_LEVEL=debug

If you are facing issues with ingress routes to your application run the below command to collect more information:

oc logs dc/router -n default
oc get dc/router -o yaml -n default
oc get route <route-name> -n <project-name> 
oc get endpoints --all-namespaces 
oc exec -it <router-pod-name> -- ls -la 
oc exec -it <router-pod-name> -- find /var/lib/haproxy -regex ".*\(.map\|config.*\|.json\)" -print -exec cat {} \; > haproxy_configs_and_maps

Check if your application domain is /paas.domain.com/ and dig for an ANSWER containing the load balancer VIP address:

dig \*.paas.domain.com

Confirm that certificates are being severed out correctly by running the following:

echo -n | openssl s_client -connect :443 -servername myapp.paas.domain.com 2>&1 | openssl x509 -noout -text
curl -kv https://myapp.paas.domain.com 
  • OpenShift SDN

Please checkout the official Troubleshooting OpenShift SDN documentation

To get OpenFlow table export, connect to the openvswitch container and run following command:

docker exec openvswitch ovs-ofctl -O OpenFlow13 dump-flows br0
  • OpenShift Namespace events

Useful to collect events from the namespace to identify pod creation issues before you did in the container logs:

oc get events [-n |--all-namespaces]

In the default namespace you find relevant events for monitoring or auditing a cluster, such as Node and resource events related to the OpenShift platform.

  • OpenShift Pod and Container Logs

Container/pod logs can be viewed using the OpenShift oc command line. Add option “-p” to print the logs for the previous instance of the container in a pod if it exists and add option “-f” to stream the logs:

oc logs <pod-name> [-f]

The logs are saved to the worker nodes disk where the container/pod is running and it is located at:
/var/lib/docker/containers/<container-id>/<container-id>-json.log.

For setting the log file limits for containers on a worker node the –log-opt can be configured with max-size and max-file so that a containers logs are rolled over:

# cat /etc/sysconfig/docker 
OPTIONS='--insecure-registry=172.30.0.0/16 --selinux-enabled --log-opt max-size=50m --log-opt max-file=5'

# Restart docker service for the changes to take effect.
systemctl restart docker 

To remove all logs from a given container run the following commands:

cat /dev/null > /var/lib/docker/containers/<container-id>/<container-id>-json.log
# OR
cat /dev/null >  $(docker inspect --format='{{.LogPath}}' <container-id> )

To generate a list of the largest files run the following commands:

# Log files
find /var/lib/docker/ -name "*.log" -exec ls -sh {} \; | sort -n -r | head -20

# All container files
du -aSh /var/lib/docker/ | sort -n -r | head -n 10

Finding out the veth# interface of a docker container and use tcpdump to capture traffic more easily. The iflink of the container is the same as the ifindex of the veth#. You can get the iflink of the container as follows:

docker exec -it <container-name>  bash -c 'cat /sys/class/net/eth0/iflink'

# Let's say that the results in 14, then grep for 14
grep -l 14 /sys/class/net/veth*/ifindex

# Which will give a unique result on the worker node
/sys/class/net/veth12c4982/ifindex

Here a simple bash script to get the container and veth id’s:

#!/bin/bash
for container in $(docker ps -q); do
    iflink=`docker exec -it $container bash -c 'cat /sys/class/net/eth0/iflink'`
    iflink=`echo $iflink|tr -d '\r'`
    veth=`grep -l $iflink /sys/class/net/veth*/ifindex`
    veth=`echo $veth|sed -e 's;^.*net/\(.*\)/ifindex$;\1;'`
    echo $container:$veth
done
  • OpenShift Builder Pod Logs

If you want to troubleshoot a particular build of “myapp” you can view logs with:

oc logs [bc/|dc/]<name> [-f]

To increase the logging level add a BUILD_LOGLEVEL environment variable to the BuildConfig strategy:

sourceStrategy:
...
  env:
    - name: "BUILD_LOGLEVEL"
      value: "5"

I hope you found this article useful and that it helped you troubleshoot OpenShift. Please let me know what you think and leave a comment.

Build Ansible Tower Container

After creating my Jenkins container I thought it would be fun to run Ansible Tower in a container so I created a simple Dockerfile. First you need find out the latest Ansible Tower version: https://releases.ansible.com/ansible-tower/setup/ and update the version variable in the Dockerfile.

Here is my Dockerfile:

...
ARG ANSIBLE_TOWER_VER=3.3.1-1
...

The passwords can be changed in the inventory file:

...
[all:vars]
admin_password='<-your-password->'
...
pg_password='<-your-password->'
...
rabbitmq_password='<-your-password->'
...

Let’s start by building the container:

git clone https://github.com/berndonline/ansible-tower-docker.git && cd ansible-tower-docker/
docker build -t berndonline/ansible-tower .

The docker build will take a few minutes, just wait and look out for errors you might have in the build:

berndonline@lab:~$ git clone https://github.com/berndonline/ansible-tower-docker.git
Cloning into 'ansible-tower-docker'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 17 (delta 4), reused 14 (delta 4), pack-reused 0
Unpacking objects: 100% (17/17), done.
berndonline@lab:~$ cd ansible-tower-docker/
berndonline@lab:~/ansible-tower-docker$ docker build -t berndonline/ansible-tower .
Sending build context to Docker daemon  87.04kB
Step 1/31 : FROM ubuntu:16.04
16.04: Pulling from library/ubuntu
7b8b6451c85f: Pull complete
ab4d1096d9ba: Pull complete
e6797d1788ac: Pull complete
e25c5c290bde: Pull complete
Digest: sha256:e547ecaba7d078800c358082088e6cc710c3affd1b975601792ec701c80cdd39
Status: Downloaded newer image for ubuntu:16.04
 ---> a51debf7e1eb
Step 2/31 : USER root
 ---> Running in cf5d606130cc
Removing intermediate container cf5d606130cc
 ---> d5b11ed84885
Step 3/31 : WORKDIR /opt
 ---> Running in 1e6703cec6db
Removing intermediate container 1e6703cec6db
 ---> 045cf04ebc1d
Step 4/31 : ARG ANSIBLE_TOWER_VER=3.3.1-1
 ---> Running in 6d65bfe370d4
Removing intermediate container 6d65bfe370d4
 ---> d75c246c3a5c
Step 5/31 : ARG PG_DATA=/var/lib/postgresql/9.6/main
 ---> Running in e8856051aa92
Removing intermediate container e8856051aa92
 ---> 02e6d7593df8

...

PLAY [Install Tower isolated node(s)] ******************************************
skipping: no hosts matched

PLAY RECAP *********************************************************************
localhost                  : ok=125  changed=64   unreachable=0    failed=0

The setup process completed successfully.
Setup log saved to /var/log/tower/setup-2018-11-21-20:21:37.log
Removing intermediate container ad6401292444
 ---> 8f1eb28f16cb
Step 27/31 : ADD entrypoint.sh /entrypoint.sh
 ---> 8503e666ce9c
Step 28/31 : RUN chmod +x /entrypoint.sh
 ---> Running in 8b5ca24a320a
Removing intermediate container 8b5ca24a320a
 ---> 60810dc2a4e3
Step 29/31 : VOLUME ["${PG_DATA}", "${AWX_PROJECTS}","/certs"]
 ---> Running in d836e5455bd5
Removing intermediate container d836e5455bd5
 ---> 3968430a1814
Step 30/31 : EXPOSE 80
 ---> Running in 9a72815e365b
Removing intermediate container 9a72815e365b
 ---> 3613ced2a80c
Step 31/31 : ENTRYPOINT ["/entrypoint.sh", "ansible-tower"]
 ---> Running in 4611a90aff1a
Removing intermediate container 4611a90aff1a
 ---> ce89ea0753d4
Successfully built ce89ea0753d4
Successfully tagged berndonline/ansible-tower:latest

Continue to create a Docker Volume container to store the Postgres database:

sudo docker create -v /var/lib/postgresql/9.6/main --name tower-data berndonline/ansible-tower /bin/true

Start the Ansible Tower Docker container:

sudo docker run -d -p 32456:80 --volumes-from tower-data --name ansible-tower --privileged --restart berndonline/ansible-tower

Afterwards you can connect to http://<your-ip-address>:32456/ and import your Tower license. Ansible provides a free 10 node license which you can request here: https://www.ansible.com/license.

The Ansible Tower playbook installs an Nginx reverse proxy and you can enable SSL by setting the variable nginx_disable_https to false in the inventory file, and publish the container via 443 instead of 80.

Please share your feedback and leave a comment.

Build Jenkins Container with Terraform and Ansible

I thought it might be interesting to show how to build a Docker container running Jenkins and tools like Terraform and Ansible. I am planning to use a Jenkins pipeline to deploy my OpenShift 3.11 example on AWS using Terraform and Ansible but more about this in the next post.

I am using the source Dockerfile from Jenkins and modified it, and added Ansible and Terraform: https://github.com/jenkinsci/docker. Below you see a few variables you might need to change depending on the version you are trying to use or where to place the volume mount. Have a look here for the latest Jenkins version: https://updates.jenkins-ci.org/download/war/.

Here is my Dockerfile:

...
ARG JENKINS_HOME=/var/jenkins_home
...
ENV TERRAFORM_VERSION=0.11.10
... 
ARG JENKINS_VERSION=2.151
ENV JENKINS_VERSION $JENKINS_VERSION
...
ARG JENKINS_SHA=a4335cc626c1f64da61a20174af654283d171b255a928bbacb6402a315e213d7
...

Let’s start and clone my Jenkins Docker repository  and run docker build:

git clone https://github.com/berndonline/jenkins-docker.git && cd ./jenkins-docker/
docker build -t berndonline/jenkins .

The docker build will take a few minutes, just wait and look out for error you might have with the build:

berndonline@lab:~/jenkins-docker$ docker build -t berndonline/jenkins .
Sending build context to Docker daemon  141.3kB
Step 1/51 : FROM openjdk:8-jdk
8-jdk: Pulling from library/openjdk
54f7e8ac135a: Pull complete
d6341e30912f: Pull complete
087a57faf949: Pull complete
5d71636fb824: Pull complete
9da6b28682cf: Pull complete
203f1094a1e2: Pull complete
ee38d9f85cf6: Pull complete
7f692fae02b6: Pull complete
eaa976dc543c: Pull complete
Digest: sha256:94bbc3357f995dd37986d8da0f079a9cd4b99969a3c729bad90f92782853dea7
Status: Downloaded newer image for openjdk:8-jdk
 ---> c14ba9d23b3a
Step 2/51 : USER root
 ---> Running in c78f75ca3d5a
Removing intermediate container c78f75ca3d5a
 ---> f2c6bb7538ea
Step 3/51 : RUN apt-get update && apt-get install -y git curl && rm -rf /var/lib/apt/lists/*
 ---> Running in 4cc857e12f50
Ign:1 http://deb.debian.org/debian stretch InRelease
Get:2 http://security.debian.org/debian-security stretch/updates InRelease [94.3 kB]
Get:3 http://deb.debian.org/debian stretch-updates InRelease [91.0 kB]
Get:4 http://deb.debian.org/debian stretch Release [118 kB]
Get:5 http://security.debian.org/debian-security stretch/updates/main amd64 Packages [459 kB]
Get:6 http://deb.debian.org/debian stretch Release.gpg [2434 B]
Get:7 http://deb.debian.org/debian stretch-updates/main amd64 Packages [5152 B]
Get:8 http://deb.debian.org/debian stretch/main amd64 Packages [7089 kB]
Fetched 7859 kB in 1s (5540 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...

...

Step 49/51 : ENTRYPOINT ["/sbin/tini", "--", "/usr/local/bin/jenkins.sh"]
 ---> Running in 28da7c4bf90a
Removing intermediate container 28da7c4bf90a
 ---> f380f1a6f06f
Step 50/51 : COPY plugins.sh /usr/local/bin/plugins.sh
 ---> 82871f0df0dc
Step 51/51 : COPY install-plugins.sh /usr/local/bin/install-plugins.sh
 ---> feea9853af70
Successfully built feea9853af70
Successfully tagged berndonline/jenkins:latest
berndonline@lab:~/jenkins-docker$

The Docker container is successfully build:

berndonline@lab:~/jenkins-docker$ docker images
REPOSITORY                  TAG                 IMAGE ID            CREATED             SIZE
berndonline/jenkins         latest              cd1742c317fa        6 days ago          1.28GB

Let’s start the Docker container:

docker run -d -v /var/jenkins_home:/var/jenkins_home -p 32771:8080 -p 32770:50000 berndonline/jenkins

Quick check that the container is successfully created:

berndonline@lab:~/jenkins-docker$ docker ps
CONTAINER ID        IMAGE                 COMMAND                  CREATED             STATUS              PORTS                                               NAMES
7073fa9c0cd4        berndonline/jenkins   "/sbin/tini -- /usr/…"   5 days ago          Up 7 seconds        0.0.0.0:32771->8080/tcp, 0.0.0.0:32770->50000/tcp   jenkins

Afterwards you can connect to http://<your-ip-address>:32771/ and do the initial Jenkins configuration, like changing admin password and install needed plugins. I recommend putting an Nginx reverse proxy with SSL infront to secure Jenkins properly.

So what about updates or changing the configuration? – Pretty easy; because we are using a Docker bind mount to /var/jenkins_home/, all the Jenkins related data is stored on the local file system of your server and you can re-create or re-build the container at anytime.

I hope you like this article about how to create your down Jenkins Docker container. In my next post I will create a very simple Jenkins pipeline to deploy OpenShift 3.11 on AWS using Terraform.

Please share your feedback and leave a comment.

Deploy OpenShift 3.11 Container Platform on AWS using Terraform

I have done a few changes on my Terraform configuration for OpenShift 3.11 on Amazon AWS. I have downsized the environment because I didn’t needed that many nodes for a quick test setup. I have added CloudFlare DNS to automatically create CNAME for the AWS load balancers on the DNS zone. I have also added an AWS S3 Bucket for storing the backend state. You can find the new Terraform configuration on my Github repository: https://github.com/berndonline/openshift-terraform/tree/aws-dev

From OpenShift 3.10 and later versions the environment variables changes and I modified the ansible-hosts template for the new configuration. You can see the changes in the hosts template: https://github.com/berndonline/openshift-terraform/blob/aws-dev/helper_scripts/ansible-hosts.template.txt

OpenShift 3.11 has changed a few things and put an focus on an Cluster Operator console which is pretty nice and runs on Kubernetes 1.11. I recommend reading the release notes for the 3.11 release for more details: https://docs.openshift.com/container-platform/3.11/release_notes/ocp_3_11_release_notes.html

I don’t wanted to get into too much detail, just follow the steps below and start with cloning my repository, and choose the dev branch:

git clone -b aws-dev https://github.com/berndonline/openshift-terraform.git
cd ./openshift-terraform/
ssh-keygen -b 2048 -t rsa -f ./helper_scripts/id_rsa -q -N ""
chmod 600 ./helper_scripts/id_rsa

You need to modify the cloudflare.tf and add your CloudFlare API credentials otherwise just delete the file. The same for the S3 backend provider, you find the configuration in the main.tf and it can be removed if not needed.

CloudFlare and Amazon AWS credentials can be added through environment variables:

export AWS_ACCESS_KEY_ID='<-YOUR-AWS-ACCESS-KEY->'
export AWS_SECRET_ACCESS_KEY='<-YOUR-AWS-SECRET-KEY->'
export TF_VAR_email='<-YOUR-CLOUDFLARE-EMAIL-ADDRESS->'
export TF_VAR_token='<-YOUR-CLOUDFLARE-TOKEN->'
export TF_VAR_domain='<-YOUR-CLOUDFLARE-DOMAIN->'
export TF_VAR_htpasswd='<-YOUR-OPENSHIFT-DEMO-USER-HTPASSWD->'

Run terraform init and apply to create the environment.

terraform init && terraform apply -auto-approve

Copy the ssh key and ansible-hosts file to the bastion host from where you need to run the Ansible OpenShift playbooks.

scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ./helper_scripts/id_rsa -r ./helper_scripts/id_rsa centos@$(terraform output bastion):/home/centos/.ssh/
scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ./helper_scripts/id_rsa -r ./inventory/ansible-hosts  centos@$(terraform output bastion):/home/centos/ansible-hosts

I recommend waiting a few minutes as the AWS cloud-init script prepares the bastion host. Afterwards continue with the pre and install playbooks. You can connect to the bastion host and run the playbooks directly.

ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ./helper_scripts/id_rsa -l centos $(terraform output bastion) -A "cd /openshift-ansible/ && ansible-playbook ./playbooks/openshift-pre.yml -i ~/ansible-hosts"
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ./helper_scripts/id_rsa -l centos $(terraform output bastion) -A "cd /openshift-ansible/ && ansible-playbook ./playbooks/openshift-install.yml -i ~/ansible-hosts"

If for whatever reason the cluster deployment fails, you can run the uninstall playbook to bring the nodes back into a clean state and start from the beginning and run deploy_cluster.

ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ./helper_scripts/id_rsa -l centos $(terraform output bastion) -A "cd /openshift-ansible/ && ansible-playbook ./openshift-ansible/playbooks/adhoc/uninstall.yml -i ~/ansible-hosts"

Here are some screenshots of the new cluster console:

Let’s create a project and import my hello-openshift.yml build configuration:

Successful completed the build and deployed the hello-openshift container:

My example hello openshift application:

When you are finished with the testing, run terraform destroy.

terraform destroy -force 

 

Deploy OpenShift 3.9 Container Platform using Terraform and Ansible on Amazon AWS

After my previous articles on OpenShift and Terraform I wanted to show how to create the necessary infrastructure and to deploy an OpenShift Container Platform in a more real-world scenario. I highly recommend reading my other posts about using Terraform to deploy an Amazon AWS VPC and AWS EC2 Instances and Load Balancers. Once the infrastructure is created we will use the Bastion Host to connect to the environment and deploy OpenShift Origin using Ansible.

I think this might be an interesting topic to show what tools like Terraform and Ansible can do together:

I will not go into detail about the configuration and only show the output of deploying the infrastructure. Please checkout my Github repository to see the detailed configuration: https://github.com/berndonline/openshift-terraform

Before we start you need to clone the repository and generate the ssh key used from the bastion host to access the OpenShift nodes:

git clone https://github.com/berndonline/openshift-terraform.git
cd ./openshift-terraform/
ssh-keygen -b 2048 -t rsa -f ./helper_scripts/id_rsa -q -N ""
chmod 600 ./helper_scripts/id_rsa

We are ready to create the infrastructure and run terraform apply:

berndonline@lab:~/openshift-terraform$ terraform apply

...

Plan: 56 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

...

Apply complete! Resources: 19 added, 0 changed, 16 destroyed.

Outputs:

bastion = ec2-34-244-225-35.eu-west-1.compute.amazonaws.com
openshift master = master-35563dddc8b2ea9c.elb.eu-west-1.amazonaws.com
openshift subdomain = infra-1994425986.eu-west-1.elb.amazonaws.com
berndonline@lab:~/openshift-terraform$

Terraform successfully creates the VPC, load balancers and all needed instances. Before we continue wait 5 to 10 minutes because the cloud-init script takes a bit time and all the instance reboot at the end.

Instances:

Security groups:

Target groups for the Master and the Infra load balancers:

Master and the Infra load balancers:

Terraform also automatically creates the inventory file for the OpenShift installation and adds the hostnames for master, infra and worker nodes to the correct inventory groups. The next step is to copy the private ssh key and the inventory file to the bastion host. I am using the terraform output command to get the public hostname from the bastion host:

scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -r ./helper_scripts/id_rsa centos@$(terraform output bastion):/home/centos/.ssh/
scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -r ./inventory/ansible-hosts  centos@$(terraform output bastion):/home/centos/ansible-hosts
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -l centos $(terraform output bastion)

On the bastion node, change to the /openshift-ansible/ folder and start running the prerequisites and the deploy-cluster playbooks:

cd /openshift-ansible/
ansible-playbook ./playbooks/prerequisites.yml -i ~/ansible-hosts
ansible-playbook ./playbooks/deploy_cluster.yml -i ~/ansible-hosts

Here the output from running the prerequisites playbook:

[centos@ip-10-0-0-22 ~]$ cd /openshift-ansible/
[centos@ip-10-0-0-22 openshift-ansible]$ ansible-playbook ./playbooks/prerequisites.yml -i ~/ansible-hosts

PLAY [Initialization Checkpoint Start] ****************************************************************************************************************************

TASK [Set install initialization 'In Progress'] *******************************************************************************************************************
Saturday 15 September 2018  11:04:50 +0000 (0:00:00.407)       0:00:00.407 ****
ok: [ip-10-0-1-237.eu-west-1.compute.internal]

PLAY [Populate config host groups] ********************************************************************************************************************************

TASK [Load group name mapping variables] **************************************************************************************************************************
Saturday 15 September 2018  11:04:50 +0000 (0:00:00.110)       0:00:00.517 ****
ok: [localhost]

TASK [Evaluate groups - g_etcd_hosts or g_new_etcd_hosts required] ************************************************************************************************
Saturday 15 September 2018  11:04:51 +0000 (0:00:00.033)       0:00:00.551 ****
skipping: [localhost]

TASK [Evaluate groups - g_master_hosts or g_new_master_hosts required] ********************************************************************************************
Saturday 15 September 2018  11:04:51 +0000 (0:00:00.024)       0:00:00.575 ****
skipping: [localhost]

TASK [Evaluate groups - g_node_hosts or g_new_node_hosts required] ************************************************************************************************
Saturday 15 September 2018  11:04:51 +0000 (0:00:00.024)       0:00:00.599 ****
skipping: [localhost]

...

PLAY RECAP ********************************************************************************************************************************************************
ip-10-0-1-192.eu-west-1.compute.internal : ok=56   changed=14   unreachable=0    failed=0
ip-10-0-1-237.eu-west-1.compute.internal : ok=64   changed=15   unreachable=0    failed=0
ip-10-0-1-248.eu-west-1.compute.internal : ok=56   changed=14   unreachable=0    failed=0
ip-10-0-5-174.eu-west-1.compute.internal : ok=56   changed=14   unreachable=0    failed=0
ip-10-0-5-235.eu-west-1.compute.internal : ok=58   changed=14   unreachable=0    failed=0
ip-10-0-5-35.eu-west-1.compute.internal : ok=56   changed=14   unreachable=0    failed=0
ip-10-0-9-130.eu-west-1.compute.internal : ok=56   changed=14   unreachable=0    failed=0
ip-10-0-9-51.eu-west-1.compute.internal : ok=58   changed=14   unreachable=0    failed=0
ip-10-0-9-85.eu-west-1.compute.internal : ok=56   changed=14   unreachable=0    failed=0
localhost                  : ok=11   changed=0    unreachable=0    failed=0


INSTALLER STATUS **************************************************************************************************************************************************
Initialization             : Complete (0:00:41)

[centos@ip-10-0-0-22 openshift-ansible]$

Continue with the deploy cluster playbook:

[centos@ip-10-0-0-22 openshift-ansible]$ ansible-playbook ./playbooks/deploy_cluster.yml -i ~/ansible-hosts

PLAY [Initialization Checkpoint Start] ****************************************************************************************************************************

TASK [Set install initialization 'In Progress'] *******************************************************************************************************************
Saturday 15 September 2018  11:08:38 +0000 (0:00:00.102)       0:00:00.102 ****
ok: [ip-10-0-1-237.eu-west-1.compute.internal]

PLAY [Populate config host groups] ********************************************************************************************************************************

TASK [Load group name mapping variables] **************************************************************************************************************************
Saturday 15 September 2018  11:08:38 +0000 (0:00:00.064)       0:00:00.167 ****
ok: [localhost]

TASK [Evaluate groups - g_etcd_hosts or g_new_etcd_hosts required] ************************************************************************************************
Saturday 15 September 2018  11:08:38 +0000 (0:00:00.031)       0:00:00.198 ****
skipping: [localhost]

TASK [Evaluate groups - g_master_hosts or g_new_master_hosts required] ********************************************************************************************
Saturday 15 September 2018  11:08:38 +0000 (0:00:00.026)       0:00:00.225 ****
skipping: [localhost]

...

PLAY RECAP ********************************************************************************************************************************************************
ip-10-0-1-192.eu-west-1.compute.internal : ok=132  changed=57   unreachable=0    failed=0
ip-10-0-1-237.eu-west-1.compute.internal : ok=591  changed=256  unreachable=0    failed=0
ip-10-0-1-248.eu-west-1.compute.internal : ok=132  changed=57   unreachable=0    failed=0
ip-10-0-5-174.eu-west-1.compute.internal : ok=132  changed=57   unreachable=0    failed=0
ip-10-0-5-235.eu-west-1.compute.internal : ok=325  changed=145  unreachable=0    failed=0
ip-10-0-5-35.eu-west-1.compute.internal : ok=132  changed=57   unreachable=0    failed=0
ip-10-0-9-130.eu-west-1.compute.internal : ok=132  changed=57   unreachable=0    failed=0
ip-10-0-9-51.eu-west-1.compute.internal : ok=325  changed=145  unreachable=0    failed=0
ip-10-0-9-85.eu-west-1.compute.internal : ok=132  changed=57   unreachable=0    failed=0
localhost                  : ok=13   changed=0    unreachable=0    failed=0

INSTALLER STATUS **************************************************************************************************************************************************
Initialization             : Complete (0:00:55)
Health Check               : Complete (0:00:01)
etcd Install               : Complete (0:01:03)
Master Install             : Complete (0:05:17)
Master Additional Install  : Complete (0:00:26)
Node Install               : Complete (0:08:24)
Hosted Install             : Complete (0:00:57)
Web Console Install        : Complete (0:00:28)
Service Catalog Install    : Complete (0:01:19)

[centos@ip-10-0-0-22 openshift-ansible]$

Once the deploy playbook finishes we have a working Openshift cluster:

Login with username: demo, and password: demo

For the infra load balancers you cannot access OpenShift routes via the Amazon DNS, this is not allowed. You need to create a wildcard DNS CNAME record like *.paas.domain.com and point to the AWS load balancer DNS record.

Let’s continue to do some basic cluster checks to see the nodes are in ready state:

[centos@ip-10-0-1-237 ~]$ oc get nodes
NAME                                       STATUS    ROLES     AGE       VERSION
ip-10-0-1-192.eu-west-1.compute.internal   Ready     compute   11m       v1.9.1+a0ce1bc657
ip-10-0-1-237.eu-west-1.compute.internal   Ready     master    16m       v1.9.1+a0ce1bc657
ip-10-0-1-248.eu-west-1.compute.internal   Ready         11m       v1.9.1+a0ce1bc657
ip-10-0-5-174.eu-west-1.compute.internal   Ready     compute   11m       v1.9.1+a0ce1bc657
ip-10-0-5-235.eu-west-1.compute.internal   Ready     master    15m       v1.9.1+a0ce1bc657
ip-10-0-5-35.eu-west-1.compute.internal    Ready         11m       v1.9.1+a0ce1bc657
ip-10-0-9-130.eu-west-1.compute.internal   Ready     compute   11m       v1.9.1+a0ce1bc657
ip-10-0-9-51.eu-west-1.compute.internal    Ready     master    14m       v1.9.1+a0ce1bc657
ip-10-0-9-85.eu-west-1.compute.internal    Ready         11m       v1.9.1+a0ce1bc657
[centos@ip-10-0-1-237 ~]$
[centos@ip-10-0-1-237 ~]$ oc get projects
NAME                                DISPLAY NAME   STATUS
default                                            Active
kube-public                                        Active
kube-service-catalog                               Active
kube-system                                        Active
logging                                            Active
management-infra                                   Active
openshift                                          Active
openshift-ansible-service-broker                   Active
openshift-infra                                    Active
openshift-node                                     Active
openshift-template-service-broker                  Active
openshift-web-console                              Active
[centos@ip-10-0-1-237 ~]$
[centos@ip-10-0-1-237 ~]$ oc get pods -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP           NODE
docker-registry-1-8798r    1/1       Running   0          10m       10.128.2.2   ip-10-0-5-35.eu-west-1.compute.internal
registry-console-1-zh9m4   1/1       Running   0          10m       10.129.2.3   ip-10-0-9-85.eu-west-1.compute.internal
router-1-96zzf             1/1       Running   0          10m       10.0.9.85    ip-10-0-9-85.eu-west-1.compute.internal
router-1-nfh7h             1/1       Running   0          10m       10.0.1.248   ip-10-0-1-248.eu-west-1.compute.internal
router-1-pcs68             1/1       Running   0          10m       10.0.5.35    ip-10-0-5-35.eu-west-1.compute.internal
[centos@ip-10-0-1-237 ~]$

At the end just destroy the infrastructure with terraform destroy:

berndonline@lab:~/openshift-terraform$ terraform destroy

...

Destroy complete! Resources: 56 destroyed.
berndonline@lab:~/openshift-terraform$

I will continue improving the configuration and I plan to use Jenkins to deploy the AWS infrastructure and OpenShift fully automatically.

Please let me know if you like the article or have questions in the comments below.