Getting started with Kubernetes Operators in Golang

In the past few weeks I started to learn Golang and beginners like me can make quick progress once you understand the structure and some basics about the programming language. I felt that from all the learning and reading I’ve done on Golang and Kubernetes operators, I had enough knowledge to start writing my own Kubernetes operator in Go.

At the beginning of last year, RedHat released the operator-sdk which helps to create the scaffolding for writing your own operators in Ansible, Helm or natively in Golang. There has been quite a few changes along the way around the operator-sdk and it is maturing a lot over the course of the past year.

The instructions on how to install Golang can be found on the Golang website and we need the latest version of the operator-sdk:

$ wget
$ mv operator-sdk-v1.2.0-x86_64-linux-gnu operator-sdk
$ sudo mv operator-sdk /usr/local/bin/

Create a new folder and start to initialise the project. You see that I have already set the option --domain so all API groups will be <-group-> The --repo option allows me to create the project folder outside of my $GOPATH environment. Infos about the folder structure you can find in the Kubebuilder documentation:

$ mkdir k8s-helloworld-operator
$ cd k8s-helloworld-operator
$ operator-sdk init

The last thing we need before we start writing the operator is to create a new API and Controller and this will scaffold the operator API at api/v1alpha1/operator_types.go and the controller at controllers/operator_controller.go.

$ operator-sdk create api --group app --version v1alpha1 --kind Operator
Create Resource [y/n]
Create Controller [y/n]
Writing scaffold for you to edit...
  • Define your API

Define your API for the operator custom resource by editing the Go type definitions at api/v1alpha1/operator_types.go

// OperatorSpec defines the desired state of Operator
type OperatorSpec struct {
	// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
	// Important: Run "make" to regenerate code after modifying this file

	// Foo is an example field of Operator. Edit Operator_types.go to remove/update
	Size     int32  `json:"size"`
	Image    string `json:"image"`
	Response string `json:"response"`
// OperatorStatus defines the observed state of Operator
type OperatorStatus struct {
	// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
	// Important: Run "make" to regenerate code after modifying this file
	Nodes []string `json:"nodes"`

// Operator is the Schema for the operators API
// +kubebuilder:subresource:status
type Operator struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   OperatorSpec   `json:"spec,omitempty"`
	Status OperatorStatus `json:"status,omitempty"`

After modifying the _types.go file you always need to run the following command to update the generated code for that resource type:

$ make generate 
/home/ubuntu/.go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
  • Generate Custom Resource Definition (CRD) manifests

In the previous step we defined the API with spec and status fields of the CRD manifests, which can be generated and updated with the following command:

$ make manifests
/home/ubuntu/.go/bin/controller-gen "crd:trivialVersions=true" rbac:roleName=manager-role webhook paths="./..." output:crd:artifacts:config=config/crd/bases

This makefile will invoke the controller-gen to generate the CRD manifests at config/crd/bases/app.helloworld.io_operators.yaml and below you see my custom resource example for the operator:

kind: Operator
  name: operator-sample
  size: 1
  response: "Hello, World!"
  image: ""
  • Controller

In the beginning when I created the API, the operator-sdk automatically created the controller file for me at controllers/operator_controller.go which we now start to modify and add the Golang code. I will not go into every detail because the different resources you will create will all look very similar and repeat like you will see in example code. I will mainly focus on the Deployment for my Helloworld container image which I want to deploy using the operator.

Let’s start looking at the deploymentForOperator function which defines and returns the Kubernetes Deployment object. You see there that I invoke an imported Golang packages like &appsv1.Deployment and the import is defined at the top of the controller file. You can find details about this in the Go Doc reference:

// deploymentForOperator returns a operator Deployment object
func (r *OperatorReconciler) deploymentForOperator(m *appv1alpha1.Operator) *appsv1.Deployment {
	ls := labelsForOperator(m.Name)
	replicas := m.Spec.Size

	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      m.Name,
			Namespace: m.Namespace,
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{
				MatchLabels: ls,
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: ls,
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Image:           m.Spec.Image,
						ImagePullPolicy: "Always",
						Name:            "helloworld",
						Ports: []corev1.ContainerPort{{
							ContainerPort: 8080,
							Name:          "operator",
						Env: []corev1.EnvVar{{
							Name:  "RESPONSE",
							Value: m.Spec.Response,
						EnvFrom: []corev1.EnvFromSource{{
							ConfigMapRef: &corev1.ConfigMapEnvSource{
								LocalObjectReference: corev1.LocalObjectReference{
									Name: m.Name,
						VolumeMounts: []corev1.VolumeMount{{
							Name:      m.Name,
							ReadOnly:  true,
							MountPath: "/helloworld/",
					Volumes: []corev1.Volume{{
						Name: m.Name,
						VolumeSource: corev1.VolumeSource{
							ConfigMap: &corev1.ConfigMapVolumeSource{
								LocalObjectReference: corev1.LocalObjectReference{
									Name: m.Name,

	// Set Operator instance as the owner and controller
	ctrl.SetControllerReference(m, dep, r.Scheme)
	return dep

We have defined the deploymentForOperator function and now we can look into the Reconcile function and add the step to check if the deployment already exists and, if not, to create the new deployment:

// Check if the deployment already exists, if not create a new one
found := &appsv1.Deployment{}
err = r.Get(ctx, types.NamespacedName{Name: operator.Name, Namespace: operator.Namespace}, found)
if err != nil && errors.IsNotFound(err) {
	// Define a new deployment
	dep := r.deploymentForOperator(operator)
	log.Info("Creating a new Deployment", "Deployment.Namespace", dep.Namespace, "Deployment.Name", dep.Name)
	err = r.Create(ctx, dep)
	if err != nil {
		log.Error(err, "Failed to create new Deployment", "Deployment.Namespace", dep.Namespace, "Deployment.Name", dep.Name)
		return ctrl.Result{}, err
	// Deployment created successfully - return and requeue
	return ctrl.Result{Requeue: true}, nil
} else if err != nil {
	log.Error(err, "Failed to get Deployment")
	return ctrl.Result{}, err

Unfortunately this isn’t enough because this will only check if the deployment exists or not and create a new deployment, but it will not update the deployment if the custom resource is changed.

We need to add two more steps to check if the created Deployment Spec.Template matches the Spec.Template from the  deploymentForOperator function and the Deployment Spec.Replicas the defined size from the custom resource. I will make use of the defined variable found := &appsv1.Deployment{} from the previous step when I checked if the deployment exists.

// Check if the deployment Spec.Template, matches the found Spec.Template
deploy := r.deploymentForOperator(operator)
if !equality.Semantic.DeepDerivative(deploy.Spec.Template, found.Spec.Template) {
	found = deploy
	log.Info("Updating Deployment", "Deployment.Namespace", found.Namespace, "Deployment.Name", found.Name)
	err := r.Update(ctx, found)
	if err != nil {
		log.Error(err, "Failed to update Deployment", "Deployment.Namespace", found.Namespace, "Deployment.Name", found.Name)
		return ctrl.Result{}, err
	return ctrl.Result{Requeue: true}, nil

// Ensure the deployment size is the same as the spec
size := operator.Spec.Size
if *found.Spec.Replicas != size {
	found.Spec.Replicas = &size
	err = r.Update(ctx, found)
	if err != nil {
		log.Error(err, "Failed to update Deployment", "Deployment.Namespace", found.Namespace, "Deployment.Name", found.Name)
		return ctrl.Result{}, err
	// Spec updated - return and requeue
	return ctrl.Result{Requeue: true}, nil

The SetupWithManager() function in controllers/operator_controller.go specifies how the controller is built to watch a custom resource and other resources that are owned and managed by that controller.

func (r *OperatorReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).

Basically that’s all I need to write for the controller to deploy my Helloworld container image using an Kubernetes operator. In my code example you will find that I also create a Kubernetes Service, Ingress and ConfigMap but you see that this mostly repeats what I have done with the Deployment object.

  • RBAC permissions

Before we can start running the operator, we need to define the RBAC permissions the controller needs to interact with the resources it manages otherwise your controller will not work. These are specified via [RBAC markers] like these:

// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=configmaps,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=pods,verbs=get;list;watch

The ClusterRole manifest at config/rbac/role.yaml is generated from the above markers via controller-gen with the following command:

$ make manifests 
/home/ubuntu/.go/bin/controller-gen "crd:trivialVersions=true" rbac:roleName=manager-role webhook paths="./..." output:crd:artifacts:config=config/crd/bases
  • Running the Operator

We need a Kubernetes cluster and admin privileges to run the operator. I will use Kind which will run a lightweight Kubernetes cluster in your local Docker engine, which is all I need to run and test my Helloworld operator:

$ ./scripts/ 
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.19.1) 🖼 
 ✓ Preparing nodes 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a question, bug, or feature request? Let us know! 🙂

Before running the operator the custom resource Definition must be registered with the Kubernetes apiserver:

$ make install
/home/ubuntu/.go/bin/controller-gen "crd:trivialVersions=true" rbac:roleName=manager-role webhook paths="./..." output:crd:artifacts:config=config/crd/bases
/usr/bin/kustomize build config/crd | kubectl apply -f -
Warning: CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use CustomResourceDefinition created

We can now run the operator locally on my workstation:

$ make run
/home/ubuntu/.go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
/home/ubuntu/.go/bin/controller-gen "crd:trivialVersions=true" rbac:roleName=manager-role webhook paths="./..." output:crd:artifacts:config=config/crd/bases
go run ./main.go
2020-11-22T18:12:49.023Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2020-11-22T18:12:49.024Z	INFO	setup	starting manager
2020-11-22T18:12:49.025Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
2020-11-22T18:12:49.025Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator", "source": "kind source: /, Kind="}
2020-11-22T18:12:49.126Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator", "source": "kind source: /, Kind="}
2020-11-22T18:12:49.226Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator", "source": "kind source: /, Kind="}
2020-11-22T18:12:49.327Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator", "source": "kind source: /, Kind="}
2020-11-22T18:12:49.428Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator", "source": "kind source: /, Kind="}
2020-11-22T18:12:49.528Z	INFO	controller	Starting Controller	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator"}
2020-11-22T18:12:49.528Z	INFO	controller	Starting workers	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator", "worker count": 1}

Let’s open a new terminal and apply the custom resource example:

$ kubectl apply -f config/samples/app_v1alpha1_operator.yaml created

Going back to the terminal where the operator is running, you see the log messages that it invoke the different functions to deploy the defined resource objects:

2020-11-22T18:15:30.412Z	INFO	controllers.Operator	Creating a new Deployment	{"operator": "default/operator-sample", "Deployment.Namespace": "default", "Deployment.Name": "operator-sample"}
2020-11-22T18:15:30.446Z	INFO	controllers.Operator	Creating a new ConfigMap	{"operator": "default/operator-sample", "ConfigMap.Namespace": "default", "ConfigMap.Name": "operator-sample"}
2020-11-22T18:15:30.453Z	INFO	controllers.Operator	Creating a new Service	{"operator": "default/operator-sample", "Service.Namespace": "default", "Service.Name": "operator-sample"}
2020-11-22T18:15:30.470Z	INFO	controllers.Operator	Creating a new Ingress	{"operator": "default/operator-sample", "Ingress.Namespace": "default", "Ingress.Name": "operator-sample"}
2020-11-22T18:15:30.927Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator", "name": "operator-sample", "namespace": "default"}
2020-11-22T18:15:30.927Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator", "name": "operator-sample", "namespace": "default"}
2020-11-22T18:15:33.776Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator", "name": "operator-sample", "namespace": "default"}
2020-11-22T18:15:35.181Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "", "reconcilerKind": "Operator", "controller": "operator", "name": "operator-sample", "namespace": "default"}

In the default namespace where I applied the custom resource you will see the deployed resources by the operator:

$ kubectl get 
NAME              AGE
operator-sample   6m11s
$ kubectl get all
NAME                                   READY   STATUS    RESTARTS   AGE
pod/operator-sample-767897c4b9-8zwsd   1/1     Running   0          2m59s

NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/kubernetes        ClusterIP               443/TCP    29m
service/operator-sample   ClusterIP           8080/TCP   2m59s

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/operator-sample   1/1     1            1           2m59s

NAME                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/operator-sample-767897c4b9   1         1         1       2m59s

There is not much else to do other than to build the operator image and push to an image registry so that I can run the operator on a Kubernetes cluster.

$ make docker-build
$ make docker-push
$ kustomize build config/default | kubectl apply -f -

I hope this article is useful for getting you started on writing your own Kubernetes operator in Golang.

Getting started with OpenShift Hive

If you don’t know OpenShift Hive I recommend having a look at the video of my talk at RedHat OpenShift Commons about OpenShift Hive where I also talk about how you can provision and manage the lifecycle of OpenShift 4 clusters using the Kubernetes API and the OpenShift Hive operator.

The Hive operator has three main components the admission controller,  the Hive controller and the Hive operator itself. For more information about the Hive architecture visit the Hive docs:

You can use an OpenShift or native Kubernetes cluster to run the operator, in my case I use a EKS cluster. Let’s go through the prerequisites which are required to generate the manifests and the hiveutil:

$ curl -s "\
> kubernetes-sigs/kustomize/master/hack/"  | bash
$ sudo mv ./kustomize /usr/bin/
$ wget
$ tar -xvf go1.13.3.linux-amd64.tar.gz
$ sudo mv go /usr/local

To setup the Go environment copy the content below and add to your .profile:

export GOPATH="${HOME}/.go"
export PATH="$PATH:/usr/local/go/bin"
export PATH="$PATH:${GOPATH}/bin:${GOROOT}/bin"

Continue with installing the Go dependencies and clone the OpenShift Hive Github repository:

$ mkdir -p ~/.go/src/
$ go get
$ go get
$ go get
$ go get
$ cd ~/.go/src/
$ git clone
$ cd hive/
$ git checkout remotes/origin/master

Before we run make deploy I would recommend modifying the Makefile that we only generate the Hive manifests without deploying them to Kubernetes:

$ sed -i -e 's#oc apply -f config/crds# #' -e 's#kustomize build overlays/deploy | oc apply -f -#kustomize build overlays/deploy > hive.yaml#' Makefile
$ make deploy
# The apis-path is explicitly specified so that CRDs are not created for v1alpha1
go run tools/vendor/ crd --apis-path=pkg/apis/hive/v1
CRD files generated, files can be found under path /home/ubuntu/.go/src/
go generate ./pkg/... ./cmd/...
# Deploy the operator manifests:
mkdir -p overlays/deploy
cp overlays/template/kustomization.yaml overlays/deploy
cd overlays/deploy && kustomize edit set image
kustomize build overlays/deploy > hive.yaml
rm -rf overlays/deploy

Quick look at the content of the hive.yaml manifest:

$ cat hive.yaml
apiVersion: v1
kind: Namespace
  name: hive
apiVersion: v1
kind: ServiceAccount
  name: hive-operator
  namespace: hive


apiVersion: apps/v1
kind: Deployment
    control-plane: hive-operator "1.0"
  name: hive-operator
  namespace: hive
  replicas: 1
  revisionHistoryLimit: 4
      control-plane: hive-operator "1.0"
        control-plane: hive-operator "1.0"
      - command:
        - /opt/services/hive-operator
        - --log-level
        - info
        - name: CLI_CACHE_DIR
          value: /var/cache/kubectl
        imagePullPolicy: Always
          failureThreshold: 1
            path: /debug/health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
        name: hive-operator
            cpu: 100m
            memory: 256Mi
        - mountPath: /var/cache/kubectl
          name: kubectl-cache
      serviceAccountName: hive-operator
      terminationGracePeriodSeconds: 10
      - emptyDir: {}
        name: kubectl-cache

Now we can apply the Hive custom resource definition (crds):

$ kubectl apply -f ./config/crds/ created created created created created created created created created created created created created created

And continue to apply the hive.yaml manifest for deploying the OpenShift Hive operator and its components:

$ kubectl apply -f hive.yaml
namespace/hive created
serviceaccount/hive-operator created created created created created created created created created created created created
deployment.apps/hive-operator created

For the Hive admission controller you need to generate a SSL certifcate:

$ ./hack/
~/Dropbox/hive/hiveadmission-certs ~/Dropbox/hive
2020/02/03 22:17:30 [INFO] generate received request
2020/02/03 22:17:30 [INFO] received CSR
2020/02/03 22:17:30 [INFO] generating key: ecdsa-256
2020/02/03 22:17:30 [INFO] encoded CSR configured approved
secret/hiveadmission-serving-cert created

Afterwards we can check if all the pods are running, this might take a few seconds:

$ kubectl get pods -n hive
NAME                                READY   STATUS    RESTARTS   AGE
hive-controllers-7c6ccc84b9-q7k7m   1/1     Running   0          31s
hive-operator-f9f4447fd-jbmkh       1/1     Running   0          55s
hiveadmission-6766c5bc6f-9667g      1/1     Running   0          27s
hiveadmission-6766c5bc6f-gvvlq      1/1     Running   0          27s

The Hive operator is successfully installed on your Kubernetes cluster but we are not finished yet. To create the required Cluster Deployment manifests we need to generate the hiveutil binary:

$ make hiveutil
go generate ./pkg/... ./cmd/...
go build -o bin/hiveutil

To generate Hive Cluster Deployment manifests just run the following hiveutil command below, I output the definition with -o into yaml:

$ bin/hiveutil create-cluster --cloud=aws mycluster -o yaml
apiVersion: v1
- apiVersion:
  kind: ClusterImageSet
    creationTimestamp: null
    name: mycluster-imageset
  status: {}
- apiVersion: v1
  kind: Secret
    creationTimestamp: null
    name: mycluster-aws-creds
    aws_access_key_id: <-YOUR-AWS-ACCESS-KEY->
    aws_secret_access_key: <-YOUR-AWS-SECRET-KEY->
  type: Opaque
- apiVersion: v1
    install-config.yaml: <-BASE64-ENCODED-OPENSHIFT4-INSTALL-CONFIG->
  kind: Secret
    creationTimestamp: null
    name: mycluster-install-config
  type: Opaque
- apiVersion:
  kind: ClusterDeployment
    creationTimestamp: null
    name: mycluster
    clusterName: mycluster
      servingCertificates: {}
    installed: false
          name: mycluster-aws-creds
        region: us-east-1
        name: mycluster-imageset
        name: mycluster-install-config
      availableUpdates: null
        force: false
        image: ""
        version: ""
      observedGeneration: 0
      versionHash: ""
- apiVersion:
  kind: MachinePool
    creationTimestamp: null
    name: mycluster-worker
      name: mycluster
    name: worker
          iops: 100
          size: 22
          type: gp2
        type: m4.xlarge
    replicas: 3
    replicas: 0
kind: List
metadata: {}

I hope this post is useful in getting you started with OpenShift Hive. In my next article I will go through the details of the OpenShift 4 cluster deployment with Hive.

Read my new article about OpenShift / OKD 4.x Cluster Deployment using OpenShift Hive

OpenShift Hive – API driven OpenShift cluster provisioning and management operator

RedHat invited me and my colleague Matt to speak at RedHat OpenShift Commons in London about the API driven OpenShift cluster provisioning and management operator called OpenShift Hive. We have been using OpenShift Hive for the past few months to provision and manage the OpenShift 4 estate across multiple environments. Below the video recording of our talk at OpenShift Commons London:

The Hive operator requires to run on a separate Kubernetes cluster to centrally provision and manage the OpenShift 4 clusters. With Hive you can manage hundreds of cluster deployments and configuration with a single operator. There is nothing required on the OpenShift 4 clusters itself, Hive only requires access to the cluster API:

The ClusterDeployment custom resource is the definition for the cluster specs, similar to the openshift-installer install-config where you define cluster specifications, cloud credential and image pull secrets. Below is an example of the ClusterDeployment manifest:

kind: ClusterDeployment
  name: mycluster
  namespace: mynamespace
  clusterName: mycluster
        name: mycluster-aws-creds
      region: eu-west-1
      name: openshift-v4.3.0
      name: mycluster-install-config
      name: mycluster-ssh-key
    name: mycluster-pull-secret

The SyncSet custom resource is defining the configuration and is able to regularly reconcile the manifests to keep all clusters synchronised. With SyncSets you can apply resources and patches as you see in the example below:

kind: SyncSet
  name: mygroup
  - name: ClusterName
  resourceApplyMode: Upsert
  - apiVersion:
    kind: Group
      name: mygroup
    - myuser
  - kind: ConfigMap
    apiVersion: v1
    name: foo
    namespace: default
    patch: |-
      { "data": { "foo": "new-bar" } }
    patchType: merge
  - source:
      name: ad-bind-password
      namespace: default
      name: ad-bind-password
      namespace: openshift-config

Depending of the amount of resource and patches you want to apply, a SyncSet can get pretty large and is not very easy to manage. My colleague Matt wrote a SyncSet Generator, please check this Github repository.

In one of my next articles I will go into more detail on how to deploy OpenShift Hive and I’ll provide more examples of how to use ClusterDeployment and SyncSets. In the meantime please check out the OpenShift Hive repository for more details, additionally here are links to the Hive documentation on using Hive and Syncsets.

Read my new article about installing OpenShift Hive.

OpenShift Container Platform Troubleshooting Guide

On the first look OpenShift/Kubernetes seems like a very complex platform but once you start to get to know the different components and what they are doing, you will see it gets easier and easier. The purpose of this article to give you an every day guide based on my experience on how to successfully troubleshoot issues on OpenShift.

  • OpenShift service logging
# OpenShift 3.1 to OpenShift 3.9:

# OpenShift 3.10 and later versions:
/etc/origin/master/master.env # for API and Controllers

The log levels for the OpenShift services can be controlled via the –loglevel parameter in the service options.

0 – Errors and warnings only
2 – Normal information
4 – Debugging information
6 – API- debugging information (request / response)
8 – Body API debugging information

For example add or edit the line in /etc/sysconfig/atomic-openshift-node to OPTIONS=’–loglevel=4′ and afterward restart the service with systemctl restart atomic-openshift-node.

Viewing OpenShift service logs:

# OpenShift 3.1 to OpenShift 3.9:
journalctl -u atomic-openshift-master-api
journalctl -u atomic-openshift-master-controllers
journalctl -u atomic-openshift-node
journalctl -u etcd # or 'etcd_container' for containerized install

# OpenShift 3.10 and later versions:
/usr/local/bin/master-logs api api
/usr/local/bin/master-logs controllers controllers
/usr/local/bin/master-logs etcd etcd
journalctl -u atomic-openshift-node
  • Docker service logging

Change the docker daemon log level and add the parameter –log-level for the OPTIONS variable in dockers service file located in /etc/sysconfig/docker.

The available log levels are: ( debug, info, warn, error, fatal )

See the example below; to enable debug logging in /etc/sysconfig/docker to set log level equal to debug (After making the changes on the docker service you need to will restart with systemctl restart docker.):

OPTIONS='--insecure-registry= --selinux-enabled --log-level=debug'
  • OC command logging

The oc and oadm command also accept a loglevel option that can help get additional information. Value between 6 and 8 will provide extensive logging, API requests (loglevel 6), API headers (loglevel 7) and API responses received (loglevel 8):

oc whoami --loglevel=8
  • OpenShift SkyDNS

SkyDNS is the internal service discovery for OpenShift and DNS is important for OpenShift to function:

# Test full qualified cluster domain name
nslookup docker-registry.default.svc.cluster.local
# OR
dig +short docker-registry.default.svc.cluster.local

# Check if clusterip match the previous result
oc get svc/docker-registry -n default

# Test short name
nslookup docker-registry.default.svc
nslookup <endpoint-name>.<project-name>.svc

If short name doesn’t work look out if cluster.local is missing in dns search suffix. If resolution doesn’t work at all before enable debug logging, check if Dnsmasq service running and correctly configured. OpenShift uses a dispatcher script to maintain the DNS configuration of a node.

Add the options “–logspec ‘dns=10’” to the /etc/sysconfig/atomic-openshift-node service configuration on a node running skydns and restart the atomic-openshift-node service afterwards. There will then be skydns debug information in the journalctl logs.

OPTIONS="--loglevel=2 --logspec dns*=10"
  • OpenShift Master API and Web Console

In the following example, the is used by the internal cluster, and the is used by external clients

# Run the following commands on any node host
curl -k

# The OpenShift API service runs on all master instances. To see the status of the service, view the master-api pods in the kube-system project:
oc get pod -n kube-system -l
oc get pod -n kube-system -o wide
curl -k --insecure https://$HOSTNAME:8443/healthz
  • OpenShift Controller role

The OpenShift Container Platform controller service is available on all master nodes. The service runs in active/passive mode, which means it should only be running on one master.

# Verify the master host running the controller service
oc get -n kube-system cm openshift-master-controllers -o yaml
    • OpenShift Certificates

During the installation of OpenShift the playbooks generates a CA to sign every certificate in the cluster. One of the most common issues are expired node certificates. Below are a list of important certificate files:

# Is the OpenShift Certificate Authority, and it signs every other certificate unless specified otherwise.

# Contains a bundle with the current and the old CA's (if exists) to trust them all. If there has been only one ca.crt, then this file is the same as ca.crt.

# The internal API, also known as cluster internal address or the variable masterURL here all the internal components authenticates to access the API, such as nodes, routers and other services.

# Master controller certificate authenticates to kubernetes as a client using the admin.kubeconfig

# Node certificates
/etc/origin/node/ca.crt                   # to be able to trust the API, a copy of masters CA bundle is placed in:
/etc/origin/node/server.crt               # to secure this communication
/etc/origin/node/system:node:{fqdn}.crt   # Nodes needs to authenticate to the Kubernetes API as a client. 

# Etcd certificates
/etc/etcd/ca.crt                          # is the etcd CA, it is used to sign every certificate.
/etc/etcd/server.crt                      # is used by the etcd to listen to clients.
/etc/etcd/peer.crt                        # is used by etcd to authenticate as a client.

# Master certificates to auth to etcd
/etc/origin/master/master.etcd-ca.crt     # is a copy of /etc/etcd/ca.crt. Used to trust the etcd cluster.
/etc/origin/master/master.etcd-client.crt # is used to authenticate as a client of the etcd cluster.

# Services ca certificate. All self-signed internal certificates are signed by this CA.

Here’s an example to check the validity of the master server certificate:

cat /etc/origin/master/master.server.crt | openssl x509 -text | grep -i Validity -A2
# OR
openssl x509 -enddate -noout -in /etc/origin/master/master.server.crt

It’s worth checking the documentation about how to re-deploy certificates on OpenShift.

  • OpenShift etcd

On the etcd node (master) set source to etcd.conf file for most of the needed variables.

source /etc/etcd/etcd.conf
export ETCDCTL_API=3

# Set endpoint variable to include all etcd endpoints
ETCD_ALL_ENDPOINTS=` etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`

# Cluster status and health checks
etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table  member list
etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_ALL_ENDPOINTS  --write-out=table endpoint status
etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_ALL_ENDPOINTS endpoint health

Check etcd database key entries:

etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints="https://$(hostname):2379" get / --prefix --keys-only
  • OpenShift Registry

To get detailed information about the pods running the internal registry run the following command:

oc get pods -n default | grep registry | awk '{ print $1 }' | xargs -i oc describe pod {}

For a basic health check that the internal registry is running and responding you need to “curl” the /healthz path. Normally this should return a 200 HTTP response:

Registry=$(oc get svc docker-registry -n default -o 'jsonpath={.spec.clusterIP}:{.spec.ports[0].port}')

curl -vk $Registry/healthz
# OR
curl -vk https://$Registry/healthz

If a persistent volume is attached to the registry make sure that the registry can write to the volume.

oc project default 
oc rsh `oc get pods -o name -l docker-registry`

$ touch /registry/test-file
$ ls -la /registry/ 
$ rm /registry/test-file
$ exit

If the registry is insecure make sure you have edited the /etc/sysconfig/docker file and add –insecure-registry to the OPTIONS parameter on the nodes.

For more information about testing the internal registry please have a look at the documentation about Accessing the Registry.

  • OpenShift Router 

To increase the log level for OpenShift router pod, set loglevel=4 in the container args:

# Increase logging level
oc patch dc -n default router -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value":["--loglevel=4"]}]' --type=json 

# View logs
oc logs <router-pod-name> -n default

# Remove logging change 
oc patch dc -n default router -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/args", "value":["--loglevel=4"]}]' --type=json

OpenShift router image version 3.3 and later, the logging for http requests can be forwarded to an external syslog server:

oc set env dc/router ROUTER_SYSLOG_ADDRESS=<syslog-server-ip> ROUTER_LOG_LEVEL=debug

If you are facing issues with ingress routes to your application run the below command to collect more information:

oc logs dc/router -n default
oc get dc/router -o yaml -n default
oc get route <route-name> -n <project-name> 
oc get endpoints --all-namespaces 
oc exec -it <router-pod-name> -- ls -la 
oc exec -it <router-pod-name> -- find /var/lib/haproxy -regex ".*\(.map\|config.*\|.json\)" -print -exec cat {} \; > haproxy_configs_and_maps

Check if your application domain is / and dig for an ANSWER containing the load balancer VIP address:

dig \*

Confirm that certificates are being severed out correctly by running the following:

echo -n | openssl s_client -connect :443 -servername 2>&1 | openssl x509 -noout -text
curl -kv 
  • OpenShift SDN

Please checkout the official Troubleshooting OpenShift SDN documentation

To get OpenFlow table export, connect to the openvswitch container and run following command:

docker exec openvswitch ovs-ofctl -O OpenFlow13 dump-flows br0
  • OpenShift Namespace events

Useful to collect events from the namespace to identify pod creation issues before you did in the container logs:

oc get events [-n |--all-namespaces]

In the default namespace you find relevant events for monitoring or auditing a cluster, such as Node and resource events related to the OpenShift platform.

  • OpenShift Pod and Container Logs

Container/pod logs can be viewed using the OpenShift oc command line. Add option “-p” to print the logs for the previous instance of the container in a pod if it exists and add option “-f” to stream the logs:

oc logs <pod-name> [-f]

The logs are saved to the worker nodes disk where the container/pod is running and it is located at:

For setting the log file limits for containers on a worker node the –log-opt can be configured with max-size and max-file so that a containers logs are rolled over:

# cat /etc/sysconfig/docker 
OPTIONS='--insecure-registry= --selinux-enabled --log-opt max-size=50m --log-opt max-file=5'

# Restart docker service for the changes to take effect.
systemctl restart docker 

To remove all logs from a given container run the following commands:

cat /dev/null > /var/lib/docker/containers/<container-id>/<container-id>-json.log
# OR
cat /dev/null >  $(docker inspect --format='{{.LogPath}}' <container-id> )

To generate a list of the largest files run the following commands:

# Log files
find /var/lib/docker/ -name "*.log" -exec ls -sh {} \; | sort -n -r | head -20

# All container files
du -aSh /var/lib/docker/ | sort -n -r | head -n 10

Finding out the veth# interface of a docker container and use tcpdump to capture traffic more easily. The iflink of the container is the same as the ifindex of the veth#. You can get the iflink of the container as follows:

docker exec -it <container-name>  bash -c 'cat /sys/class/net/eth0/iflink'

# Let's say that the results in 14, then grep for 14
grep -l 14 /sys/class/net/veth*/ifindex

# Which will give a unique result on the worker node

Here a simple bash script to get the container and veth id’s:

for container in $(docker ps -q); do
    iflink=`docker exec -it $container bash -c 'cat /sys/class/net/eth0/iflink'`
    iflink=`echo $iflink|tr -d '\r'`
    veth=`grep -l $iflink /sys/class/net/veth*/ifindex`
    veth=`echo $veth|sed -e 's;^.*net/\(.*\)/ifindex$;\1;'`
    echo $container:$veth
  • OpenShift Builder Pod Logs

If you want to troubleshoot a particular build of “myapp” you can view logs with:

oc logs [bc/|dc/]<name> [-f]

To increase the logging level add a BUILD_LOGLEVEL environment variable to the BuildConfig strategy:

    - name: "BUILD_LOGLEVEL"
      value: "5"

I hope you found this article useful and that it helped you troubleshoot OpenShift. Please let me know what you think and leave a comment.

Deploy OpenShift 3.9 Container Platform using Terraform and Ansible on Amazon AWS

After my previous articles on OpenShift and Terraform I wanted to show how to create the necessary infrastructure and to deploy an OpenShift Container Platform in a more real-world scenario. I highly recommend reading my other posts about using Terraform to deploy an Amazon AWS VPC and AWS EC2 Instances and Load Balancers. Once the infrastructure is created we will use the Bastion Host to connect to the environment and deploy OpenShift Origin using Ansible.

I think this might be an interesting topic to show what tools like Terraform and Ansible can do together:

I will not go into detail about the configuration and only show the output of deploying the infrastructure. Please checkout my Github repository to see the detailed configuration:

Before we start you need to clone the repository and generate the ssh key used from the bastion host to access the OpenShift nodes:

git clone
cd ./openshift-terraform/
ssh-keygen -b 2048 -t rsa -f ./helper_scripts/id_rsa -q -N ""
chmod 600 ./helper_scripts/id_rsa

We are ready to create the infrastructure and run terraform apply:

[email protected]:~/openshift-terraform$ terraform apply


Plan: 56 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes


Apply complete! Resources: 19 added, 0 changed, 16 destroyed.


bastion =
openshift master =
openshift subdomain =
[email protected]:~/openshift-terraform$

Terraform successfully creates the VPC, load balancers and all needed instances. Before we continue wait 5 to 10 minutes because the cloud-init script takes a bit time and all the instance reboot at the end.


Security groups:

Target groups for the Master and the Infra load balancers:

Master and the Infra load balancers:

Terraform also automatically creates the inventory file for the OpenShift installation and adds the hostnames for master, infra and worker nodes to the correct inventory groups. The next step is to copy the private ssh key and the inventory file to the bastion host. I am using the terraform output command to get the public hostname from the bastion host:

scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -r ./helper_scripts/id_rsa [email protected]$(terraform output bastion):/home/centos/.ssh/
scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -r ./inventory/ansible-hosts  [email protected]$(terraform output bastion):/home/centos/ansible-hosts
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -l centos $(terraform output bastion)

On the bastion node, change to the /openshift-ansible/ folder and start running the prerequisites and the deploy-cluster playbooks:

cd /openshift-ansible/
ansible-playbook ./playbooks/prerequisites.yml -i ~/ansible-hosts
ansible-playbook ./playbooks/deploy_cluster.yml -i ~/ansible-hosts

Here the output from running the prerequisites playbook:

[[email protected] ~]$ cd /openshift-ansible/
[[email protected] openshift-ansible]$ ansible-playbook ./playbooks/prerequisites.yml -i ~/ansible-hosts

PLAY [Initialization Checkpoint Start] ****************************************************************************************************************************

TASK [Set install initialization 'In Progress'] *******************************************************************************************************************
Saturday 15 September 2018  11:04:50 +0000 (0:00:00.407)       0:00:00.407 ****
ok: []

PLAY [Populate config host groups] ********************************************************************************************************************************

TASK [Load group name mapping variables] **************************************************************************************************************************
Saturday 15 September 2018  11:04:50 +0000 (0:00:00.110)       0:00:00.517 ****
ok: [localhost]

TASK [Evaluate groups - g_etcd_hosts or g_new_etcd_hosts required] ************************************************************************************************
Saturday 15 September 2018  11:04:51 +0000 (0:00:00.033)       0:00:00.551 ****
skipping: [localhost]

TASK [Evaluate groups - g_master_hosts or g_new_master_hosts required] ********************************************************************************************
Saturday 15 September 2018  11:04:51 +0000 (0:00:00.024)       0:00:00.575 ****
skipping: [localhost]

TASK [Evaluate groups - g_node_hosts or g_new_node_hosts required] ************************************************************************************************
Saturday 15 September 2018  11:04:51 +0000 (0:00:00.024)       0:00:00.599 ****
skipping: [localhost]


PLAY RECAP ******************************************************************************************************************************************************** : ok=56   changed=14   unreachable=0    failed=0 : ok=64   changed=15   unreachable=0    failed=0 : ok=56   changed=14   unreachable=0    failed=0 : ok=56   changed=14   unreachable=0    failed=0 : ok=58   changed=14   unreachable=0    failed=0 : ok=56   changed=14   unreachable=0    failed=0 : ok=56   changed=14   unreachable=0    failed=0 : ok=58   changed=14   unreachable=0    failed=0 : ok=56   changed=14   unreachable=0    failed=0
localhost                  : ok=11   changed=0    unreachable=0    failed=0

INSTALLER STATUS **************************************************************************************************************************************************
Initialization             : Complete (0:00:41)

[[email protected] openshift-ansible]$

Continue with the deploy cluster playbook:

[[email protected] openshift-ansible]$ ansible-playbook ./playbooks/deploy_cluster.yml -i ~/ansible-hosts

PLAY [Initialization Checkpoint Start] ****************************************************************************************************************************

TASK [Set install initialization 'In Progress'] *******************************************************************************************************************
Saturday 15 September 2018  11:08:38 +0000 (0:00:00.102)       0:00:00.102 ****
ok: []

PLAY [Populate config host groups] ********************************************************************************************************************************

TASK [Load group name mapping variables] **************************************************************************************************************************
Saturday 15 September 2018  11:08:38 +0000 (0:00:00.064)       0:00:00.167 ****
ok: [localhost]

TASK [Evaluate groups - g_etcd_hosts or g_new_etcd_hosts required] ************************************************************************************************
Saturday 15 September 2018  11:08:38 +0000 (0:00:00.031)       0:00:00.198 ****
skipping: [localhost]

TASK [Evaluate groups - g_master_hosts or g_new_master_hosts required] ********************************************************************************************
Saturday 15 September 2018  11:08:38 +0000 (0:00:00.026)       0:00:00.225 ****
skipping: [localhost]


PLAY RECAP ******************************************************************************************************************************************************** : ok=132  changed=57   unreachable=0    failed=0 : ok=591  changed=256  unreachable=0    failed=0 : ok=132  changed=57   unreachable=0    failed=0 : ok=132  changed=57   unreachable=0    failed=0 : ok=325  changed=145  unreachable=0    failed=0 : ok=132  changed=57   unreachable=0    failed=0 : ok=132  changed=57   unreachable=0    failed=0 : ok=325  changed=145  unreachable=0    failed=0 : ok=132  changed=57   unreachable=0    failed=0
localhost                  : ok=13   changed=0    unreachable=0    failed=0

INSTALLER STATUS **************************************************************************************************************************************************
Initialization             : Complete (0:00:55)
Health Check               : Complete (0:00:01)
etcd Install               : Complete (0:01:03)
Master Install             : Complete (0:05:17)
Master Additional Install  : Complete (0:00:26)
Node Install               : Complete (0:08:24)
Hosted Install             : Complete (0:00:57)
Web Console Install        : Complete (0:00:28)
Service Catalog Install    : Complete (0:01:19)

[[email protected] openshift-ansible]$

Once the deploy playbook finishes we have a working Openshift cluster:

Login with username: demo, and password: demo

For the infra load balancers you cannot access OpenShift routes via the Amazon DNS, this is not allowed. You need to create a wildcard DNS CNAME record like * and point to the AWS load balancer DNS record.

Let’s continue to do some basic cluster checks to see the nodes are in ready state:

[[email protected] ~]$ oc get nodes
NAME                                       STATUS    ROLES     AGE       VERSION   Ready     compute   11m       v1.9.1+a0ce1bc657   Ready     master    16m       v1.9.1+a0ce1bc657   Ready         11m       v1.9.1+a0ce1bc657   Ready     compute   11m       v1.9.1+a0ce1bc657   Ready     master    15m       v1.9.1+a0ce1bc657    Ready         11m       v1.9.1+a0ce1bc657   Ready     compute   11m       v1.9.1+a0ce1bc657    Ready     master    14m       v1.9.1+a0ce1bc657    Ready         11m       v1.9.1+a0ce1bc657
[[email protected] ~]$
[[email protected] ~]$ oc get projects
NAME                                DISPLAY NAME   STATUS
default                                            Active
kube-public                                        Active
kube-service-catalog                               Active
kube-system                                        Active
logging                                            Active
management-infra                                   Active
openshift                                          Active
openshift-ansible-service-broker                   Active
openshift-infra                                    Active
openshift-node                                     Active
openshift-template-service-broker                  Active
openshift-web-console                              Active
[[email protected] ~]$
[[email protected] ~]$ oc get pods -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP           NODE
docker-registry-1-8798r    1/1       Running   0          10m
registry-console-1-zh9m4   1/1       Running   0          10m
router-1-96zzf             1/1       Running   0          10m
router-1-nfh7h             1/1       Running   0          10m
router-1-pcs68             1/1       Running   0          10m
[[email protected] ~]$

At the end just destroy the infrastructure with terraform destroy:

[email protected]:~/openshift-terraform$ terraform destroy


Destroy complete! Resources: 56 destroyed.
[email protected]:~/openshift-terraform$

I will continue improving the configuration and I plan to use Jenkins to deploy the AWS infrastructure and OpenShift fully automatically.

Please let me know if you like the article or have questions in the comments below.