Kubernetes Operators: Custom Kubernetes Controller Guide

Picture this: your StatefulSet starts clean. Probes pass. Logs look fine. And then, somewhere in that fifteen-second window between "Running" and "Ready," your app tries to connect to a database that's still mid-initialization.

Three hours go into it. Grep logs. Adjust probe timings. Restart pods in a different order and watch hopefully. Eventually the truth lands: Kubernetes had zero knowledge that these two services had an ordering relationship at all. It was behaving perfectly. Scheduling containers. Restarting failures. Updating status. The rule that said don't serve traffic until the migration finishes lived in a Slack thread from eight months ago. Not anywhere near the cluster.

That's the problem Operators were built for. And once you've hit it once, you don't want to hit it again.

What an Operator Actually Is
Setting Up the Tools
Building a Real Operator, Step by Step
Applying Your First Custom Resource
Why This Pattern Actually Matters

What an Operator Actually Is (Not the Definition, the Reality)

Most people hear "Kubernetes Operator" and reach for the wrong analogy. It's not a plugin. Not middleware. Not a special cluster mode you toggle on.

Here's what it is: a Go program, running as an ordinary pod inside your cluster, watching the API server and taking action whenever something you care about changes. You define what "something you care about" means and what action makes sense. Kubernetes just runs the loop.

Two components make it work.

Custom Resource Definitions give you new vocabulary. You're not extending Kubernetes so much as teaching it a new noun. Call it MyApp, PostgresCluster, KafkaTopic, whatever fits your domain. From Kubernetes' perspective, a CRD is just another API object with a spec and a status. To your program, it's a declaration of intent it can read and act on.

Custom controllers do the actual work. Watch for changes, compare what exists against what should exist, close the gap. The loop fires on create, update, and delete events, and again on a periodic resync or any requeue the controller asks for, even when nothing changed. Because things drift. Pods restart. Network partitions heal. A well-written controller stays honest about reality rather than trusting its last known state.
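The compare-and-close-the-gap step can be sketched in plain Go, with no Kubernetes libraries involved. This is a minimal illustration, not how controller-runtime is implemented: the Action type, the diff function, and the name-to-version maps are all hypothetical stand-ins for real API objects.

```go
package main

import (
	"fmt"
	"sort"
)

// Action is one step a controller would take to close the gap.
// Hypothetical type, used only for this sketch.
type Action struct {
	Verb string // "create", "update", or "delete"
	Name string
}

// diff compares desired state against observed state and returns the
// actions needed to make observed match desired. Both maps go from
// object name to a version string (a stand-in for a full spec).
func diff(desired, observed map[string]string) []Action {
	var actions []Action
	for name, want := range desired {
		have, ok := observed[name]
		switch {
		case !ok:
			actions = append(actions, Action{"create", name})
		case have != want:
			actions = append(actions, Action{"update", name})
		}
	}
	for name := range observed {
		if _, ok := desired[name]; !ok {
			actions = append(actions, Action{"delete", name})
		}
	}
	// Sort by name for deterministic output; Go map iteration order is random.
	sort.Slice(actions, func(i, j int) bool { return actions[i].Name < actions[j].Name })
	return actions
}

func main() {
	desired := map[string]string{"web": "v2", "cache": "v1"}
	observed := map[string]string{"web": "v1", "old-job": "v1"}
	for _, a := range diff(desired, observed) {
		fmt.Println(a.Verb, a.Name)
	}
}
```

When desired and observed already match, diff returns nothing, which is exactly what a reconcile pass should do on an unchanged cluster.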

One thing worth saying plainly: writing an Operator is real engineering work. A three-service app with no stateful components? Overkill. Full stop. But a database cluster you operate yourself, a distributed queue, or any workload where the upgrade path has ordering constraints humans keep getting wrong at 2am: that's where this pattern pays back the investment. Tribal knowledge encoded in wikis doesn't page you when it's needed. A running controller does.

Setting Up the Tools

Three things to install. Do them in order and you'll avoid dependency headaches.

# Install kubectl
curl -LO https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Install minikube
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

# Install Operator SDK
curl -LO https://github.com/operator-framework/operator-sdk/releases/latest/download/operator-sdk_linux_amd64
sudo install operator-sdk_linux_amd64 /usr/local/bin/operator-sdk

minikube is fine for working through this. When you move to a real cluster, check RBAC before anything else. Your controller needs specific verbs on specific API groups, declared in marker annotations the SDK uses to generate a ClusterRole. Get those wrong and reconciliation quietly stalls. No crash. No failing pod. Usually the only clue is a "forbidden" message buried in the controller's logs. You'll sit there wondering if the binary even started.

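Quick sanity check: after deploying, run kubectl auth can-i list myapps.myapp.mycompany.com --as=system:serviceaccount:default:myapp-operator, substituting the namespace and service account your controller actually runs as. If it returns no, that's your problem right there.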

Building a Real Operator, Step by Step

For this walkthrough, the custom resource is called MyApp. Simple web service, returns a welcome message. Small enough to hold in your head, structured enough to extend into something real.

Step 1: Initialize the Project

operator-sdk init --domain=mycompany.com --repo=github.com/mycompany/myapp-operator
cd myapp-operator

After this runs, you have a complete project structure: manager entrypoint, scheme registration, controller-runtime wiring. None of that is yours to write. Your job is the domain logic and the type definitions. The SDK owns the scaffolding.

Step 2: Define Your Custom Resource

Open api/v1/myapp_types.go (the create api command in Step 3 generates this file, so run that first if it doesn't exist yet) and write the shape of your resource:

// api/v1/myapp_types.go

package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// MyAppSpec defines the desired state of MyApp
type MyAppSpec struct {
  // Add fields as needed
}

// MyAppStatus defines the observed state of MyApp
type MyAppStatus struct {
  // Add fields as needed
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// MyApp is the Schema for the myapps API
type MyApp struct {
  metav1.TypeMeta   `json:",inline"`
  metav1.ObjectMeta `json:"metadata,omitempty"`

  Spec   MyAppSpec   `json:"spec,omitempty"`
  Status MyAppStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// MyAppList contains a list of MyApp
type MyAppList struct {
  metav1.TypeMeta `json:",inline"`
  metav1.ListMeta `json:"metadata,omitempty"`
  Items           []MyApp `json:"items"`
}

func init() {
  SchemeBuilder.Register(&MyApp{}, &MyAppList{})
}

Spec is intent. Status is observed reality. Keep them separate in your mental model too, not just in the struct. When your program reads Spec, it's asking "what does the user want?" When it reads Status, it's asking "what does the cluster actually have right now?" Blur that line and you'll end up writing a program that gets confused about what it's even trying to fix. Those bugs are miserable to trace.
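To make the split concrete, here is a small sketch with the placeholder comments filled in. The Replicas, Message, and ReadyReplicas fields and the replicasToAdd helper are assumptions made up for illustration; the original scaffold leaves both structs empty, and a real types file would also carry the metav1 embedding shown above.

```go
package main

import "fmt"

// MyAppSpec is what the user asked for. Both fields are hypothetical.
type MyAppSpec struct {
	Replicas int32  `json:"replicas"`
	Message  string `json:"message,omitempty"`
}

// MyAppStatus is what the controller last observed. Hypothetical field.
type MyAppStatus struct {
	ReadyReplicas int32 `json:"readyReplicas"`
}

// replicasToAdd reads intent from Spec and reality from Status and
// returns the gap: positive means scale up, negative means scale down,
// zero means the resource has converged.
func replicasToAdd(spec MyAppSpec, status MyAppStatus) int32 {
	return spec.Replicas - status.ReadyReplicas
}

func main() {
	spec := MyAppSpec{Replicas: 3, Message: "welcome"}
	status := MyAppStatus{ReadyReplicas: 1}
	fmt.Println("replicas to add:", replicasToAdd(spec, status))
}
```

Note that the helper never writes to Spec. The user owns Spec; the controller only reports back through Status.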

Step 3: Generate the Scaffolding

operator-sdk create api --group=myapp --version=v1 --kind=MyApp

One command. It produces the types file and the controller file, with the RBAC marker annotations pre-populated. Running make manifests afterwards generates the CRD and RBAC manifests from those markers. You're not starting from a blank file.

Step 4: Write the Reconcile Function

This is the only part that genuinely matters. Open controllers/myapp_controller.go:

// controllers/myapp_controller.go

package controllers

import (
	"context"

	"github.com/go-logr/logr"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	myappv1 "github.com/mycompany/myapp-operator/api/v1"
)

// MyAppReconciler reconciles a MyApp object
type MyAppReconciler struct {
	client.Client
	Log    logr.Logger
	Scheme *runtime.Scheme
}

// +kubebuilder:rbac:groups=myapp.mycompany.com,resources=myapps,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=myapp.mycompany.com,resources=myapps/status,verbs=get;update;patch

// Reconcile is called whenever a watched MyApp changes or is requeued.
func (r *MyAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("myapp", req.NamespacedName)

	// Fetch the MyApp instance
	myapp := &myappv1.MyApp{}
	if err := r.Get(ctx, req.NamespacedName, myapp); err != nil {
		// A missing object just means it was deleted; only log real errors.
		if client.IgnoreNotFound(err) != nil {
			log.Error(err, "unable to fetch MyApp")
		}
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Reconciliation logic here

	return ctrl.Result{}, nil
}

// SetupWithManager registers this reconciler to watch MyApp objects.
func (r *MyAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&myappv1.MyApp{}).
		Complete(r)
}

Reconcile fires on every watched event. Read current state, find the gap, close it. One rule you cannot skip: write this function to be idempotent. Running it twice on the same unchanged state should produce zero side effects the second time. Kubernetes calls it more often than you expect. During startup. After leader election handoffs. After the controller pod itself restarts. On a cluster with hundreds of existing objects, that initial watch-list flush means a lot of calls in the first thirty seconds. Plan for it.
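Idempotency is easier to see in a toy example. The sketch below uses a hypothetical in-memory fakeCluster in place of the API server and a made-up ensureDeployment helper; it only demonstrates the check-before-write habit, not any real client call.

```go
package main

import "fmt"

// fakeCluster is a hypothetical in-memory stand-in for the API server.
type fakeCluster struct {
	objects map[string]string // deployment name -> image
	writes  int               // number of writes actually performed
}

// ensureDeployment makes sure a deployment called name runs image.
// It writes only when observed state differs from desired state, so
// calling it again on unchanged state has no side effects.
// It reports whether a write happened.
func ensureDeployment(c *fakeCluster, name, image string) bool {
	if current, ok := c.objects[name]; ok && current == image {
		return false // already converged, nothing to do
	}
	c.objects[name] = image
	c.writes++
	return true
}

func main() {
	c := &fakeCluster{objects: map[string]string{}}
	fmt.Println("first call wrote:", ensureDeployment(c, "web", "myapp:v1"))
	fmt.Println("second call wrote:", ensureDeployment(c, "web", "myapp:v1"))
	fmt.Println("total writes:", c.writes)
}
```

The second call finds the state already matching and returns without writing, which is the behavior you want when the controller restarts and replays every object it watches.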

Build incrementally inside this function. One concrete action first. Confirm it works end-to-end. Then add the next thing. Chain multiple writes in a single pass and you'll spend real time debugging a reconciler that does five things at once. It's not fun.

Step 5: Build and Run

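# Build and push the image (the IMG value is a placeholder; use your own registry and tag)
make docker-build docker-push IMG=<your-registry>/myapp-operator:v0.0.1

# Install the CRD into the cluster and run the controller locally against it
make install run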

Applying Your First Custom Resource

With everything running, create a MyApp object:

# myapp-instance.yaml

apiVersion: myapp.mycompany.com/v1
kind: MyApp
metadata:
  name: example-myapp
spec:
  # Add custom spec fields as needed

Apply it:

kubectl apply -f myapp-instance.yaml

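Check the logs. With the controller running locally, they appear in the terminal where you started it. If you deployed it into the cluster instead, read them from the manager deployment (the scaffold's default namespace is myapp-operator-system):

kubectl logs deployment/myapp-operator-controller-manager -n myapp-operator-system -c manager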

Nothing dramatic happens yet. The Reconcile body is empty. That's intentional. What you're verifying is that the event reaches the function, the object gets fetched, and nothing panics. Boring success. That's the right baseline before adding real behavior.

Why This Pattern Actually Matters

Back to that incident. StatefulSet race condition, ordering dependency, three hours of log-grepping.

A well-written controller rewrites that scenario entirely. Before touching the app, the program checks whether the database object has a Ready status condition set to true. If not, early return, requeue in ten seconds, check again. No race. No guesswork. No 2am page because someone deployed services in the wrong order.
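That gate can be sketched without pulling in any Kubernetes dependencies. The Condition and Result types below are local stand-ins that mirror the shape of metav1.Condition and controller-runtime's ctrl.Result (which has a RequeueAfter field); reconcileApp, isConditionTrue, and the deploy callback are hypothetical names for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a local stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// Result is a local stand-in for controller-runtime's ctrl.Result.
type Result struct {
	RequeueAfter time.Duration
}

// isConditionTrue reports whether a condition of the given type exists
// and has status "True".
func isConditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// reconcileApp refuses to touch the app until the database reports Ready.
// If the database is not ready it asks to be called again in ten seconds.
func reconcileApp(dbConditions []Condition, deploy func() error) (Result, error) {
	if !isConditionTrue(dbConditions, "Ready") {
		return Result{RequeueAfter: 10 * time.Second}, nil
	}
	if err := deploy(); err != nil {
		return Result{}, err
	}
	return Result{}, nil
}

func main() {
	deploy := func() error { fmt.Println("deploying app"); return nil }

	res, _ := reconcileApp([]Condition{{Type: "Ready", Status: "False"}}, deploy)
	fmt.Println("database not ready, requeue after:", res.RequeueAfter)

	res, _ = reconcileApp([]Condition{{Type: "Ready", Status: "True"}}, deploy)
	fmt.Println("database ready, requeue after:", res.RequeueAfter)
}
```

In a real controller the conditions would come from the database object fetched through the client, and the early return would be a ctrl.Result with RequeueAfter set. The ordering rule now lives in code that runs on every pass instead of in someone's memory.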

I've watched this pattern play out across a few different teams. Over months of running something stateful, operational knowledge accumulates in scattered places. Incident retrospectives. On-call handoff notes. The heads of whoever was there when things broke. Putting it in a controller pulls it into code. Version-controlled. Testable. Running continuously, not sitting in a doc nobody opens until something's already on fire.

This scaffold is minimal by design. What goes inside Reconcile is where the real investment lives, and it pays back as the workload grows more complex. Start simple. Keep the function idempotent. Let the pattern do what it was designed for.

Himanshu Pant Chief Operating Officer at Innostax

