🔄 Someone kubectl apply'd a Hotfix Directly. How Do You Detect and Prevent It?

The question

“How do you prevent configuration drift in a Kubernetes cluster?”

Configuration drift: the cluster’s actual state diverges from what’s declared in your source of truth. Someone runs kubectl edit deployment myapp to bump a memory limit during an incident. Someone adds a debug sidecar directly. Someone applies a YAML file from their laptop that was never committed to Git. The fix works. It goes undocumented. Six months later, a new deployment overwrites it. The incident recurs.

There are two distinct problems here that require different solutions:

Detection and remediation: how do you notice drift and revert it?
Prevention: how do you stop non-compliant resources from being created in the first place?

Detection and remediation: Argo CD selfHeal

If you’re using GitOps with Argo CD, detection and remediation are handled for you:

syncPolicy:
  automated:
    prune: true
    selfHeal: true

selfHeal: true means Argo CD continuously compares the live state of the resources it manages to the Git repo and reverts any divergence. Someone runs kubectl edit deployment myapp and changes the replica count? Argo CD detects the diff on its next reconciliation cycle (by default it polls Git roughly every 3 minutes, and it also watches live resources, so the revert often lands sooner) and reverts it.

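prune: true means resources that Argo CD tracks for the application but that are no longer defined in Git are deleted. Someone removed a ConfigMap from the repo? Gone from the cluster on the next sync. One caveat: a resource that was created directly and that Argo CD never tracked, such as a standalone debug pod applied from a laptop, sits outside the application and is not pruned. Catching those is a job for admission policies and audit logs.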
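
For context, here is a minimal sketch of where that syncPolicy block sits inside a full Argo CD Application. The repo URL, path, and names are hypothetical placeholders, not taken from a real setup:

```yaml
# Sketch of an Argo CD Application with automated sync.
# repoURL, path, and the myapp names are assumptions for illustration.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/k8s-manifests.git  # hypothetical repo
    targetRevision: main
    path: apps/myapp                                             # hypothetical path
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: myapp
  syncPolicy:
    automated:
      prune: true      # delete tracked resources that were removed from Git
      selfHeal: true   # revert live changes that diverge from Git
```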

This is the audit trail story too. Every legitimate change is a Git commit with an author, a timestamp, and a commit message. Any change to a managed resource that isn’t in Git doesn’t survive past the next reconciliation. If you want to know what changed and when, git log is the answer.

The gap selfHeal doesn’t close

selfHeal reverts drift after the fact. There’s a window — up to 3 minutes — where a drifted resource is serving traffic. For most changes, that’s fine. For a bad resource (wrong RBAC, missing network policy, container running as root), 3 minutes is enough to be a problem.

The other gap: selfHeal doesn’t tell you who made the change or generate an alert. It just silently fixes it. You need audit logging (kube-apiserver --audit-log-path) or an alerting rule on Argo CD’s health events to know that drift happened.
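
As a rough sketch of the audit logging side: the API server takes an audit policy file (passed with --audit-policy-file, alongside --audit-log-path) that decides which requests get recorded. The resource selection below is an assumption for illustration; tune it to what you actually care about:

```yaml
# Sketch of a kube-apiserver audit policy that records who mutated workloads.
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - RequestReceived        # skip the duplicate "request received" stage
rules:
  # Log user, verb, resource, and timestamp (no request bodies) for workload writes.
  - level: Metadata
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "apps"
        resources: ["deployments", "statefulsets", "daemonsets"]
  # Drop everything else to keep the log small.
  - level: None
```

Argo CD’s own reverts show up in the same log under its service account, so the manual edit and the automated fix can be told apart by username.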

Prevention: Kyverno

Kyverno is a policy engine that runs as a Kubernetes admission webhook. Every resource creation or modification goes through it before being persisted. If the resource violates a policy, Kyverno can reject it outright (enforce mode) or allow it with a warning (audit mode).

The policies are Kubernetes resources themselves — they live in Git, they’re applied via GitOps, they’re versioned. No separate policy language to learn.

A policy that requires readiness probes on all Deployments:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-readiness-probe
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-readiness-probe
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Deployments must define a readiness probe."
        pattern:
          spec:
            template:
              spec:
                containers:
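                  - (name): "*"
                    readinessProbe:
                      # periodSeconds is defaulted whenever a probe is defined, so this
                      # matches any probe type (httpGet, tcpSocket, or exec)
                      periodSeconds: ">0"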

With this policy active: kubectl apply -f deployment-without-probe.yaml is rejected at the API server. The error message is the one you defined in message. The deployment never reaches etcd.

A policy that blocks containers running as root:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-root-containers
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-runAsNonRoot
      match:
        any:
          - resources:
              kinds: [Deployment, StatefulSet, DaemonSet]
      validate:
        message: "Containers must not run as root."
        pattern:
          spec:
            template:
              spec:
                containers:
                  - (name): "*"
                    securityContext:
                      runAsNonRoot: true
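
The rule above insists on runAsNonRoot at the container level. If your teams set it once at the pod level instead, a variant of the validate block using anyPattern can accept either. This is a sketch only; it doesn’t cover initContainers, and the anchors are worth checking against your Kyverno version:

```yaml
# Variant of the validate block: pass if EITHER pattern matches.
validate:
  message: "Containers must not run as root."
  anyPattern:
    # Option 1: set at pod level. Containers may omit the field,
    # but if they set it, it must still be true.
    - spec:
        template:
          spec:
            securityContext:
              runAsNonRoot: true
            containers:
              - =(securityContext):
                  =(runAsNonRoot): true
    # Option 2: no pod-level setting, so every container must set it.
    - spec:
        template:
          spec:
            containers:
              - securityContext:
                  runAsNonRoot: true
```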

A policy that enforces resource limits (common in multi-tenant clusters):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-limits
      match:
        any:
          - resources:
              kinds: [Deployment]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      limits:
                        memory: "?*"
                        cpu: "?*"
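
To make the three validation policies concrete, here is a sketch of a Deployment that should pass all of them. The image, port, probe path, and user ID are hypothetical:

```yaml
# Sketch of a Deployment with a readiness probe, non-root security context,
# and CPU/memory limits. Names and values are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0.0   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:            # satisfies require-readiness-probe
            httpGet:
              path: /healthz         # hypothetical health endpoint
              port: 8080
            periodSeconds: 10
          securityContext:           # satisfies disallow-root-containers
            runAsNonRoot: true
            runAsUser: 10001         # any non-zero UID the image can run as
          resources:                 # satisfies require-resource-limits
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
```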

Kyverno can also mutate and generate

Policies aren’t only for validation. Kyverno can mutate incoming resources (add default labels, inject sidecars, set default resource requests) and generate new resources in response to events (create a NetworkPolicy whenever a new namespace is created).

Auto-add a standard label to every Deployment:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-labels
spec:
  rules:
    - name: add-team-label
      match:
        any:
          - resources:
              kinds: [Deployment]
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              managed-by: kyverno
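
As written, that patch overwrites any managed-by label the Deployment already carries. If you’d rather only fill it in when it’s missing, Kyverno’s add-if-not-present anchor is one option. A sketch of just the mutate block:

```yaml
# Variant of the mutate block: only add the label when it is absent.
mutate:
  patchStrategicMerge:
    metadata:
      labels:
        # +(key) is the "add if not present" anchor; an existing value is left alone
        +(managed-by): kyverno
```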

Auto-create a default NetworkPolicy when a namespace is created:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-networkpolicy
spec:
  rules:
    - name: default-deny
      match:
        any:
          - resources:
              kinds: [Namespace]
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny-all
        namespace: "{{request.object.metadata.name}}"
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress
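
For a sense of the result: creating a namespace called team-a (a hypothetical name) should leave you with roughly this object in it. Kyverno may also attach its own bookkeeping labels, which are omitted here:

```yaml
# Approximate NetworkPolicy generated for a namespace named team-a.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a        # filled in from request.object.metadata.name
spec:
  podSelector: {}          # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

With no ingress or egress rules listed, all traffic to and from pods in the namespace is denied until a more specific policy allows it.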

The complete drift prevention picture

Developer runs: kubectl apply -f bad-deployment.yaml
  → API server receives request
  → Kyverno admission webhook intercepts
  → Policy check: no readiness probe → Rejected
  → API server returns an error carrying Kyverno's message
  → Resource never reaches etcd

Developer runs: kubectl edit deployment myapp (valid change, just not via Git)
  → Edit succeeds (no policy violation)
  → Argo CD reconciliation fires (within 3 minutes)
  → Diff detected: cluster state ≠ Git state
  → selfHeal: revert to Git state
  → If audit logging is enabled: the original edit is recorded with username and timestamp

Git is the audit trail for what should be there. kube-apiserver audit logs are the trail for what was attempted. Kyverno is the enforcer at admission time. Argo CD is the continuous reconciler. Four layers, each with a different job.

What interviewers are actually testing

The follow-up is usually: “What’s the difference between Kyverno and OPA Gatekeeper?”

Both are admission webhook policy engines. The practical differences:

Kyverno: policies are k8s-native YAML, no separate language to learn. Generate and mutate policies built in. Easier to get started with.
OPA Gatekeeper: policies are written in Rego, a purpose-built policy language that’s more expressive but has a steeper learning curve. Better if you’re already using OPA elsewhere (Terraform, microservice authorization).
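
To see the difference in practice, here is a sketch of a required-label check in Gatekeeper, modeled on the project’s well-known required-labels example. The constraint name and the team label are assumptions. Note that the Rego lives inside the YAML, and that the template and the constraint are two separate objects:

```yaml
# ConstraintTemplate: defines the policy logic (Rego) and its parameters.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }
---
# Constraint: applies the template to Deployments with concrete parameters.
# Apply it after the template has been created and established.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: deployments-must-have-team   # hypothetical name
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels: ["team"]                 # hypothetical required label
```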

For a Kubernetes-only environment, Kyverno is the pragmatic choice. For a platform team that uses OPA across the stack, Gatekeeper gives you policy consistency.

The deeper follow-up: “How do you test policies before enforcing them?” Use Audit mode first (validationFailureAction: Audit). Violations are logged as PolicyReport objects but requests aren’t rejected. Review the reports, fix the existing violations, then switch to Enforce. Never flip directly to Enforce in production: workloads that already violate the policy keep running, but their next update or rollout gets rejected.
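
A sketch of what that first step might look like, reusing the resource-limits policy in Audit mode. Depending on your Kyverno version the action may be configured per rule instead, so treat the field placement as an assumption to verify:

```yaml
# Sketch: the resource-limits policy rolled out in report-only mode.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Audit   # report violations, do not reject requests
  background: true                 # also scan resources that already exist
  rules:
    - name: check-limits
      match:
        any:
          - resources:
              kinds: [Deployment]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      limits:
                        memory: "?*"
                        cpu: "?*"
```

Once it has been running for a while, kubectl get policyreport -A lists the reports to work through before changing Audit to Enforce.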

This is part of a series on Kubernetes interview questions. Previously: network isolation between services.
