Portal:Toolforge/Admin/Runbooks/Kyverno

This page contains runbooks for dealing with Kyverno problems.

The procedures in this runbook require admin permissions to complete.

Kyverno is in the hot path for user workload scheduling in Toolforge. Every pod created by a tool account is evaluated by Kyverno through an admission webhook, which validates and mutates the resource being created.

It is therefore imperative that Kyverno is up and running at all times.

Kyverno policies are created by maintain-kubeusers for every tool account.

What happens if Kyverno is down, if no Kyverno policies are present, or if they are not READY

Kyverno is currently configured in fail-closed mode, meaning that if it is down, policies won't be evaluated and the admission webhook will reject new user workload creation.

Per our configuration, a Kyverno policy must evaluate correctly for a tool account workload to be admitted into the cluster.

Therefore, if any of the following is true:

  • Kyverno is down
  • No Kyverno policy exists for a given tool account namespace
  • A Kyverno policy exists in the tool account namespace, but it is not in READY state

The result is the same: no new tool workloads (pods) will be allowed to run in the cluster.
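
To check the situation for a specific tool account, you can run something like the following on a Toolforge k8s control node. This is a sketch: tool-<toolname> is a placeholder for the affected tool namespace, the policy name is the one shown in the Debugging section below, and the webhook configuration name is the one used in the removal commands further down.

# Is there a policy for this tool, and is it READY?
sudo -i kubectl -n tool-<toolname> get policy toolforge-kyverno-pod-policy

# Is the admission webhook fail-closed? "failurePolicy: Fail" means fail-closed
sudo -i kubectl get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg -o yaml | grep failurePolicy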

How to fix it

If Kyverno is down, try any of the following:

  • recreate the pods manually from Toolforge k8s control nodes (see the sketch after this list). TODO: put the actual command here.
  • redeploy it from scratch, using the toolforge-deploy repository. TODO: put the actual command here.
  • verify the resources (RAM, CPU) of the cluster and its key components (see the sketch after this list). Kyverno can be very resource intensive, both for itself and for other Kubernetes components (apiserver, controller-manager, etcd, etc.). TODO: put the actual commands here.
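
For the pod recreation and the resource checks, a minimal sketch (the deployment names are the ones listed in the Debugging section below; kubectl top requires metrics-server):

# Recreate the Kyverno pods via a rolling restart of every deployment in the namespace
sudo -i kubectl -n kyverno rollout restart deployment
sudo -i kubectl -n kyverno rollout status deployment/kyverno-admission-controller

# Check resource usage on the nodes and of the Kyverno pods
sudo -i kubectl top nodes
sudo -i kubectl -n kyverno top pods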

If policies are not present, try any of the following:

  • restart maintain-kubeusers (see the sketch after this list). TODO: put the actual command here.
  • redeploy maintain-kubeusers. TODO: put the actual command here.
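
For the restart, a minimal sketch (the deployment name is the one shown in the Debugging section below):

sudo -i kubectl -n maintain-kubeusers rollout restart deployment maintain-kubeusers
sudo -i kubectl -n maintain-kubeusers rollout status deployment/maintain-kubeusers

Then check the maintain-kubeusers logs (see the Debugging section below) for errors.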

If policies are present but not in READY state:

  • verify that the Kyverno pods are running correctly (see the sketch after this list)
  • verify that the Kyverno pods have enough resources allocated to them
  • verify that the Kubernetes control plane is healthy, resource-wise, etc.
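
A sketch of where to look first (tool-<toolname> is a placeholder for the affected tool namespace; the policy and deployment names are the ones shown in the Debugging section below):

# Inspect the policy status and events in the affected tool namespace
sudo -i kubectl -n tool-<toolname> describe policy toolforge-kyverno-pod-policy

# Check the Kyverno admission controller logs for errors
sudo -i kubectl -n kyverno logs deploy/kyverno-admission-controller --timestamps=true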

How to remove Kyverno from the hot path (don't do this except in an extreme emergency)

In case of extreme emergency, we can remove Kyverno from the hot path, thus allowing every tool user workload to be admitted into the cluster without policy verification.

This is an extreme security risk, and you should never do it unless there is a major outage in progress.

Run this on a Toolforge k8s control node to disable Kyverno admission configuration:

sudo -i kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg
sudo -i kubectl delete mutatingwebhookconfiguration kyverno-resource-mutating-webhook-cfg

Run this to stop the main Kyverno daemon from running:

sudo -i kubectl scale deploy kyverno-admission-controller -n kyverno --replicas 0
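
To put Kyverno back in the hot path afterwards, scale the admission controller back up (7 was the configured replica count at the time of writing, per the Debugging section below). Kyverno should recreate its webhook configurations once the admission controller is running again; if it does not, redeploy it from the toolforge-deploy repository.

sudo -i kubectl scale deploy kyverno-admission-controller -n kyverno --replicas 7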

Error / Incident

Information about some specific alerts we have:

Toolforge Kyverno unknown state

This means Prometheus was unable to fetch one of the main metrics of Kyverno. It can mean Kyverno is down.

See the sections above for what happens if Kyverno is down and how to fix it.

Toolforge Kyverno low policy resources

This means we have a surprisingly low number of policy resources loaded into the cluster. It may indicate some kind of misconfiguration or error in maintain-kubeusers, or that Kyverno is having a hard time reconciling policies into READY status.

See the sections above for what happens if Kyverno is down and how to fix it.

Toolforge Kyverno no policy resources

This means no policy resources were loaded into the cluster. It may indicate some kind of misconfiguration or error in maintain-kubeusers, or that Kyverno is not running at all.

See the sections above for what happens if Kyverno is down and how to fix it.

Debugging

Some debugging information.

How to see the state of the Kyverno pods

Run this:

user@tools-k8s-control-7:~$ sudo -i kubectl -n kyverno get pods
NAME                                                       READY   STATUS      RESTARTS   AGE
kyverno-admission-controller-5b9779d5c6-2zsrg              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-59f2p              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-6jk78              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-7fbcd              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-fptrv              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-ljdkl              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-sg5vg              1/1     Running     0          17d
kyverno-background-controller-5d6bc965bd-bjk6d             1/1     Running     0          17d
kyverno-background-controller-5d6bc965bd-nnnj4             1/1     Running     0          17d
kyverno-cleanup-admission-reports-28679960-2k4dp           0/1     Completed   0          6m10s
kyverno-cleanup-cluster-admission-reports-28679960-kkfh5   0/1     Completed   0          5m56s
kyverno-cleanup-controller-9bccdf4d6-5sgwk                 1/1     Running     0          18d
kyverno-cleanup-controller-9bccdf4d6-zh8dp                 1/1     Running     0          17d
kyverno-reports-controller-8849f9684-48xcz                 1/1     Running     0          17d
kyverno-reports-controller-8849f9684-trrr6                 1/1     Running     0          17d
user@tools-k8s-control-7:~$ sudo -i kubectl -n kyverno get deploy
NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
kyverno-admission-controller    7/7     7            7           32d
kyverno-background-controller   2/2     2            2           32d
kyverno-cleanup-controller      2/2     2            2           32d
kyverno-reports-controller      2/2     2            2           18d
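
If any pod is not Running or READY, describe it and check recent events and logs, for example (<pod-name> is a placeholder):

user@tools-k8s-control-7:~$ sudo -i kubectl -n kyverno describe pod <pod-name>
user@tools-k8s-control-7:~$ sudo -i kubectl -n kyverno get events --sort-by=.lastTimestamp | tail -n 20
user@tools-k8s-control-7:~$ sudo -i kubectl -n kyverno logs deploy/kyverno-admission-controller --timestamps=true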


How to see the state of maintain-kubeusers

This is how you can get information about the maintain-kubeusers pods:

user@tools-k8s-control-7:~$ sudo -i kubectl -n maintain-kubeusers get pods
NAME                                 READY   STATUS    RESTARTS         AGE
maintain-kubeusers-dc9d6978b-nthbw   1/1     Running   20 (3m44s ago)   2d

To see logs:

user@tools-k8s-control-7:~$ sudo -i kubectl -n maintain-kubeusers logs deploy/maintain-kubeusers --timestamps=true

In case you want to check the logs for a previous container restart:

user@tools-k8s-control-7:~$ sudo -i kubectl -n maintain-kubeusers logs deploy/maintain-kubeusers --timestamps=true --previous

See also Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown

How to see the policy resources

To check the policies, run this:

user@tools-k8s-control-7:~$ sudo -i kubectl get policy -A
NAMESPACE                                   NAME                           BACKGROUND   VALIDATE ACTION   READY   AGE     MESSAGE
tool-a-list-bulding-tool                    toolforge-kyverno-pod-policy   true         Enforce           True    29d     Ready
tool-aaabot                                 toolforge-kyverno-pod-policy   true         Enforce           True    29d     Ready
tool-aalertbot                              toolforge-kyverno-pod-policy   true         Enforce           True    29d     Ready
tool-abbe98tools                            toolforge-kyverno-pod-policy   true         Enforce           True    29d     Ready
tool-abbreviso                              toolforge-kyverno-pod-policy   true         Enforce           True    29d     Ready
tool-abcgames                               toolforge-kyverno-pod-policy   true         Enforce           True    29d     Ready
tool-abdumubot                              toolforge-kyverno-pod-policy   true         Enforce           True    29d     Ready
tool-abibot                                 toolforge-kyverno-pod-policy   true         Enforce           True    29d     Ready
tool-abigor                                 toolforge-kyverno-pod-policy   true         Enforce           True    29d     Ready
[..]

To see how many of them are in READY status, run this:

user@tools-k8s-control-7:~$ sudo -i kubectl get policy -A | grep Ready | wc -l
3318

To see how many of them are not READY, run this:

user@tools-k8s-control-7:~$ sudo -i kubectl get policy -A --no-headers | grep -v Ready | wc -l
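
As a rough cross-check, the total number of policies should be close to the number of tool namespaces, since maintain-kubeusers creates one policy per tool account:

user@tools-k8s-control-7:~$ sudo -i kubectl get namespaces --no-headers | grep -c '^tool-'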

Common issues

Add new issues here when you encounter them!

See upstream docs:

Old incidents

Old incidents related to Kyverno: