Mw-mcrouter
This is the daemonset that proxies all MediaWiki memcached requests to our memcached cluster. It runs the almighty mcrouter.
mcrouter image and exporter
Image is in the production images repo, where the defaults are set.
Image version in production is defined in the puppet repo under profile::kubernetes::deployment_server::general:common_images
Daemonset
mw-mcrouter runs as a daemonset, i.e. every k8s node runs an instance of it. This includes the dedicated Kubernetes nodes running kask, so there are mw-mcrouter pods that receive no traffic at all.
Service
mw-mcrouter is using the mcrouter chart. Notable keys in values.yaml:
- cache:mcrouter:public_service: true
  enables mcrouter as a standalone service
- service:use_node_local_endpoints: true
  routes requests to the node-local endpoint of a pod
- cache:mcrouter:service:clusterIP
  Static IP (per DC) where the service listens, as defined in Kubernetes/Service_ips
  - eqiad ClusterIP: 10.64.72.12
  - codfw ClusterIP: 10.192.72.12
values-eqiad.yaml
cache:
  mcrouter:
    service:
      clusterIP: 10.64.72.12
    enabled: true
    route_prefix: eqiad/mw
    zone: eqiad
    routes:
      - route: /eqiad/mw
        pool: eqiad-servers
        failover_time: 600
      - route: /codfw/mw
        pool: codfw-servers
        failover_time: 600
      - route: /eqiad/mw-wan
        failover_time: 600
        pool: eqiad-servers
        replica:
          route: /codfw/mw-wan
          pool: codfw-servers
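To illustrate how the routes in values-eqiad.yaml relate keys to pools, here is a minimal Python sketch. It is a simplified model under stated assumptions, not mcrouter's own implementation: it assumes a key may carry a routing prefix of the form /region/cluster/ and that unprefixed keys fall back to route_prefix. The ROUTES table mirrors the values shown in this section; the function name resolve_pool is hypothetical.

```python
from typing import Optional, Tuple

# Simplified model of the routes in values-eqiad.yaml (not mcrouter's own code).
ROUTE_PREFIX = "/eqiad/mw"  # mirrors route_prefix: eqiad/mw

ROUTES = {
    "/eqiad/mw": {"pool": "eqiad-servers", "replica_pool": None},
    "/codfw/mw": {"pool": "codfw-servers", "replica_pool": None},
    "/eqiad/mw-wan": {"pool": "eqiad-servers", "replica_pool": "codfw-servers"},
}


def resolve_pool(key: str) -> Tuple[str, str, Optional[str]]:
    """Return (route, pool, replica_pool) for a possibly prefixed key."""
    route = ROUTE_PREFIX
    if key.startswith("/"):
        # Assumed prefixed form: /<region>/<cluster>/<actual key>
        parts = key.split("/", 3)
        if len(parts) < 4:
            raise ValueError(f"malformed routing prefix in {key!r}")
        route = f"/{parts[1]}/{parts[2]}"
    if route not in ROUTES:
        raise KeyError(f"no route configured for {route}")
    entry = ROUTES[route]
    return route, entry["pool"], entry["replica_pool"]
```

For example, resolve_pool("/eqiad/mw-wan/foo") reports the eqiad-servers pool with codfw-servers as the configured replica pool, while an unprefixed key resolves through /eqiad/mw.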
Deployment
In the mcrouter chart, in daemonset.yaml, this ds is configured to update one pod at a time.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
During an mw-mcrouter deployment:
- Generally, mcrouter's configuration and image rarely need an update
- If a new image has to be pulled, the deployment may take as long as 15 minutes
  - This is due to maxUnavailable: 1 above
- While the rollout is in progress, the following alerts may fire:
  - Alerts for elevated mw-memcached errors
  - Alerts for mw-mcrouter helmfile being in a bad state
- All the alerts above will clear once the rollout finishes
- If the deployment is stuck, e.g. because a node has insufficient resources to host the mcrouter pod, see "Deployment is stalled" under Troubleshooting
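A rough back-of-the-envelope model shows why a rollout with maxUnavailable: 1 is slow: at most maxUnavailable pods are replaced at a time, so the total time grows roughly linearly with the number of nodes. The sketch below is an estimate under stated assumptions, not a measurement. The function name and the per-pod replacement time are hypothetical, and treating the rollout as strict batches is a simplification.

```python
import math


def estimate_rollout_seconds(pods: int, seconds_per_pod: float,
                             max_unavailable: int = 1) -> float:
    """Estimate total rollout time when pods are replaced in batches."""
    if pods < 0 or max_unavailable < 1:
        raise ValueError("pods must be >= 0 and max_unavailable >= 1")
    # Each batch replaces up to max_unavailable pods (simplifying assumption).
    batches = math.ceil(pods / max_unavailable)
    return batches * seconds_per_pod
```

With 210 pods and an assumed 4 seconds per pod, estimate_rollout_seconds(210, 4) gives 840 seconds, i.e. 14 minutes, which is in line with the upper bound mentioned in the list.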
Testing changes
The safest way to test changes in mcrouter is to switch the mw-debug MediaWiki deployment to use the in-pod mcrouter container. This is described in the next section.
Switching Mediawiki to in-pod mcrouter container
While it sounds complicated, it is not. To switch mw-debug in eqiad to use the in-pod container, the following stanza must be added:
mw-debug/values-eqiad.yaml
cache:
  mcrouter:
    enabled: true
    route_prefix: eqiad/mw
    zone: eqiad
    routes:
      - route: /eqiad/mw
        pool: eqiad-servers
        failover_time: 600
      - route: /codfw/mw
        pool: codfw-servers
        failover_time: 600
      - route: /eqiad/mw-wan
        pool: eqiad-servers
        failover_time: 600
        replica:
          route: /codfw/mw-wan
          pool: codfw-servers
      # Wikifunctions routes, omitted in production
      # - route: /local/wf
      #   pool: wf-eqiad
      #   # No failover for wikifunction
      #   failover_time: 0
#
# use only if testing new images
# common_images:
#   mcrouter:
#     mcrouter: mcrouter:2023.07.17.00-1-20240714
#     exporter: prometheus-mcrouter-exporter:0.0.1-3-20240714
php:
  envvars:
    MCROUTER_SERVER: "127.0.0.1:11213"
    STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST: false
    # MCROUTER_SERVER: "10.64.72.12:4442" # mcrouter-main.mw-mcrouter.svc.cluster.local
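After switching, a quick way to confirm that something answers on the configured MCROUTER_SERVER endpoint is to send it the memcached ASCII "version" command. The following is a minimal sketch, assuming Python is available wherever you run it and that the endpoint is reachable from there. The function name mcrouter_version is hypothetical; the default host and port are taken from the stanza above.

```python
import socket


def mcrouter_version(host: str = "127.0.0.1", port: int = 11213,
                     timeout: float = 2.0) -> str:
    """Send the ASCII 'version' command and return the reply line."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"version\r\n")
        reply = b""
        # Read until the end of the first reply line or until the peer closes.
        while not reply.endswith(b"\r\n"):
            chunk = conn.recv(1024)
            if not chunk:
                break
            reply += chunk
    return reply.decode("ascii", errors="replace").strip()
```

To check the standalone service instead, pass the ClusterIP and port from the commented-out MCROUTER_SERVER line, e.g. mcrouter_version("10.64.72.12", 4442).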
Troubleshooting
When dealing with Kubernetes, your answer may be found in the Kubernetes kubectl Cheat Sheet.
Memcached server is down
That is OK; the gutter pool will pick up its traffic.
Memcached Gutter Pool server is down, and we need the Gutter Pool
In this case:
- The gutter pool server in question MUST be removed from the configuration
- Merge in puppet
- Run puppet on the active deployment server
- Deploy mw-mcrouter
Deployment is stalled
If you are deploying a new version of the daemonset, but you see pods stuck in the previous version and elevated mw-memcached errors:
- Check the daemonset's status
jiji@deploy1002:~$ kubectl get ds
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
mcrouter-main   210       210       210     210          210         <none>          30d
- Check events and pods to find which node may be stalling the rollout
- If it is a resource problem (e.g. insufficient CPU), you may delete a random pod on the node in question to free up resources (as root)
kube_env admin eqiad
kubectl -n mw-mcrouter get events --sort-by=.metadata.creationTimestamp
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=wikikube-worker1001.eqiad.wmnet
kubectl -n mw-api-ext delete po mw-api-ext.eqiad.main-koko-lala
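When checking the daemonset's status, the signal to look for is UP-TO-DATE lagging behind DESIRED. The sketch below parses pasted `kubectl get ds` output and reports that gap per daemonset. It is an illustrative helper, not part of any existing tooling: the function name rollout_lag is hypothetical, and it assumes the default column order shown in this section.

```python
def rollout_lag(kubectl_get_ds_output: str) -> dict:
    """Map daemonset name -> number of pods not yet up to date."""
    lag = {}
    for line in kubectl_get_ds_output.splitlines():
        fields = line.split()
        # Skip blank lines, the shell prompt, and the header row.
        if len(fields) < 6:
            continue
        if not (fields[1].isdigit() and fields[4].isdigit()):
            continue
        # Columns: NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE ...
        name, desired, up_to_date = fields[0], int(fields[1]), int(fields[4])
        lag[name] = desired - up_to_date
    return lag
```

A lag of 0 means every pod runs the new version; a positive number that stops shrinking suggests the rollout is stalled and the events and per-node pod listings are worth checking.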