Event Platform/EventStreams/Administration

See EventStreams for an overview of the EventStreams service.

EventStreams is a service-template-node based service. It glues KafkaSSE together with common Wikimedia service features such as logging, error reporting, metrics, configuration, and deployment.

Internally, EventStreams is available at eventstreams.svc.${::site}.wmnet. External traffic to stream.wikimedia.org is routed to it by Varnish and LVS.
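A quick way to verify the service is up is to consume a few events from the public endpoint (the recentchange stream is also what the monitoring check below uses):

# Consume a few SSE events from the public endpoint, then stop.
curl -s -H 'Accept: text/event-stream' https://stream.wikimedia.org/v2/stream/recentchange | head -n 20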

EventStreams in production is configured and deployed using WMF's Deployment pipeline.

Configuration

EventStreams is configured in the operations/deployment-charts repository. Configuration is spread between the defaults in charts/eventstreams and the production-specific configuration in helmfile.d.

helmfile.d values.yaml files contain the list of allowed_streams that EventStreams will expose. The actual Kafka topics are retrieved from EventStreamConfig. Our event topics are prefixed by datacenter name; this mapping is abstracted away from EventStreams consumers.
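As an illustration of the mapping, a single logical stream corresponds to one topic per datacenter. Stream settings can also be inspected via the EventStreamConfig API (the action=streamconfigs module, assumed to be available on metawiki):

# A logical stream maps to one Kafka topic per datacenter, e.g.:
#   stream: mediawiki.recentchange
#   topics: eqiad.mediawiki.recentchange, codfw.mediawiki.recentchange
# Inspect stream settings via the EventStreamConfig API (assumption: metawiki):
curl -s 'https://meta.wikimedia.org/w/api.php?action=streamconfigs&format=json'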

Kafka

EventStreams is backed by the main Kafka clusters. As of 2018-08, EventStreams is multi-DC capable. EventStreams in eqiad consumes from the Kafka main-eqiad cluster, and EventStreams in codfw consumes from the Kafka main-codfw cluster. Kafka MirrorMaker is responsible for mirroring the topics from eqiad to codfw and vice versa.
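A sketch of how to confirm the mirroring, assuming kafkacat is available and using a hypothetical broker hostname:

# List topic metadata on the main-eqiad cluster; both eqiad.* and codfw.*
# prefixed topics should be present, since MirrorMaker copies them cross-DC.
# The broker hostname below is hypothetical; substitute a real main-eqiad broker.
kafkacat -b kafka-main1001.eqiad.wmnet:9092 -L | grep 'mediawiki.recentchange'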

NodeJS Kafka Client

KafkaSSE uses node-rdkafka (as do other production NodeJS services that use Kafka).

Repositories

Repository                      Description
KafkaSSE (github)               Generic Kafka consumer -> SSE NodeJS library.
eventstreams (github)           EventStreams implementation using KafkaSSE and service-template-node.
operations/deployment-charts    Helm chart repository for all production Kubernetes-based services, including EventStreams.

Deployment

EventStreams is deployed on Kubernetes, with a deployment also on the beta cluster.

Here is the runbook (assuming you have permission for everything):

  1. Merge the patch to main in GitLab.
  2. Run the trigger_release job under the deploy stage to increment the version number and publish an image to the Docker registry.
    1. If you don't want to release a new version, you can run the build-and-publish-feature-branch job instead; it publishes an image to the Docker registry under the name printed in the job console.
  3. Update the image name under the main_app.version key in the deployment-charts repo, in both helmfile.d/services/eventstreams/values.yaml and helmfile.d/services/eventstreams-internal/values.yaml.
  4. ssh deployment.eqiad.wmnet and change into the helmfile directory for the service you are deploying:

cd /srv/deployment-charts/helmfile.d/services/eventstreams            # eventstreams
cd /srv/deployment-charts/helmfile.d/services/eventstreams-internal   # eventstreams-internal

Then deploy each service to staging first, then codfw, then eqiad:
kube_env eventstreams staging
helmfile -e staging -i apply --context 5

kube_env eventstreams-internal staging
helmfile -e staging -i apply --context 5

kube_env eventstreams codfw
helmfile -e codfw -i apply --context 5

kube_env eventstreams-internal codfw
helmfile -e codfw -i apply --context 5

kube_env eventstreams eqiad
helmfile -e eqiad -i apply --context 5

kube_env eventstreams-internal eqiad
helmfile -e eqiad -i apply --context 5

Check by tunnelling: ssh -N -L4992:eventstreams-internal.discovery.wmnet:4992 bast1003.wikimedia.org and then browsing to localhost:4992.
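The same check as a copy-pasteable sketch (whether the internal endpoint serves TLS on port 4992 is an assumption; try plain http:// if https:// fails):

# Open the tunnel in the background, probe the stream, then close the tunnel.
ssh -N -L4992:eventstreams-internal.discovery.wmnet:4992 bast1003.wikimedia.org &
TUNNEL_PID=$!
curl -sk -H 'Accept: text/event-stream' 'https://localhost:4992/v2/stream/recentchange' | head -n 5
kill "$TUNNEL_PID"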

See also: Deployments on kubernetes

Submitting changes

Change to KafkaSSE library

KafkaSSE is hosted in GitLab, so you must either submit a merge request or push a change there.

kafka-sse is an npm dependency of EventStreams.

If you update kafka-sse, you should bump the package version using the trigger_release job, which publishes an npm package to the project's package registry.
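After a release, bumping the dependency in the eventstreams working copy looks roughly like this (the version number is illustrative):

# In the eventstreams repo: pull in the newly released kafka-sse version.
npm install kafka-sse@3.1.0 --save
# Verify the dependency bump in package.json before committing.
git diff package.json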

Change to mediawiki/services/eventstreams repository

EventStreams is hosted in GitLab. Use merge requests to submit a change there. If you've modified the KafkaSSE repository, you should update the kafka-sse dependency version in package.json. Merged changes in this repository will result in a new Docker image being built.

To test locally:

# Build the new Docker image
DOCKER_BUILDKIT=1 docker build --target production -f .pipeline/blubber.yaml --platform=linux/amd64 . -t es

# Run the Docker image locally. The UI will be available at http://localhost:8092/
docker run --rm -it -v $(pwd):/es -w /es --net=host es

You can also test on deployment-prep (go to horizon.wikimedia.org -> deployment-prep -> filter for "event"); the Docker image to deploy is set in the VM's Puppet configuration. Once updated, run Puppet on the EventStreams beta VM and check that everything is fine (you can also use docker logs etc. if you want).
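Concretely, updating the beta VM looks roughly like this, assuming the standard run-puppet-agent wrapper is present (the VM hostname below is hypothetical; find the real one in Horizon):

# On the deployment-prep EventStreams VM (hostname hypothetical):
ssh deployment-eventstreams-1.deployment-prep.eqiad1.wikimedia.cloud
sudo run-puppet-agent                       # picks up the new image from the Puppet config
sudo docker ps                              # confirm the container restarted with the new image
sudo docker logs --tail 50 <container-id>   # check for startup errors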

Finally, the result will be displayed at https://stream-beta.wmflabs.org/

If you want to see an event displayed for mediawiki.revision-create, you can edit https://en.wikipedia.beta.wmflabs.org/wiki/Polar_bear
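To watch for that event from the command line:

# Follow the beta stream, then make the test edit in a browser; the
# corresponding revision-create event should appear within a few seconds.
curl -s -H 'Accept: text/event-stream' https://stream-beta.wmflabs.org/v2/stream/mediawiki.revision-create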

For details on how to access and work with deployment-prep refer to Event Platform/Beta Deployments.

Update operations/deployment-charts repository

Once a new Docker image has been built, you'll need to update the image version (the main_app.version key, as in the runbook above) in the helmfile.d eventstreams values.yaml files.

Deploy

See: Deployments_on_kubernetes#Code_deployment/configuration_changes and Event Platform/Beta Deployments

Logs

Logs are sent to logstash. You can view them in Kibana.

Metrics

https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams

Throughput limits

As of 2019-07, the public EventStreams endpoint at stream.wikimedia.org is configured in Varnish to allow only 25 concurrent connections per Varnish backend. There are 10 text Varnishes in codfw and 8 in eqiad, so the concurrent connection limit for EventStreams is 8 x 25 = 200 in eqiad and 10 x 25 = 250 in codfw, for a total of 450 concurrent connections. We have had incidents where a rogue client spawns too many connections. EventStreams has some primitive logic to try to reduce the number of concurrent connections from the same X-Client-IP, but this will not fully prevent the issue. If new connections receive a 502 error from Varnish, check the total number of connections at https://grafana.wikimedia.org/dashboard/db/eventstreams

Alerts

EventStreams has a monitoring check that verifies the /v2/stream/recentchange URL is producing data. The check runs against the public stream.wikimedia.org endpoint; if it fails, it is likely that all backend service processes have the same issue.
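The check can be reproduced manually:

# Expect at least one data line within the timeout if the stream is healthy.
curl -s --max-time 10 -H 'Accept: text/event-stream' https://stream.wikimedia.org/v2/stream/recentchange | grep -m1 '^data:'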

Incidents