SLO/logstash

From Wikitech

SLO Worksheet - Logstash

Service

Logstash is a free and open server-side data processing pipeline that ingests data from multiple sources, transforms it, and then outputs it for search. In our infrastructure Logstash is a component of the logging pipeline, which consists of Kafka -> Logstash -> OpenSearch <- OpenSearch Dashboards.

Teams

Logstash is owned by the SRE Observability team, which is responsible for operation, scalability, and software updates. Contact: sre-observability@wikimedia.org and https://office.wikimedia.org/wiki/Contact_list#Observability

Architectural

Logstash runs as two clusters per site:

  • A production cluster which consumes logs from Kafka, transforms them, and outputs to OpenSearch.
  • A barebones legacy cluster which ingests logs directly via TCP/UDP and outputs them to Kafka for consumption by the production cluster.

Hard Dependencies

  • OpenSearch - This is where log data is stored. Logstash will block if OpenSearch becomes unavailable.
  • Kafka - Logstash ingests log messages from the kafka-logging cluster.
  • Hardware - Dedicated servers, Ganeti instances, and networking.
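
The blocking behavior described for the OpenSearch dependency can be illustrated with a toy model. This is a minimal sketch, not a model of the real pipeline: the function name `simulate_lag`, the per-tick rates, and the outage pattern are all hypothetical. It assumes producers keep writing to Kafka at a constant rate while Logstash drains the backlog only when OpenSearch accepts writes.

```python
def simulate_lag(produced_per_tick, capacity_per_tick, opensearch_up):
    """Return the Kafka consumer lag after each tick of a toy pipeline model.

    All parameters are hypothetical illustration values, not production figures.
    """
    lag = 0
    history = []
    for up in opensearch_up:
        # Producers keep writing to Kafka regardless of downstream health.
        lag += produced_per_tick
        if up:
            # Logstash consumes only while OpenSearch accepts its output.
            lag -= min(lag, capacity_per_tick)
        history.append(lag)
    return history


# Three ticks of OpenSearch unavailability in the middle of the run.
print(simulate_lag(100, 150, [True, True, False, False, False, True, True]))
# prints [0, 0, 100, 200, 300, 250, 200]
```

In this model the lag climbs for as long as OpenSearch is unavailable and shrinks afterwards only as fast as the spare consumption capacity allows.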

Soft Dependencies

none

Client-facing

Clients

| Software | Use | Connection | Interval | Failure mode (Logstash down) |
|---|---|---|---|---|
| Kafka | Aggregates and queues log messages for consumption by Logstash | Pull via TCP | Continuous | Kafka consumer lag will spike and alarm. |
| OpenSearch | Storage/archival of log data for search | Push via TCP | Continuous | Logstash will block and stop consuming log events; Kafka consumer lag will spike and alarm. |
| SCAP | Pre-flight error checks to support deployments | Pull via TCP using logstash_checker.py in puppet | Manual | False negative/positive result during the deploy pre-flight check. |

Service Level Indicators (SLIs)

Errors - Percentage of logs which fail to be indexed by OpenSearch

Availability - Percentage of time Logstash is handling logs minute-to-minute
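
As a rough illustration, the two SLIs above could be computed from raw counters as follows. This is a sketch under stated assumptions: the function names and inputs are hypothetical, and "handling logs minute-to-minute" is interpreted here as "at least one event was processed in a given one-minute window", which the worksheet does not spell out.

```python
def error_sli(failed_events, total_events):
    """Percentage of events that failed to be indexed by OpenSearch."""
    if total_events == 0:
        return 0.0  # no traffic means no observed failures
    return 100.0 * failed_events / total_events


def availability_sli(per_minute_event_counts):
    """Percentage of one-minute windows in which Logstash handled logs.

    Assumption: a minute counts as available if at least one event was processed.
    """
    if not per_minute_event_counts:
        raise ValueError("need at least one minute of data")
    handled = sum(1 for count in per_minute_event_counts if count > 0)
    return 100.0 * handled / len(per_minute_event_counts)


print(error_sli(50, 10_000))                   # prints 0.5
print(availability_sli([120, 98, 0, 130]))     # prints 75.0
```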

Monitoring

Logstash is monitored via a suite of health checks and metrics, including:

  • Icinga checks - Host based service up/down checks
  • Kafka consumer lag - Is Logstash able to consume logs from the Kafka queue as fast as (or faster than) they appear, or is the Kafka queue growing faster than Logstash (and OpenSearch) can process it?
  • OpenSearch indexing failures - Is Logstash able to output events to OpenSearch, or do a significant number of log messages fail to be stored in OpenSearch?
  • Logstash event rate today vs. yesterday - Is the overall log volume significantly higher or lower than 24h ago?
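
The consumer lag and event rate questions above reduce to simple comparisons. The sketch below is illustrative only: the function names, the sample values, and the 50% tolerance are hypothetical and are not the thresholds used by the actual alerts.

```python
def lag_is_growing(lag_samples):
    """True if consumer lag increased between every pair of consecutive samples."""
    if len(lag_samples) < 2:
        return False
    return all(later > earlier for earlier, later in zip(lag_samples, lag_samples[1:]))


def rate_deviates(today_rate, yesterday_rate, tolerance=0.5):
    """True if today's event rate differs from yesterday's by more than `tolerance`.

    `tolerance` is a fraction of yesterday's rate (0.5 means 50%); hypothetical value.
    """
    if yesterday_rate == 0:
        return today_rate > 0
    return abs(today_rate - yesterday_rate) / yesterday_rate > tolerance


print(lag_is_growing([10, 50, 200]))   # prints True
print(rate_deviates(1600, 1000))       # prints True
print(rate_deviates(1100, 1000))       # prints False
```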

Deployment

Logstash is installed from a Debian package, and its configuration is deployed via Puppet.

Service Level Objectives

  • Errors - 99.5% of events are indexed successfully, per datacenter. Log producers may emit invalid log messages which cannot be parsed and are dropped, may exceed rate limits, or may output excessive amounts of logs that cannot reasonably be ingested.
  • Availability - 99.95% of the time, per datacenter, Logstash is operational and actively processing logs
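
To make these targets concrete, the implied error budgets can be worked out with simple arithmetic. This is a minimal sketch: the 30-day window and the one-million-event volume are hypothetical examples, since the worksheet does not state a reporting window or a traffic figure.

```python
def downtime_budget_minutes(slo_percent, window_days):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def failed_event_budget(slo_percent, total_events):
    """Number of events that may fail to be indexed without breaching the SLO."""
    return total_events * (100.0 - slo_percent) / 100.0


# Availability SLO of 99.95% over a hypothetical 30-day window.
print(round(downtime_budget_minutes(99.95, 30), 1))   # prints 21.6
# Errors SLO of 99.5% over a hypothetical one million events.
print(failed_event_budget(99.5, 1_000_000))           # prints 5000.0
```

Under these assumptions, each datacenter could tolerate roughly 21.6 minutes of Logstash unavailability per 30 days, and 5,000 unindexed events per million.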