Logs/Runbook
Introduction
This page outlines some useful techniques that can be used to help diagnose issues with Wikimedia sites, based on the various application and infrastructure logs available.
The choice of which technique(s) to employ will largely depend on the nature of the situation.
Ad-Hoc Analysis
Sometimes it is useful to be able to perform ad-hoc analysis of a real-time incident, by viewing a live log file of certain events and filtering it according to your needs. The following examples may be adapted to your specific requirements.
Webrequests Sampled
Superset dashboards
There are two useful dashboards available in Superset for analyzing webrequests:
- Live data (from now until 24h ago): https://superset.wikimedia.org/superset/dashboard/webrequest-live/
- Historical data (from ~1h ago to 1 month ago): https://superset.wikimedia.org/superset/dashboard/webrequest-128/
The dashboards are organized into several tabs, each showing different data; see the Help tab for more details on how to use them.
Log files
The log file with sampled webrequests is available on centrallog1001 and centrallog2002 in the file: /srv/log/webrequest/sampled-1000.json
As the name suggests, 1 in 1,000 requests is extracted from the Kafka stream and retained in this file. Each file contains one day's logs, and 62 days' worth of older logs are stored in /srv/log/webrequest/archive
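Each line is a standalone JSON object, so the file can be inspected directly with jq. As a minimal example, the most recent sampled entries can be shown with only a few of the available fields (the same fields used in the examples further down):
$ tail -n 5 /srv/log/webrequest/sampled-1000.json | jq '{dt, uri_host, uri_path, http_status}'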
Aggregated data
On the hosts with OS Bullseye or newer (centrallog2002 as of Nov. 2022), there is a json-webrequests-stats script installed to gather statistics from the logs.
$ json-webrequests-stats -h
usage: json-webrequests-stats [-h] [-n NUM] [-c {text,upload,all}] [-q [QUERIES ...]]
Script to parse sampled-1000.json or 5xx.json logs and report aggregated results.
The script expects as standard input a subset of lines from /srv/weblog/webrequest/*.json logs.
It uses the Python library `gjson` [1] that is the Python porting of the Go library GJSON [2] and accepts the same
syntax [3] for manipulating JSON objects.
Example usage:
# Get stats from the live traffic
tail -n 100000 /srv/weblog/webrequest/sampled-1000.json | json-webrequests-stats
# Save the current live traffic to work on the same set of data while refining the search
tail -n 100000 /srv/weblog/webrequest/sampled-1000.json > ~/sampled.json
cat ~/sampled.json | json-webrequests-stats
# There is some interesting traffic with a specific pattern, filter by it and get the statistics relative to only
# that specific traffic
cat ~/sampled.json | json-webrequests-stats -n 20 -c text -q 'uri_path="/w/api.php"'
# Apply multiple filters to narrow down the search
cat ~/sampled.json | json-webrequests-stats -n 20 -c text -q 'uri_path="/w/api.php"' 'user_agent%"SomeBot.*"'
# Get stats from the live 5xx error logs
tail -n 10000 /srv/weblog/webrequest/5xx.json | json-webrequests-stats
optional arguments:
-h, --help show this help message and exit
-n NUM, --num NUM How many top N items to return for each block. (default: 10)
-c {text,upload,all}, --cdn {text,upload,all}
For which CDN to show the stats. (default: all)
-q [QUERIES ...], --queries [QUERIES ...]
A GJSON additional array query to use to pre-filter the data, without the parentheses required
by the GJSON syntax [4], e.g.: -q 'uri_path="/w/api.php"'. Accepts multiple values, e.g.:
-q 'uri_path="/w/api.php"' 'uri_host="en.wikipedia.org"'. (default: None)
[1] Python gjson: https://volans-.github.io/gjson-py/index.html
[2] Go GJSON: https://github.com/tidwall/gjson/blob/master/README.md
[3] GJSON Syntax: https://github.com/tidwall/gjson/blob/master/SYNTAX.md
[4] GJSON Query syntax: https://github.com/tidwall/gjson/blob/master/SYNTAX.md#queries
The script provides the aggregated data (top N) for a set of request fields, and also the aggregated data obtained by summing the response_size (how much bandwidth) and the time_firstbyte (how much MediaWiki time) for each unique value of those fields, again reporting the top N.
This is an example output, with IPs and UAs redacted and using the -n 3 CLI argument for brevity.
Once a specific pattern is found, it is possible to filter the data using the -q/--queries CLI argument to get the statistics for only the filtered traffic; this usually allows finding more specific patterns (UA, IPs, etc.) fairly quickly.
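For example, after saving the live data as shown in the help text above, the statistics can be re-run against only the traffic from a single client IP or user agent spotted in the top N. The IP and user agent below are hypothetical placeholders; substitute the values found in your own output.
$ cat ~/sampled.json | json-webrequests-stats -n 20 -c text -q 'ip="192.0.2.10"'
$ cat ~/sampled.json | json-webrequests-stats -c text -q 'user_agent%"ExampleBot.*"'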
Grep-able output
$ jq -r "[.uri_path,.hostname,.user_agent,.ip] | @csv" /srv/log/webrequest/sampled-1000.json
Select all requests from public_cloud networks that received a 429
$ tail -n10000 /srv/weblog/webrequest/sampled-1000.json | jq -r 'select(.http_status == "429") | select(.x_analytics | contains("public_cloud=1"))'
Select all requests with a specific user_agent and referer (in this example, both empty, i.e. "-")
$ jq -r 'if .user_agent == "-" and .referer == "-" then [.uri_path,.hostname,.user_agent,.ip] else empty end | @csv' /srv/log/webrequest/sampled-1000.json
List of the top 10 IPs by response size
$ jq -r '.ip + " " + (.response_size | tostring)' /srv/log/webrequest/sampled-1000.json| awk '{ sum[$1] += $2 } END { for (ip in sum) print sum[ip],ip }' | sort -nr | head -10
Select logs matching specific HTTP status, datestamp prefix, host, and uri_path, outputting the top query parameters found
$ tail -n300000 /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | select(.uri_path == "/") | .uri_query' | sort | uniq -c | sort -gr | head
5xx errors
Most of the queries for the sampled-1000 log would work here as well.
Grepable
$ tail -f /srv/log/webrequest/5xx.json | jq "[.uri_host, .uri_path, .uri_query, .http_method, .ip, .user_agent] | @csv"
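To get an aggregate view instead of individual lines, the same log can be summarized with standard shell tools; a minimal sketch counting which host/path combinations are currently producing the most 5xx responses:
$ tail -n 10000 /srv/log/webrequest/5xx.json | jq -r '.uri_host + " " + .uri_path' | sort | uniq -c | sort -nr | head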
MediaWiki
All IPs which have made more than 100 large requests
$ awk '$2>60000 {print $11}' /var/log/apache2/other_vhosts_access.log | sort | uniq -c | awk '$1>100 {print}'
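A variation on the same approach (assuming, as in the example above, that field 11 of other_vhosts_access.log holds the client IP) lists the busiest client IPs regardless of request size:
$ awk '{print $11}' /var/log/apache2/other_vhosts_access.log | sort | uniq -c | sort -nr | head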
Retrospective Analysis
When the situation calls for analysis of more historical data, or access to the complete set of data, the Analytics systems can help.
Turnilo
Turnilo has access to the sampled webrequest dataset, which is loaded into Druid every hour. As the name suggests, this dataset samples 1 in 128 requests.
Data Lake
The primary source for webrequest logs is the Data Lake and the Analytics/Data Lake/Traffic/Webrequest tables in Hive.
These tables are updated hourly and may be queried using Hive, Presto, or Spark.
Please see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Sample_queries for some sample queries using Hive.
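As a rough illustration only (verify the current table name, partition fields, and recommended query engine on the page above before running), a Hive query over a single hour of webrequest data might look like:
$ beeline -e "SELECT uri_host, COUNT(*) AS hits FROM wmf.webrequest WHERE webrequest_source='text' AND year=2024 AND month=1 AND day=1 AND hour=0 GROUP BY uri_host ORDER BY hits DESC LIMIT 10;"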