Cache log format
The Wikimedia cache log format is a custom format which is used on webserver caches to provide logs in a uniform format. See also Apache log format for Apache logs.
Format used in cache config files
In general, missing fields are not empty, but get represented by -
.
Nginx Format
See squid_combined
format of templates/nginx/nginx.conf.erb of operations/puppet. As of 2014-01-25, we use
$hostname $udplog_sequence $udplog_time $request_time $remote_addr -/$status $bytes_sent $request_method $scheme://$host$request_uri NONE/$proxy_host $content_type $http_referer $http_x_forwarded_for $http_user_agent $http_accept_language $sent_http_x_analytics
Documentation of the format specifiers is available in the nginx documentation.
Squid Format
Setting the custom format for the logs relies on the logformat directive available in Squid 2.6. The %sn format code is custom to the Wikimedia Foundation squids. For more information about our squid log setup, see Squid logging.
$hostname %sn %{%Y-%m-%dT%H:%M:%S}tg.%03tu %tr %>a %Ss/%03Hs %<st %rm %ru %Sh/%<A %mt %{Referer}>h %{X-Forwarded-For}>h %{User-Agent}>h %{X-CS}>h
Documentation of the format specifiers is available in squid's logformat documentation.
Squids have been phased out, and the last line in Squid log format got logged on 2013-12-17T15:45:32.764.
Varnishncsa Format
See LOG_FORMAT
of modules/varnish/files/varnishncsa.default of operations/puppet. As of 2013-09-30, we use
%l %n %t %{Varnish:time_firstbyte}x %h %{Varnish:handling}x/%s %b %m http://%{Host}i%U%q - %{Content-Type}o %{Referer}i %{X-Forwarded-For}i %{User-agent}i %{Accept-Language}i %{X-Analytics}o
Documentation of the format specifiers is available in the varnishncsa documentation.
If a header is set multiple times (like two User-Agent
s) for a request, varnishncsa only logs the last one.
Invalid UTF-8 characters are copied byte-wise. So a line need not be a valid UTF-8 string.
Varnishkafka Format
See varnishkafka::instance
's format
of manifests/role/cache.pp of operations/puppet. As of 2015-01-10, we use
%{fake_tag0@hostname?${::fqdn}}x %{@sequence!num?0}n %{%FT%T@dt}t %{Varnish:time_firstbyte@time_firstbyte!num?0.0}x %{@ip}h %{Varnish:handling@cache_status}x %{@http_status}s %{@response_size!num?0}b %{@http_method}m %{Host@uri_host}i %{@uri_path}U %{@uri_query}q %{Content-Type@content_type}o %{Referer@referer}i %{X-Forwarded-For@x_forwarded_for}i %{User-Agent@user_agent}i %{Accept-Language@accept_language}i %{X-Analytics@x_analytics}o %{Range@range}i
Invalid UTF-8 characters get replaced by 0xEF
0xBF
0xBD
(i.e.: Unicode Replacement Character (Codepoint U+FFFD) in UTF-8).
Log entry
Log entries within a log file are separated by a newline character.
Log fields within a log entry are separated by tab character.
As some fields may (and sometimes do) contain tabs (e.g.: “User-Agent” sometimes comes with tabs), they may screw up column numbering. So make sure machinery to read those logs can handle that gracefully.
Fields
Field# in csvs | JSON key | Description | Details | Avg. len in bytes for November 2013 | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | hostname
|
Hostname of the cache | 22 | |||||||||||||||||||||
2 | sequence
|
Sequence number | Per host, this integer number should be strictly increasing by 1 per request. However, it overruns. It may be a negative number. | 10 | ||||||||||||||||||||
3 | dt
|
time stamp | The current time in ISO 8601 format, according to the cache's clock
|
19 | ||||||||||||||||||||
4 | time_firstbyte
|
Request service time |
|
10 | ||||||||||||||||||||
5 | ip
|
Client IP | You might want to override this IP with X-Forwarded-For header (see below). | 13 | ||||||||||||||||||||
6 | n/a | Cache request status, HTTP status code | caching information from this field isn't very useful regardless of if its coming from squid or varnish, on account of being logged from the frontend routing instance, despite most cache hits come from the backend. The field is mostly just useful for the status code, though varnish states like pass may be of interest at times as well.
Squid and varnish use different texts to represent the same situations. For example:
For November 2013, we saw 54.05% hits, 38.61% misses, 4.88% passes, 2.42% - for cache results, and for HTTP status codes we saw 82.82% 200, 8.12% 304, 7.72% 302, 0.49% 301, 0.39% 404, and 0.28% 206. See the |
8 | ||||||||||||||||||||
n/a | cache_status
|
Cache request status | (see notes in the “Cache request status, HTTP status code” row) | |||||||||||||||||||||
n/a | http_status
|
Http status code | (see notes in the “Cache request status, HTTP status code” row) | |||||||||||||||||||||
7 | response_size
|
Reply size |
Content-Encoding affects this number. So if the payload gets for example gzipped, it contains the size of the gzipped payload, not the uncompressed one. While in varnishncsa format, both |
3 | ||||||||||||||||||||
8 | http_method
|
Request method | GET , POST , etc
For November 2013, we saw 99.14% GET, 0.60% POST, 0.22% HEAD, and 0.04% other request methods. |
3 | ||||||||||||||||||||
9 | n/a | URL | See the uri_host , uri_path , and uri_query fields for the JSON representation of this field
|
112 | ||||||||||||||||||||
n/a | uri_host
|
Host of the requested URL | This field is the plain host without protocol. | |||||||||||||||||||||
n/a | uri_path
|
Path of the requested URL | ||||||||||||||||||||||
n/a | uri_query
|
Query of the requested URL | This field includes the leading ? .
|
|||||||||||||||||||||
10 | n/a | Cache hierarchy status, peer IP | 2 | |||||||||||||||||||||
11 | content_type
|
MIME content type | 14 | |||||||||||||||||||||
12 | referer
|
Referer header | 57 | |||||||||||||||||||||
13 | x_forwarded_for
|
X-Forwarded-For header | This field should only be respected if the client IP (see above) is a known good IP (i.e.: in $wgSquidServersNoPurge of wmf-config/squid.php of operations/mediawiki-config). | 3 | ||||||||||||||||||||
14 | user_agent
|
User-Agent header | This field may (but need not be) URL encoded. As of September 2013, about half of the lines are URL encoded for the Request stream. | 108 | ||||||||||||||||||||
15 | accept_language
|
Accept_Language | 16 | |||||||||||||||||||||
16 | x_analytics
|
X-Analytics | General purpose header for instrumentation, which received it's own page at X-Analytics. | 1 | ||||||||||||||||||||
n/a | range
|
Range | The Range header of the request.
|
Example
The client IP address has been replaced with 1.2.3.4 to protect the innocent:
Nginx example
ssl1001 129960997 2014-01-21T08:36:33.097 0.426 1.2.3.4 -/200 12324 GET https://en.wikipedia.org/w/index.php?title=Kirschkuchen&action=edit§ion=8 NONE/wikimedia - https://en.wikipedia.org/wiki/Kirschkuchen - Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36 en-US,en;q=0.8 -
Squid example
sq18.wikimedia.org 1715898 2010-12-01T21:57:22.331 0 1.2.3.4 TCP_MEM_HIT/200 13208 GET http://en.wikipedia.org/wiki/Main_Page NONE/- text/html - - Mozilla/4.0%20(compatible;%20MSIE%206.0;%20Windows%20NT%205.1;%20.NET%20CLR%201.1.4322) en-US -
Varnishncsa example
cp1048.eqiad.wmnet 8883921154 2013-09-26T06:28:16 0.001308203 1.2.3.4 hit/200 52362 GET http://en.wikipedia.org/wiki/Free_software - text/html http://en.wikipedia.org/wiki/Freedom 4.3.2.1 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36 en-US,en;q=0.8 -
Varnishkafka example
{"hostname":"cp1067.eqiad.wmnet","sequence":396385264,"dt":"2015-01-10T13:45:21","time_firstbyte":0.000140428,"ip":"2.3.4.5","cache_status":"hit","http_status":"200","response_size":27564,"http_method":"GET","uri_host":"en.wikipedia.org","uri_path":"/wiki/Foo","uri_query":"","content_type":"text/html; charset=UTF-8","referer":"http://en.wikipedia.org/wiki/Bar","x_forwarded_for":"-","user_agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko","accept_language":"en-US","x_analytics":"php=hhvm","range":"-"}
Usage
This format is used by the following streams:
So if you (plan to) change/update the format, please notify the owners of those streams.
Formerly (until around 2015):
Issues
Some issues that you should be aware of when you use these files.
A small number of requests log the host twice, for example:
http://en.m.wikipedia.orghttp://en.m.wikipedia.org/favicon.ico
Why does this happens and why this is not a problem is explained is this bug: [1]