
From Wikitech

April 2025 - Webrequest Varnish to HAProxy Migration

The Data Engineering and Traffic teams have been collaborating on migrating the Webrequest dataset from VarnishKafka to HAProxyKafka. We have extensively evaluated both data streams in parallel and assessed downstream impacts as much as possible. This document records the observed changes between the two data streams and their impact.

We migrated the Webrequest dataset to feed from HAProxy on Tuesday, April 1, 2025.

We encourage anyone depending on this data pathway to review and send feedback.

Background

The Traffic team has evolved the WMF CDN infrastructure, making HAProxy the "front-most L7 layer for most of our traffic". This allows moving the source of the Webrequest analytics dataset from Varnish to HAProxy, yielding more accurate data. The Traffic and Data Engineering teams have made an effort to provide an exactly matching dataset, but some changes are expected, affecting not only Webrequest but also downstream metrics, among them the top-level visibility metrics pageviews and unique devices. Even though the change is ultimately an increase in precision for our metrics, we wish to provide detailed explanations for the variations.

TLS redirects, rate-limiting responses and a few other rows are included in the new dataset

The redirection from http:// to https:// to secure traffic happens at the HAProxy level, so those rows were missing when the data was sourced from Varnish. Our analysis shows roughly a doubling of the number of 301 redirects, from fewer than 10 million rows to between 15 and 25 million rows (see TLS Redirects appendix).

Those new rows don't directly impact the top-level metrics, as redirects are not counted as pageviews. They can have an indirect impact on automated-traffic detection, though, since actors issuing a large number of redirects are classified as automated. However, our analysis has not shown any significant change to automated-traffic detection due to the redirects.


Similarly, rate-limiting responses (HTTP code 429) happen mostly at the HAProxy level, and we now have many more rows of this form in the dataset. There is a lot of variability in when we observe these, so it's not possible to estimate how many rows we'll have, but we know they don't affect pageviews or unique devices, as the 429 response code is not considered for those metrics.

We also observe a small number (about 0.1%) of additional rows in the HAProxy dataset, mostly requests that HAProxy doesn't manage to process correctly (HTTP response code -1, see Other codes present in HAProxy only appendix). Those requests don't impact our metrics, as their response code is not a valid one.

Some uri_host and accept_language values are changed and no longer normalized

In Varnish we try to maximize the number of requests served from the cache instead of from lower-level content services (a higher cache-hit rate). This leads to various changes made by Varnish to the request data, notably for our case some normalization of the uri_host and accept_language fields to reduce their variability. The data from HAProxy gives us access to the more variable non-normalized values sent by the client, making the dataset more accurate in terms of client data.

The biggest change is that Varnish uses the en.wikipedia.org domain to serve static resources (logos and other predefined images) and API calls for all other domains. This makes for a large difference when comparing per-uri_host counts from HAProxy vs. Varnish: en.wikipedia.org has far fewer rows in the HAProxy dataset, while all other domains have somewhat more.

This doesn't impact pageviews or unique devices, as static-resource and API-call URLs are not counted as pageviews.


The number of rows from HAProxy having a uri_host value not present in the Varnish data is less than 0.1% of all webrequests (a few hundred thousand per hour in absolute terms). In addition, most of the "new" domains are capitalized versions of a domain, or regular domains with a trailing dot, so the metrics normalization takes care of them.

The non-normalized domains in the webrequest dataset have no visible impact on our metrics.
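To illustrate, a minimal sketch (not WMF's actual normalization code) of the kind of uri_host normalization that maps the two "new" variants described above back to their canonical domain:

```python
def normalize_uri_host(uri_host: str) -> str:
    # Lowercase capitalized variants (e.g. "EN.Wikipedia.org")
    host = uri_host.lower()
    # Drop the trailing dot of fully-qualified forms (e.g. "en.wikipedia.org.")
    return host.rstrip(".")

# Both variants collapse to the canonical domain:
assert normalize_uri_host("EN.Wikipedia.org") == "en.wikipedia.org"
assert normalize_uri_host("en.wikipedia.org.") == "en.wikipedia.org"
```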


The number of rows from HAProxy having an accept_language value not present in the Varnish data, or vice versa, is less than 0.1% of all webrequests (a few hundred thousand per hour in absolute terms). In addition, according to our analysis, the normalized accept_language values in Varnish occur mostly on static resources (logos and other predefined images) and API calls. Finally, the accept_language value sent to us from HAProxy is cropped at 1024 characters, making a small number of rows not match the (uncropped) values sent from Varnish.

The accept_language variations have no visible impact on our metrics despite changing the actor fingerprint we use for automated-traffic detection.
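A hypothetical sketch of the cropping difference described above: HAProxy truncates the header at 1024 characters while Varnish passed it through uncropped, so only unusually long values fail to match between the two datasets.

```python
MAX_ACCEPT_LANGUAGE = 1024  # crop length reported for HAProxy

def haproxy_accept_language(header_value: str) -> str:
    # HAProxy-side value: truncated at the crop length
    return header_value[:MAX_ACCEPT_LANGUAGE]

long_value = "en-US,en;q=0.9," + "x" * 2000  # unusually long header
assert haproxy_accept_language(long_value) != long_value            # cropped: mismatch
assert haproxy_accept_language("en-US,en;q=0.9") == "en-US,en;q=0.9"  # typical values match
```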

A few HTTP response codes change

We have noticed that about 0.3% of rows, mostly with HTTP response code 200 (other codes such as 302, 202 and 304 are also affected, but in very small numbers), are transformed: their http_status field is set to 400, and their cache_status, content_type, accept_language and x_analytics fields are set to the dummy value -. The Traffic team is aware of this behavior (task T387451) and will work on reducing it if possible, but it is nonetheless expected that some requests passed from HAProxy to Varnish end up in different states like this.

The upside of this change is that the client sees the HAProxy response, not the Varnish one, so our data is actually more accurate with the 400 responses than it was with the 200s. The downside is that this impacts our metrics, as pageviews are extracted from rows with HTTP response code 200.

We see a pageview reduction of slightly less than the 0.3% change observed in webrequest, and the same for unique devices. This ratio can be much larger when slicing the data on dimensions where some values have a small number of requests, for instance domains or countries with low traffic.
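A minimal sketch (with hypothetical field names, not the actual refinement pipeline) of why the 200-to-400 flip lowers pageview counts: pageview extraction keeps only rows whose http_status is 200, so a row rewritten to 400 silently drops out of the candidate set.

```python
rows = [
    {"http_status": 200, "uri_host": "en.wikipedia.org"},
    {"http_status": 400, "uri_host": "en.wikipedia.org"},  # was 200 under Varnish
    {"http_status": 301, "uri_host": "de.wikipedia.org"},  # TLS redirect, never a pageview
]

# First filter of pageview extraction: only successful (200) responses qualify
pageview_candidates = [r for r in rows if r["http_status"] == 200]
assert len(pageview_candidates) == 1
```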