User:Joal/WDQS Traffic Analysis
Analysis of WDQS traffic on the public and internal clusters. The charts and data have been computed in Jupyter notebooks running Spark on the Analytics Hadoop cluster. The processed data consists of events sourced through the Modern Event Platform. Originally written in March 2020; re-run with June 2020 data for the current version of the charts.
Global traffic information
HTTP response codes
Most requests, on both internal and external traffic, generate HTTP response code 200 (success). The rest of this analysis considers only requests that ended with a 200 response code, except where explicitly stated otherwise.
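As an illustration, this filtering can be expressed in a few lines of PySpark. This is a minimal sketch, not the exact notebook code: the table names, the year/month partition columns, and the http.status_code field are assumptions about the Modern Event Platform schemas.

```python
# Minimal sketch: load June 2020 WDQS events and keep only 200 responses.
# Table and field names are assumptions about the event schemas.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def ok_requests(table):
    # Load one month of events and keep only HTTP 200 responses.
    return (
        spark.table(table)
        .where((F.col("year") == 2020) & (F.col("month") == 6))
        .where(F.col("http.status_code") == 200)
    )

ok_external = ok_requests("event.wdqs_external_sparql_query")
ok_internal = ok_requests("event.wdqs_internal_sparql_query")
```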
Note: The scales of the two charts are different - see the next section for a comparison of the number of requests across clusters.
Public vs Internal
The number of 200 requests to the public cluster is about half the number of requests to the internal one.
Distinct queries
The number of daily distinct queries for the public cluster is about 61% of its total number of queries for June 2020, and 27% for the internal cluster. This means that each query is repeated on average 0.65 times per day on the public cluster, and 2.75 times per day on the internal one.
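The repetition factor follows directly from the distinct fraction: total/distinct - 1 extra executions per query. A hedged PySpark sketch, assuming `query` and `meta.dt` field names:

```python
# Daily distinct-query fraction and implied repetition factor (sketch).
# `query` and `meta.dt` are assumed field names.
from pyspark.sql import functions as F

daily_repetition = (
    ok_external
    .withColumn("day", F.to_date("meta.dt"))
    .groupBy("day")
    .agg(
        F.count("query").alias("total_queries"),
        F.countDistinct("query").alias("distinct_queries"),
    )
    # repetition = total/distinct - 1 (about 0.65/day for the public cluster)
    .withColumn(
        "repetition",
        F.col("total_queries") / F.col("distinct_queries") - 1,
    )
)
```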
Query-time
One reason analyzing query-time is interesting is that it serves as a proxy for resource usage in the backend system: a long query presumably uses more computation resources than a fast one.
Public vs Internal
Despite serving about twice as many requests as the public cluster, the internal cluster has a daily sum of query-time about 10 times smaller than the public cluster's.
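A sketch of the per-cluster daily query-time aggregation behind this comparison, assuming a `query_time` field in milliseconds:

```python
# Daily total query-time per cluster, in seconds (sketch).
# `query_time` (milliseconds) and `meta.dt` are assumed field names.
from pyspark.sql import functions as F

def daily_query_time(df):
    return (
        df.withColumn("day", F.to_date("meta.dt"))
        .groupBy("day")
        .agg((F.sum("query_time") / 1000.0).alias("query_time_s"))
    )

public_daily = daily_query_time(ok_external)
internal_daily = daily_query_time(ok_internal)
```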
Query-time classes
It is interesting to note that for the public cluster, requests taking more than 10s represent a very small share of requests but most of the processing time.
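One way to produce such classes is to bucket each request by its query-time and aggregate both request counts and total time per bucket. A sketch with illustrative thresholds (the actual chart buckets may differ):

```python
# Bucket requests into query-time classes and compare request count
# versus total processing time per class (sketch, illustrative thresholds).
from pyspark.sql import functions as F

classed = ok_external.withColumn(
    "time_class",
    F.when(F.col("query_time") < 1000, "< 1s")
     .when(F.col("query_time") < 10000, "1s - 10s")
     .otherwise("> 10s"),
)
per_class = classed.groupBy("time_class").agg(
    F.count("*").alias("requests"),
    (F.sum("query_time") / 1000.0).alias("total_time_s"),
)
```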
Note: These charts are generated using Google Docs, as the chart system used in the notebooks doesn't support dual-axis charts.
Correlations (or not)
For the internal cluster, the sum of query-time is visually strongly correlated with the number of requests made (a known query-class performing at a regular speed). For the public cluster, there is no such correlation, due to the variety of query classes (and implementations). Similarly, there is no visually noticeable correlation between query-time and request length (number of characters in the request), meaning that query length is not a good enough predictor of query complexity.
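The visual impression can be double-checked numerically, for instance with Pearson's r on the daily aggregates. A sketch:

```python
# Pearson correlation between daily request count and daily query-time
# on the internal cluster (sketch).
from pyspark.sql import functions as F

daily_internal = (
    ok_internal
    .withColumn("day", F.to_date("meta.dt"))
    .groupBy("day")
    .agg(
        F.count("*").alias("requests"),
        F.sum("query_time").alias("query_time_ms"),
    )
)
r = daily_internal.stat.corr("requests", "query_time_ms")
```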
User agents
In this section, log scales have been applied to some charts to make small values easier to read. Look at the scales!
Public vs Internal
For the public cluster, the number of daily distinct user-agents is quite variable, with most values between 20,000 and 30,000. The user-agents querying the internal cluster are the expected ones, namely the WikibaseQualityConstraints tools for Wikidata and commonswiki, and two tools checking that the service is up.
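A sketch of the distinct user-agent count per day; the access path to the raw user-agent header is an assumption about the event schema:

```python
# Daily distinct raw user-agents on the public cluster (sketch).
# The header access path is an assumed field layout.
from pyspark.sql import functions as F

ua = F.col("http.request_headers").getItem("user-agent")
daily_uas = (
    ok_external
    .withColumn("day", F.to_date("meta.dt"))
    .withColumn("ua", ua)
    .groupBy("day")
    .agg(F.countDistinct("ua").alias("distinct_user_agents"))
)
```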
Public cluster requests-count classes
On the public cluster, the number of user-agents making a single request per day is by far the highest, and as the number of daily requests grows, the number of distinct user-agents diminishes. One way to translate that into real-world usage is that a (relatively) small number of bots each make a lot of requests per day, while quite a few humans each make a small number of requests per day.
The second chart shows that most user-agents have made requests on a single day of June, fewer user-agents have made requests on 2 different days, and so on.
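Both charts derive from per-user-agent daily counts. A sketch bucketing the counts in powers of ten and counting active days:

```python
# Per user-agent daily request counts, bucketed in powers of ten,
# plus the number of distinct days each user-agent was active (sketch).
from pyspark.sql import functions as F

per_ua_day = (
    ok_external
    .withColumn("day", F.to_date("meta.dt"))
    .withColumn("ua", F.col("http.request_headers").getItem("user-agent"))
    .groupBy("ua", "day")
    .agg(F.count("*").alias("requests"))
)

# Buckets: 1 request -> 0, 2-10 -> 1, 11-100 -> 2, etc.
request_classes = (
    per_ua_day
    .withColumn("bucket", F.ceil(F.log10("requests")))
    .groupBy("day", "bucket")
    .agg(F.countDistinct("ua").alias("user_agents"))
)

active_days = per_ua_day.groupBy("ua").agg(
    F.countDistinct("day").alias("days_active"))
```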
Public cluster request-count and max-query-time classes
On the public cluster, most max-query-times above 10s come from user-agents making a small number of daily queries (1 to 10), while user-agents making large numbers of daily requests issue queries that mostly take less than 1s. No pattern emerges from looking at user-agents' max-query-time per day.
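The cross-tabulation can be sketched by adding the daily max query-time to the same per-user-agent aggregate as above:

```python
# Daily request count and max query-time per user-agent (sketch),
# same grouping as per_ua_day above with an extra aggregate.
from pyspark.sql import functions as F

ua_day_stats = (
    ok_external
    .withColumn("day", F.to_date("meta.dt"))
    .withColumn("ua", F.col("http.request_headers").getItem("user-agent"))
    .groupBy("ua", "day")
    .agg(
        F.count("*").alias("requests"),
        (F.max("query_time") / 1000.0).alias("max_query_time_s"),
    )
)
```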
User-agents with 1 daily request
Having so many user-agents making a single request per day made us wonder whether they were more bot-ish or user-ish. To check that, we used the fact that the raw user-agent string (the one used for the analysis above) is parsed into a map of predefined fields using the ua-parser library.
In the next chart, undefined user-agents are those for which no parsed field is set, meaning the user-agent is not a usually-parseable one (neither a browser user nor a regular bot).
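For illustration, here is how a raw user-agent string is parsed with the ua-parser Python library (the pre-1.0 `user_agent_parser` API; the example string is hypothetical):

```python
# Parse a raw user-agent string into OS / browser / device fields
# with ua-parser (pip install ua-parser; legacy user_agent_parser API).
from ua_parser import user_agent_parser

raw_ua = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.8 Chrome/18.0.1025.1"
parsed = user_agent_parser.Parse(raw_ua)

print(parsed["os"]["family"])          # OS family, e.g. a Windows XP entry
print(parsed["user_agent"]["family"],  # browser family and major version,
      parsed["user_agent"]["major"])   # e.g. Chrome 18
print(parsed["device"]["family"])      # 'Other' for desktop browsers
```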
Finally, checking the not-undefined user-agents for June 9th (a big peak in the number of distinct user-agents) shows that a large number of distinct user-agents share the same parsed user-agent value:
OS | Browser | Device | Unique user agents
---|---|---|---
Windows XP | IE 8 | Other | 1163
Windows XP | IE 9 | Other | 1131
Windows XP | Chrome 18 | Other | 626
Windows NT | IE 9 | Other | 619
Windows XP | Chrome 31 | Other | 618
Windows XP | Chrome 21 | Other | 613
Query concurrency
For this section, only public cluster data has been taken into consideration, and some charts use log scales. Also, requests with response code 500 have been included in this section's data, to account for queries timing out.
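A sketch of the adjusted filter, assuming the same table and field names as in the earlier sketches:

```python
# Public cluster only, keeping 500 responses to account for timeouts (sketch).
from pyspark.sql import functions as F

concurrency_events = (
    spark.table("event.wdqs_external_sparql_query")
    .where((F.col("year") == 2020) & (F.col("month") == 6))
    .where(F.col("http.status_code").isin(200, 500))
)
```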
In-flight queries
The first chart shows that for active hosts (wdqs1*, as opposed to wdqs2*) the number of in-flight requests peaks at 4 or 5, with a very flat long tail. The second chart shows that the sum of the query-time of in-flight requests has two modes: the smaller between 10ms and 100ms, and the larger between 100s and 1000s. Those two modes show that there are often one or two long-running queries being computed on the active hosts.
The next two charts show that it seldom happens that a backend host takes a long time to process its next query, and that even when processing long-running queries, the number of in-flight requests doesn't grow much. This means that processing long-running requests doesn't block smaller requests from being processed, which is good!
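One way to derive in-flight counts from the event data is a sweep line: each query contributes +1 at its start and -1 at its end, and a cumulative sum per host gives the concurrency. A hedged sketch, assuming `hostname`, `meta.dt`, and `query_time` (milliseconds) field names:

```python
# Sweep-line sketch for in-flight request counts per backend host.
# Ties at identical timestamps are resolved arbitrarily here.
from pyspark.sql import Window, functions as F

start_s = F.col("meta.dt").cast("timestamp").cast("double")  # epoch seconds
timed = (
    concurrency_events
    .withColumn("start_s", start_s)
    .withColumn("end_s", start_s + F.col("query_time") / 1000.0)
)
deltas = (
    timed.select("hostname", F.col("start_s").alias("ts"), F.lit(1).alias("delta"))
    .unionByName(
        timed.select("hostname", F.col("end_s").alias("ts"), F.lit(-1).alias("delta")))
)
w = (
    Window.partitionBy("hostname")
    .orderBy("ts")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
in_flight = deltas.withColumn("in_flight", F.sum("delta").over(w))
```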
Same repeated query
The query SELECT ?simbad ?item { ?item wdt:P3083 ?simbad VALUES ?simbad {}} has been repeated 150,794 times in June 2020, with an average query-time of 76ms. We use this query as a baseline and show that there is no noticeable correlation between query concurrency and the repeated query's processing time.