User:Joal/WDQS Traffic Analysis
Analysis of WDQS traffic on the public and internal clusters. The charts and data have been computed in Jupyter notebooks running Spark on the Analytics Hadoop cluster. The processed data consists of events sourced through the Modern Event Platform. Originally written in March 2020; re-run with June 2020 data for the current version of the charts.
Global traffic information
HTTP response codes
Most requests, on both internal and external traffic, generate HTTP response code 200 (success). The rest of this analysis considers only requests that ended with a 200 response code, except where explicitly stated otherwise.
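As an illustration, this filtering can be expressed in a few lines of PySpark. This is a minimal sketch, not the exact notebook code: the table names, the year/month partition columns, and the http.status_code field are assumptions about the Modern Event Platform schemas.

```python
# Minimal sketch: load June 2020 WDQS events and keep only 200 responses.
# Table and field names are assumptions about the event schemas.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def ok_requests(table):
    # Load one month of events and keep only HTTP 200 responses.
    return (
        spark.table(table)
        .where((F.col("year") == 2020) & (F.col("month") == 6))
        .where(F.col("http.status_code") == 200)
    )

ok_external = ok_requests("event.wdqs_external_sparql_query")
ok_internal = ok_requests("event.wdqs_internal_sparql_query")
```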
Note: The scales of the two charts are different - see the next section for a comparison of the number of requests across clusters.
Public vs Internal
The number of 200 requests to the public cluster is about half the number of requests to the internal one.
Distinct queries
The number of daily distinct queries for the public cluster is about 61% of its total number of queries for June 2020, and 27% for the internal cluster. This means that each query is repeated on average 0.65 times per day on the public cluster, and 2.75 times per day on the internal one.
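The repetition factor follows directly from the distinct fraction: total/distinct - 1 extra executions per query. A hedged PySpark sketch, assuming `query` and `meta.dt` field names:

```python
# Daily distinct-query fraction and implied repetition factor (sketch).
# `query` and `meta.dt` are assumed field names.
from pyspark.sql import functions as F

daily_repetition = (
    ok_external
    .withColumn("day", F.to_date("meta.dt"))
    .groupBy("day")
    .agg(
        F.count("query").alias("total_queries"),
        F.countDistinct("query").alias("distinct_queries"),
    )
    # repetition = total/distinct - 1 (about 0.65/day for the public cluster)
    .withColumn(
        "repetition",
        F.col("total_queries") / F.col("distinct_queries") - 1,
    )
)
```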
Query-time
One reason analyzing query-time is interesting is that it serves as a proxy for resource usage in the backend system: a long query presumably uses more computation resources than a fast one.
Public vs Internal
Despite serving about twice as many requests as the public cluster, the internal cluster has a daily sum of query-time about 10 times smaller than the public cluster's.
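A sketch of the per-cluster daily query-time aggregation behind this comparison, assuming a `query_time` field in milliseconds:

```python
# Daily total query-time per cluster, in seconds (sketch).
# `query_time` (milliseconds) and `meta.dt` are assumed field names.
from pyspark.sql import functions as F

def daily_query_time(df):
    return (
        df.withColumn("day", F.to_date("meta.dt"))
        .groupBy("day")
        .agg((F.sum("query_time") / 1000.0).alias("query_time_s"))
    )

public_daily = daily_query_time(ok_external)
internal_daily = daily_query_time(ok_internal)
```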
Query-time classes
It is interesting to note that for the public cluster, requests taking more than 10s represent a very small share of requests but most of the processing time.
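One way to produce such classes is to bucket each request by its query-time and aggregate both request counts and total time per bucket. A sketch with illustrative thresholds (the actual chart buckets may differ):

```python
# Bucket requests into query-time classes and compare request count
# versus total processing time per class (sketch, illustrative thresholds).
from pyspark.sql import functions as F

classed = ok_external.withColumn(
    "time_class",
    F.when(F.col("query_time") < 1000, "< 1s")
     .when(F.col("query_time") < 10000, "1s - 10s")
     .otherwise("> 10s"),
)
per_class = classed.groupBy("time_class").agg(
    F.count("*").alias("requests"),
    (F.sum("query_time") / 1000.0).alias("total_time_s"),
)
```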
Note: These charts are generated using Google Docs, as the chart system used in the notebooks doesn't support dual-axis charts.
Correlations (or not)
For the internal cluster, the sum of query-time is visually strongly correlated with the number of requests made (a known query-class performing at a regular speed). For the public cluster, there is no such correlation, due to the variety of query classes (and implementations). Similarly, there is no visually noticeable correlation between query-time and request length (number of characters in the request), meaning that query length is not a good enough predictor of query complexity.
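The visual impression can be double-checked numerically, for instance with Pearson's r on the daily aggregates. A sketch:

```python
# Pearson correlation between daily request count and daily query-time
# on the internal cluster (sketch).
from pyspark.sql import functions as F

daily_internal = (
    ok_internal
    .withColumn("day", F.to_date("meta.dt"))
    .groupBy("day")
    .agg(
        F.count("*").alias("requests"),
        F.sum("query_time").alias("query_time_ms"),
    )
)
r = daily_internal.stat.corr("requests", "query_time_ms")
```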
User agents
In this section, log scales have been applied to some charts to make small values easier to read. Look at the scales!
Public vs Internal
For the public cluster, the number of daily distinct user-agents is quite variable, with most values between 20,000 and 30,000. The user-agents querying the internal cluster are the expected ones, namely the WikibaseQualityConstraints tools for Wikidata and commonswiki, and two tools checking that the service is up.
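A sketch of the distinct user-agent count per day; the access path to the raw user-agent header is an assumption about the event schema:

```python
# Daily distinct raw user-agents on the public cluster (sketch).
# The header access path is an assumed field layout.
from pyspark.sql import functions as F

ua = F.col("http.request_headers").getItem("user-agent")
daily_uas = (
    ok_external
    .withColumn("day", F.to_date("meta.dt"))
    .withColumn("ua", ua)
    .groupBy("day")
    .agg(F.countDistinct("ua").alias("distinct_user_agents"))
)
```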
Public cluster requests-count classes
On the public cluster, the number of user-agents making a single request per day is by far the highest, and as the number of daily requests grows, the number of distinct user-agents diminishes. One way to translate that into real-world usage is that a (relatively) small number of bots each make a lot of requests per day, while quite a few humans each make a small number of requests per day.
The second chart shows that most user-agents have made requests on a single day of June, fewer user-agents have made requests on 2 different days, and so on.
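Both charts derive from per-user-agent daily counts. A sketch bucketing the counts in powers of ten and counting active days:

```python
# Per user-agent daily request counts, bucketed in powers of ten,
# plus the number of distinct days each user-agent was active (sketch).
from pyspark.sql import functions as F

per_ua_day = (
    ok_external
    .withColumn("day", F.to_date("meta.dt"))
    .withColumn("ua", F.col("http.request_headers").getItem("user-agent"))
    .groupBy("ua", "day")
    .agg(F.count("*").alias("requests"))
)

# Buckets: 1 request -> 0, 2-10 -> 1, 11-100 -> 2, etc.
request_classes = (
    per_ua_day
    .withColumn("bucket", F.ceil(F.log10("requests")))
    .groupBy("day", "bucket")
    .agg(F.countDistinct("ua").alias("user_agents"))
)

active_days = per_ua_day.groupBy("ua").agg(
    F.countDistinct("day").alias("days_active"))
```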
Public cluster request-count and max-query-time classes
On the public cluster, most max-query-times above 10s come from user-agents making a small number of daily queries (1 to 10), while user-agents making large numbers of daily requests issue queries that mostly take less than 1s. No pattern emerges from looking at user-agents' max-query-time per day.
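The cross-tabulation can be sketched by adding the daily max query-time to the same per-user-agent aggregate as above:

```python
# Daily request count and max query-time per user-agent (sketch),
# same grouping as per_ua_day above with an extra aggregate.
from pyspark.sql import functions as F

ua_day_stats = (
    ok_external
    .withColumn("day", F.to_date("meta.dt"))
    .withColumn("ua", F.col("http.request_headers").getItem("user-agent"))
    .groupBy("ua", "day")
    .agg(
        F.count("*").alias("requests"),
        (F.max("query_time") / 1000.0).alias("max_query_time_s"),
    )
)
```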
User-agents with 1 daily request
Having so many user-agents making a single request per day made us wonder whether they were more bot-ish or user-ish. To check that, we used the fact that the raw user-agent string (the one used for the analysis above) is parsed into a map of predefined fields using the ua-parser library.
In the next chart, undefined user-agents are those for which no parsed field is set, meaning the user-agent is not a usually-parseable one (neither a browser user nor a regular bot).
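For illustration, here is how a raw user-agent string is parsed with the ua-parser Python library (the pre-1.0 `user_agent_parser` API; the example string is hypothetical):

```python
# Parse a raw user-agent string into OS / browser / device fields
# with ua-parser (pip install ua-parser; legacy user_agent_parser API).
from ua_parser import user_agent_parser

raw_ua = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.8 Chrome/18.0.1025.1"
parsed = user_agent_parser.Parse(raw_ua)

print(parsed["os"]["family"])          # OS family, e.g. a Windows XP entry
print(parsed["user_agent"]["family"],  # browser family and major version,
      parsed["user_agent"]["major"])   # e.g. Chrome 18
print(parsed["device"]["family"])      # 'Other' for desktop browsers
```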
Finally, checking the not-undefined user-agents for June 9th (a big peak in the number of distinct user-agents) shows that a large number of distinct user-agents share the same parsed user-agent value:
OS | Browser | Device | Unique user agents
---|---|---|---
Windows XP | IE 8 | Other | 1163
Windows XP | IE 9 | Other | 1131
Windows XP | Chrome 18 | Other | 626
Windows NT | IE 9 | Other | 619
Windows XP | Chrome 31 | Other | 618
Windows XP | Chrome 21 | Other | 613
Query concurrency
For this section, only public cluster data has been taken into consideration, and some charts use log scales. Also, requests with response code 500 have been included in this section's data, to account for queries timing out.
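A sketch of the adjusted filter, assuming the same table and field names as in the earlier sketches:

```python
# Public cluster only, keeping 500 responses to account for timeouts (sketch).
from pyspark.sql import functions as F

concurrency_events = (
    spark.table("event.wdqs_external_sparql_query")
    .where((F.col("year") == 2020) & (F.col("month") == 6))
    .where(F.col("http.status_code").isin(200, 500))
)
```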
In-flight queries
The first chart shows that for active hosts (wdqs1*, as opposed to wdqs2*) the number of in-flight requests peaks at 4 or 5, with a very flat long tail. The second chart shows that the sum of the query-time of in-flight requests has two modes: the smaller between 10ms and 100ms, and the larger between 100s and 1000s. Those two modes show that there are often one or two long-running queries being computed on the active hosts.
The next two charts show that it seldom happens that a backend host takes a long time to process its next query, and that even when processing long-running queries, the number of in-flight requests doesn't grow much. This means that processing long-running requests doesn't block smaller requests from being processed, which is good!
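One way to derive in-flight counts from the event data is a sweep line: each query contributes +1 at its start and -1 at its end, and a cumulative sum per host gives the concurrency. A hedged sketch, assuming `hostname`, `meta.dt`, and `query_time` (milliseconds) field names:

```python
# Sweep-line sketch for in-flight request counts per backend host.
# Ties at identical timestamps are resolved arbitrarily here.
from pyspark.sql import Window, functions as F

start_s = F.col("meta.dt").cast("timestamp").cast("double")  # epoch seconds
timed = (
    concurrency_events
    .withColumn("start_s", start_s)
    .withColumn("end_s", start_s + F.col("query_time") / 1000.0)
)
deltas = (
    timed.select("hostname", F.col("start_s").alias("ts"), F.lit(1).alias("delta"))
    .unionByName(
        timed.select("hostname", F.col("end_s").alias("ts"), F.lit(-1).alias("delta")))
)
w = (
    Window.partitionBy("hostname")
    .orderBy("ts")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
in_flight = deltas.withColumn("in_flight", F.sum("delta").over(w))
```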
Same repeated query
The query SELECT ?simbad ?item { ?item wdt:P3083 ?simbad VALUES ?simbad {}} has been repeated 150,794 times in June 2020, with an average query-time of 76ms. We use this query as a baseline and show that there is no noticeable correlation between query concurrency and the repeated query's processing time.