Data Platform/Data Lake/Traffic/Pageviews/Bots

TL;DR

Bot detection has now been implemented. See more details at Analytics/Data Lake/Traffic/BotDetection.

While we see several projects that would benefit from more precise bot identification, we think that at this time there are workarounds we can use to filter bot traffic in most areas, and that we should not expend the resources and computational effort that a thorough bot detection system would require.

We think it is worth spending time quantifying our TRUE bot traffic, so that when management asks a question like "How much of our traffic is crawling?" we can give an estimate by, say, researching bot traffic monthly, weekly, and daily in one given month.

At this time, 15% of our pageview traffic (not requests) is detected as bots; we estimate that the real bot traffic might be quite a bit higher.


Action Items

  • [Erik Z., Joseph] Quantify the % of bots detected using the regex. DONE. It is about 15% of our pageviews (it may have been more before the switch to HTTPS; we're checking).
  • [Nuria] Investigate whether it is possible to set a flag on X-Analytics when a request has no cookies whatsoever; this might be a cheap way to probabilistically identify bots (see the sketch after this list). https://phabricator.wikimedia.org/T114370
  • [Analytics Engineering or Research] Studying one month of data, give an estimate of bot traffic monthly, weekly, and daily so we have a number to work with and communicate to the C-level.
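For illustration, here is a minimal Python sketch of the nocookies idea from the second action item. The actual tagging would happen at the Varnish layer; the request dicts, field names, and X-Analytics formatting below are hypothetical stand-ins for the real webrequest data.

    # Tag a request with nocookies=1 in an X-Analytics-style key=value string
    # when it arrives with no Cookie header at all.
    def tag_nocookies(request):
        """Append nocookies=1 to the request's X-Analytics string if it has no cookies."""
        if not request.get("cookie"):  # Cookie header absent or empty
            x_analytics = request.get("x_analytics", "")
            request["x_analytics"] = (x_analytics + ";nocookies=1") if x_analytics else "nocookies=1"
        return request

    requests = [
        {"cookie": "WMF-Last-Access=30-Sep-2015", "x_analytics": "https=1"},
        {"cookie": "", "x_analytics": "https=1"},  # no cookies: possibly a bot
        {"cookie": None, "x_analytics": ""},       # no cookies, no prior tags
    ]
    tagged = [tag_nocookies(r) for r in requests]
    share = sum("nocookies=1" in r["x_analytics"] for r in tagged) / len(tagged)
    print("%.0f%% of requests carry no cookies" % (100 * share))

The share of nocookies-tagged pageviews would then give a cheap upper-bound-style signal for bot traffic, since most crawlers never send cookies.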

Meeting Notes 2015-09-30

Attending: Ellery, Joseph, Leila, Madhu, ErikZ, Nuria

The main driving factor in setting up this meeting was to decide whether it is worth committing the effort to identify bots. Tagging bot traffic would be helpful, for example, for using the Last-access method to count uniques (Analytics/Unique_clients/Last_access_solution). We went around the room to list other projects that would benefit from proper bot identification.

Ellery: Research projects that look at (human) browsing patterns need bots filtered out first to make sense of the data. While tedious, this is doable by, for example, filtering out any source that sends traffic at too high a frequency within a one-minute interval.
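A minimal Python sketch of that rate-based filtering, with an assumed per-minute cutoff and an assumed (source, minute) event format; the production data and threshold would differ:

    from collections import Counter

    REQUESTS_PER_MINUTE_CUTOFF = 60  # assumption: >1 pageview/second is unlikely to be human

    def high_frequency_sources(events):
        """events: iterable of (source, minute_bucket) pairs.
        Returns the sources whose per-minute request count exceeds the cutoff."""
        per_minute = Counter(events)  # (source, minute) -> request count
        return {src for (src, _minute), n in per_minute.items() if n > REQUESTS_PER_MINUTE_CUTOFF}

    events = [("10.0.0.1", "2015-09-30T12:01")] * 200 + [("10.0.0.2", "2015-09-30T12:01")] * 5
    bots = high_frequency_sources(events)
    human_events = [e for e in events if e[0] not in bots]
    print(bots, len(human_events))  # {'10.0.0.1'} 5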

Leila: Having properly tagged bots will help teams like Readership truly know the number of human pageviews we have. C-levels are not, in general, aware of the true amount of bot traffic.

Joseph: Agrees it would be useful, but is not sure it is worth the computational cost; if we set cookies and trace requests with and without cookies, we are halfway there.

Madhu: Let's investigate whether we can inspect cookies on Varnish and tag requests that do not have any (see action items).

Our new code identifies 15% of pageviews as bots. Erik (and QChris) are pretty sure the regex catches up to 95% of non-malicious bots. To be done: check whether the percentage of bots in all pageviews was higher before mid-2015 (not by much, it seems; Erik will check further).
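As an illustration of this style of detection, a toy user-agent check in Python; the pattern below is a simplified stand-in, not the actual regex in the pageview definition:

    import re

    # Self-identifying crawlers usually advertise themselves with tokens like these.
    BOT_UA = re.compile(r"bot|crawler|spider|https?://", re.IGNORECASE)

    def is_self_identified_bot(user_agent):
        return bool(BOT_UA.search(user_agent or ""))

    print(is_self_identified_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))          # True
    print(is_self_identified_bot("Mozilla/5.0 (Windows NT 6.1; rv:40.0) Firefox/40.0"))  # False

Note the limitation implied above: a crawler that masquerades as a browser user agent slips through such a regex, which is part of why we estimate the real bot share to be higher than 15%.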

What about uniques? Won't they suffer from this same bot problem when we implement them?

Ellery & Leila: No, cause we can use the "2+ visit" data we will get easily with uniques to estimate out "1 visit". Increasing the 2+ KPI will work as a proxy to increase the 1+ KPI.