Data Platform/Dashboard tutorial
Please help improve the docs by reviewing and updating this tutorial! See phab:T367577.
NOTE: Dashiki hasn't been maintained in a while, and therefore only builds with Node version 6.17.1. We are evaluating whether to deprecate or refresh it going forward.
This page describes how to easily build dashboards that visualize statistics on WMF's data sources using Analytics' pipeline and tools.
Overview
Supported data sources
Analytics' dashboarding pipeline supports all data that can be accessed through stat1007.eqiad.wmnet and stat1006.eqiad.wmnet. Examples include:
- MediaWiki database replicas
- EventLogging database replica
- Hive tables on top of pageview, search, edit, and other data
Supported layouts and visualizations
Different dashboard layouts allow you to present different types of data. Available layouts are:
- The tabs layout lets you organize your visualizations in a tab navigation menu. See: User agent breakdowns
- The metrics-by-project layout addresses the problem of showing stats on multiple projects and languages on the same visualization. See: Vital signs
- The compare layout can help you if your metrics are split into 2 realms that are to be compared, for example: Wikitext editor vs. VisualEditor. See: Edit analysis
These layouts can be customized to use the available visualizations:
- Line chart
- Tabular text data
- Sunburst chart
- Stacked bar chart
New layouts and visualizations can be added fairly quickly. The Analytics team will do this for you if necessary.
Necessary knowledge
You don't need to be a software developer to create dashboards using this pipeline. However, basic knowledge of the following concepts is needed:
- SQL to adapt your queries to the expected conventions
- Git and Gerrit for downloading tools and submitting changes for review
Tools used
- Reportupdater: a Python program that periodically executes custom SQL queries against given data sources and progressively builds TSV reports.
- Dashiki: a serverless dashboarding application written in JavaScript that visualizes TSV reports (among other data sources) and is configured on wiki.
Work time and calendar time expectation
The time it takes to have a dashboard up and running depends on the number of visualizations that you want to include in your dashboard. Also, if this is your first time using this pipeline, you'll have to spend some time reading this documentation, familiarizing yourself with the tools, creating a repository and deploying a dashboard instance. Finally, the process includes a code review by the Analytics team. As a rough approximation, consider: 2-3h getting started + (1-2h work * #visualizations) + 1-2d waiting for Analytics + 1-2h testing and deploying.
Example dashboards
- pingback.wmcloud.org (tracking third-party MW deployments around the world)
- UserAgent and browser reports (browser statistics, new and improved data coming soon, used to be quite popular)
- wmcs-edits.wmcloud.org (edits by tools and bots that use Wikimedia Cloud Services)
Step by step
The following section explains in detail what you need to do to spin up a dashboard using the Analytics dashboarding pipeline. It assumes that you already have one or more SQL queries that work on one of the supported data sources. The steps can be divided into 2 blocks: generating reports and generating visualizations. If you already have TSV reports in analytics.wikimedia.org/datasets/, please skip the Generating reports section and go directly to Generating visualizations.
Generating reports
Adapt your SQL queries to reportupdater's conventions
Reportupdater needs the queries to return a single data point. So modify your queries to look like:
SELECT
DATE('{timestamp_from}') AS date,
COUNT(*) AS metric_1,
SUM(some_value) AS metric_2
FROM some_table
WHERE
timestamp >= '{timestamp_from}' AND
timestamp < '{timestamp_to}'
GROUP BY date;
Note that this query uses 2 reportupdater placeholders: {timestamp_from} and {timestamp_to}. Reportupdater will replace them with the timestamps of a given time interval. The first column the query selects should be named date and formatted as YYYY-MM-DD. So, for timestamp_from = 20160101000000 and timestamp_to = 20160102000000, the final output of the query will look like this:
date        metric_1    metric_2
2016-01-01  34          2532
This is likely to be the only adaptation you need. But you might also want to use reportupdater's explode by wiki feature. With this option enabled, reportupdater will execute your query for each wiki in a list and generate a separate report for each one of them. To enable explode by wiki, put a new placeholder named {wiki_db} in the FROM statement of your query, like this:
FROM {wiki_db}.some_table
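For illustration, here is the example query from above with the {wiki_db} placeholder applied; reportupdater would run it once per configured wiki and write a separate report for each (table and column names are placeholders, as before):
SELECT
    DATE('{timestamp_from}') AS date,
    COUNT(*) AS metric_1
FROM {wiki_db}.some_table  -- reportupdater substitutes each configured wiki database here
WHERE
    timestamp >= '{timestamp_from}' AND
    timestamp < '{timestamp_to}'
GROUP BY date;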
If you want to dig deeper into all reportupdater features and options, please take a look at the full reportupdater documentation.
Write the configuration for your reports
Create a folder and put all your queries in it. Also, create a new file in it named config.yaml. The folder structure should look like:
query_folder
|__ query_1.sql
|__ query_2.sql
...
|__ query_N.sql
\__ config.yaml
The config.yaml file will contain the pieces of information reportupdater needs to execute the queries and create the reports. It has 3 sections: databases, defaults and reports. This is an example of a config.yaml file:
databases:
el:
host: "analytics-store.eqiad.wmnet"
port: 3306
creds_file: /a/.my.cnf.research
db: log
defaults:
db: el
reports:
query_1:
granularity: days
starts: 2015-04-01
query_2:
granularity: weeks
starts: 2016-02-01
explode_by:
wiki_db: enwiki, dewiki, frwiki
In the databases section, specify the host and port you want to connect to, point to the credentials file and give a default database. In the defaults section, specify that the default db is the one you just configured. And in the reports section, write a config block for each query you want to run. For each report, you have to specify the granularity and start date. If you are using the explode by wiki feature, indicate which wiki databases you want reportupdater to execute your queries on. Please look at the full reportupdater documentation for more detail.
Test locally
A great way to know if your queries and configuration will work in production is testing the generation of your TSV report files on your local machine. To do that, you have to:
- Clone the reportupdater repository.
- In a separate terminal, run ssh -L 3307:analytics-store.eqiad.wmnet:3306 stat1006.eqiad.wmnet to create an SSH tunnel to the database host. Replace analytics-store.eqiad.wmnet with your host if necessary.
- Create a credentials file as explained in the README and put it into the reportupdater/test/locally/ folder. Warning: never push the credentials file to Gerrit!
- Modify the databases section in your config.yaml file to point to localhost:3307 and to the credentials file you just created. You can copy the example shown in reportupdater/test/locally/config_example.yaml. Don't forget to revert this when you're finished testing.
- Run reportupdater with python update_reports.py /path/to/your/query/folder/ /tmp/output -l info.
You should see reportupdater running and outputting info logs. If everything goes well, the report files will be written to /tmp/output. Note that this test does not work with queries against the Hive cluster! If this feels too complicated, please let the Analytics team know and we'll help you :-).
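Putting the steps above together, a minimal local test session could look like the following sketch (hostnames and paths are the examples used above; adjust them to your setup):
# Terminal 1: open the SSH tunnel to the database host and keep it open while testing.
ssh -L 3307:analytics-store.eqiad.wmnet:3306 stat1006.eqiad.wmnet

# Terminal 2: after pointing your config.yaml at localhost:3307 and at your
# local credentials file, run reportupdater against your query folder.
cd reportupdater
python update_reports.py /path/to/your/query/folder/ /tmp/output -l info

# Inspect the generated TSV reports.
ls /tmp/output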
Deploy reportupdater job
Once reportupdater is able to execute your queries, you can submit a change to Gerrit. Add your folder containing the queries and config to the root level of the reportupdater-queries repository. If you already have a separate query repository or want to use your own new repository, that's OK too. In any case, add someone in Analytics as a reviewer of your patch and ping us. We will review the code and merge it. We'll also create a trigger in Puppet to launch reportupdater for you. If everything goes well, after a couple of days you should see your report files made public in analytics.wikimedia.org/datasets/periodic/reports.
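A typical way to prepare and submit that patch is sketched below. It assumes the queries repository lives under the analytics/ namespace on Gerrit and that you have git-review installed; the query folder name is a placeholder, and you should verify the exact clone URL on the repository page:
# Clone the queries repository (the Gerrit project path here is an assumption; verify it).
git clone "ssh://$USER@gerrit.wikimedia.org:29418/analytics/reportupdater-queries"
cd reportupdater-queries

# Copy your query folder (placeholder name) to the root of the repository.
cp -r /path/to/query_folder my_dashboard_queries
git add my_dashboard_queries
git commit -m "Add queries and config for my dashboard"

# Send the change to Gerrit for review.
git review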
Generating visualizations
Write the configuration for your dashboard
Dashiki dashboards are configured on wiki, so you'll need to create a wiki page on meta.wikimedia.org and fill in some lines of JSON configuration. You should name the page Config:Dashiki:NameOfYourDashboard. Depending on which Dashiki layout you choose, your configuration will be different.
Config for the tabs layout
The most generic Dashiki layout is the tabs layout. With it, you can organize your visualizations under a tab navigation menu. It looks like this: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os
To use it, write a JSON config following this example:
{
"title": "Dashboard Title",
"subtitle": "Dashboard Subtitle",
"desc": "A description of your dashboard",
"tabs": [
{
"title": "Tab 1 Title",
"dataRange": {
"startDate": "2015-06-01" // graphs will show data from this day on
},
"graphs": [
{
"title": "Graph 1 title",
"type": "dygraphs-timeseries", // visualizer name
"path": "path/to/report/file1.tsv", // relative to analytics.wikimedia.org/datasets/periodic
"format": "percent" // number format
}, {
"title": "Graph 2 title",
"type": "hierarchy",
"path": "path/to/report/file2.tsv",
"pivot": {
"dimension": "column to pivot by",
"metric": "column to aggregate (if needed) when pivoting"
}
},
...
]
}, {
"title": "Tab 2 title",
"graphs": [
...
]
},
...
]
}
You can find a full working example in the Sample/tabs dashboard config. Note that the // comments and ... in the JSON code above are just explanatory, and they need to be removed.
In the graph-type field you can specify one of: dygraphs-timeseries, table-timeseries or hierarchy. Choose dygraphs-timeseries if you want to graph your data as a line chart (tip: each column in your report, except the first one that holds the date, will become a line in the chart). Choose table-timeseries if you want to show your data as a table. Or choose hierarchy if you prefer graphing your data as a sunburst hierarchical chart; in that case, your report file needs to have a cube-like structure with a header like date dimension1 dimension2 ... dimensionN metric, where the dimension values are categorical strings and metric is a number. In the graph-path field, write the path to the report file relative to https://analytics.wikimedia.org/datasets/periodic/reports/metrics/.
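For reference, here is a minimal, comment-free tabs configuration of the kind you could paste into your Config:Dashiki:NameOfYourDashboard page; every title, date and path below is a placeholder that you must replace with your own values:
{
    "title": "My Dashboard",
    "subtitle": "My Dashboard Subtitle",
    "desc": "What this dashboard shows",
    "tabs": [
        {
            "title": "Overview",
            "dataRange": {
                "startDate": "2016-01-01"
            },
            "graphs": [
                {
                    "title": "Metric 1 over time",
                    "type": "dygraphs-timeseries",
                    "path": "query_folder/query_1.tsv"
                }
            ]
        }
    ]
}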
Config for the metrics-by-project layout
If you have a metric that you want to visualize across all Wikimedia projects (all 200+), this is the layout you want to use. It looks like this: https://analytics.wikimedia.org/dashboards/vital-signs/#projects=ruwiki,frwiki/metrics=Pageviews
This Dashiki layout is very useful for reports that use the explode by wiki feature (see: Generating reports). It lets users select the wikis they want to add to the chart in a friendly way. To use it, you first have to add a bit of JSON config to an existing page on meta.wikimedia.org: Dashiki:CategorizedMetrics. Open it for editing and add a config block for your queries:
{
"name": "Category your metrics belong to",
"metrics": [
{
"definition": "https://meta.wikimedia.org/wiki/Some_page_that_explains_your_metric",
"name": "A Readable Metric Name",
"metric": "name_of_your_query_folder",
"submetric": "name_of_your_query",
"api": "datasets"
},
...
]
}
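As an illustration, an entry wired to the example query folder from the Generating reports section could look like this; the category name, metric name and definition URL are placeholders:
{
    "name": "Example Category",
    "metrics": [
        {
            "definition": "https://meta.wikimedia.org/wiki/Some_page_that_explains_your_metric",
            "name": "Example Metric",
            "metric": "query_folder",
            "submetric": "query_2",
            "api": "datasets"
        }
    ]
}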
Be careful not to write any invalid JSON, otherwise other dashboards may break. You can check if everything is OK by visiting this dashboard. If you see data, everything's fine; otherwise, please revert your change to Dashiki:CategorizedMetrics and try again. After that, you have to write the actual configuration for your dashboard in the Config:Dashiki:NameOfYourDashboard page, but this one is really short:
{
"defaultProjects": [
"enwiki", "dewiki", "frwiki"
],
"defaultMetrics": [
"Metric Name 1"
],
"metrics": [
"Metric Name 1", "Metric Name 2", "Metric Name 3"
]
}
Note that you can only list as many defaultProjects as there are wiki files in your reports folder. And the metric names should match those specified on the Dashiki:CategorizedMetrics page.
Config for the compare layout
The compare layout works well when you have two families of reports that share the same metrics, and you want to compare them. An example could be an A/B test, where you want to compare the metrics of both groups A and B. Dashiki expects the report files to be structured like this:
https://analytics.wikimedia.org/datasets/periodic/reports/metrics
|__ query_name_1
|   |__ group_name_1
|   |   |__ enwiki.tsv
|   |   |__ dewiki.tsv
|   |   \__ frwiki.tsv
|   \__ group_name_2
|       |__ enwiki.tsv
|       |__ dewiki.tsv
|       \__ frwiki.tsv
|__ query_name_2
|   |__ group_name_1
|   |   |__ enwiki.tsv
...
Note that this layout is also designed specifically for the explode by wiki feature! So write a configuration that follows this example:
{
"a": "group_name_1",
"b": "group_name_2",
"startDate": "2015-04-01",
"comparisons": [
{
"title": "Readable Metric Name 1",
"type": "timeseries",
"metric": "query_name_1",
"desc": "Description of the chart 1"
},
{
"title": "Readable Metric Name 2",
"type": "timeseries",
"metric": "query_name_2",
"desc": "Description of the chart 2"
},
...
]
}
Test locally
Once you have the configuration in place on meta.wikimedia.org, you can easily test the dashboard on your local machine (your desktop, no VM needed) following these steps:
- You need npm installed on your machine.
- Clone the Dashiki repository and cd into it.
- Execute npm install to set up the dependencies.
- Execute cd semantic && gulp build.
- Build the project with gulp --layout <LAYOUT> --config <CONFIG>, where <LAYOUT> is either tabs, metrics-by-project or compare, and <CONFIG> is the title of your config page on meta.wikimedia.org without the Config: prefix.
- Spin up a local file server, for example by executing python -m SimpleHTTPServer 5000 from the dist subdirectory that contains the output generated by gulp.
- Visit localhost:5000 in your browser to see your dashboard, hopefully full of colorful charts :-)
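Putting the steps above together, a build session could look like the following sketch, run from inside your local clone of Dashiki; the layout and config page name are placeholders, and the exact output directory under dist/ depends on your build:
# Install dependencies and build the Semantic UI assets.
npm install
(cd semantic && gulp build)

# Build the dashboard for your layout and on-wiki config page (both placeholders).
gulp --layout tabs --config Dashiki:NameOfYourDashboard

# Serve the gulp output (cd into the dist subdirectory that contains it)
# and open http://localhost:5000 in your browser.
cd dist
python -m SimpleHTTPServer 5000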
Deploy your dashboard
Once you've tested that the dashboard works on your local machine, you can deploy it for other people to see. Add a block like this one to the end of the config.yaml file in Dashiki's root directory:
name-of-your-dashboard:
layout: tabs
config: MyDashboardConfig
hostname: name-of-your-dashboard-test.wmflabs.org
Then submit a patch to Dashiki's repository in Gerrit. The Analytics team will review, merge and deploy it for you!
Support
Even though this documentation is far from perfect, we in Analytics would love to help you configure your dashboards with this pipeline. So, for any question, feedback or comment on this process, please send an email to the Analytics mailing list or open a ticket on Phabricator tagged with the Analytics project. Thanks!