Data Platform/Systems/Reportupdater
reportupdater is a Python program that periodically executes given SQL queries (or scripts with SQL-like output) and appends their results to timeline TSV report files. If you have cron jobs harvesting data on any of the analytics data machines, you can likely benefit from using reportupdater. Think of it as a more "robust" way to execute cron jobs.
Some datasets generated by reportupdater: https://analytics.wikimedia.org/datasets/periodic/reports/metrics/
Some of its features are:
- Detects which data points are missing in the reports and executes the corresponding queries to update those missing values.
- Supports different report granularities: daily, weekly and monthly.
- Allows parametrizing the query (or script) to generate several reports, one for each value of the parameter.
- Generates consistent and comprehensive logs with different levels.
- It is robust against interrupted or parallel execution, so report files don't get corrupted.
- You can collect data from the MediaWiki databases, the EventLogging database and from the Hive cluster.
Download
Get the code from Gerrit. The code is also mirrored to Wikimedia's GitHub.
Dependencies
You need Python 3 to run reportupdater.
To install dependencies, cd into reportupdater's root directory and execute pip install -r requirements.txt.
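For example, assuming the repository was cloned to ~/reportupdater:

# adjust the path to wherever you cloned reportupdater
cd ~/reportupdater
pip install -r requirements.txt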
Queries and scripts
You should write a set of either SQL queries or executable scripts (or both combined) to be run by reportupdater. Regardless of which you choose, you must write code that returns a single data point per metric, per day. For example, if splitting by wiki, then the metric path should include {wiki} and the output would include exactly one row for each value of the wiki column. Additional dimensions can be included as columns in the output, and should be included in the metric path, for example Foo.byEditCount.{edit_count_bucket}.byWiki.{wiki}.
Reportupdater will execute the query for each specified date interval to build the whole timeline report.
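For illustration, a single execution of a daily query split by wiki could output something like this (column names and values are hypothetical):

date	wiki	edits
2023-06-01	enwiki	1200
2023-06-01	dewiki	340
2023-06-01	eswiki	95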
There are a couple conventions that your code should follow:
SQL Query conventions
- Use the placeholders {from_timestamp} and {to_timestamp} to time-bound the results, for example: WHERE timestamp >= '{from_timestamp}' AND timestamp < '{to_timestamp}' (note that from_timestamp should be inclusive, while to_timestamp should be exclusive). Their format is YYYYMMDD000000.
- The first column of the results must be DATE('{from_timestamp}') AS date. This is an unnecessary limitation and might be removed in the future, but for now it is required.
- There is no restriction on the number, format or contents of the subsequent columns.
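As a minimal sketch, a query following these conventions might look like this (the table and column names are hypothetical):

-- Hypothetical example: daily edit counts per wiki.
-- {from_timestamp} and {to_timestamp} are filled in by reportupdater (YYYYMMDD000000).
SELECT
    DATE('{from_timestamp}') AS date,
    wiki,
    COUNT(*) AS edits
FROM some_edit_table
WHERE timestamp >= '{from_timestamp}'
  AND timestamp < '{to_timestamp}'
GROUP BY wiki;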
Script conventions
- The script file MUST be marked as executable; otherwise reportupdater will fail with a cryptic "permission denied" error.
- The first 2 parameters passed to the script are start_date and end_date; their format is YYYY-MM-DD. Use them to time-bound the results (note that start_date should be inclusive, while end_date should be exclusive).
- The output of the report must be in TSV format and must contain a first row with a header.
- The first column must be equal to the start_date parameter (consider naming it date). This is an unnecessary limitation and might be removed in the future, but for now it is required.
- There is no restriction on the number, format or contents of the subsequent columns.
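As a minimal sketch, a script following these conventions could look like the following (column names and values are made up; remember the file must be executable):

#!/usr/bin/env python3
# Hypothetical reportupdater script: prints a TSV header plus one row per wiki.
import sys

# reportupdater passes the dates as the first two arguments (YYYY-MM-DD).
start_date, end_date = sys.argv[1], sys.argv[2]  # start is inclusive, end is exclusive

print('date\twiki\tsome_metric')
for wiki, value in (('enwiki', 123), ('dewiki', 45)):  # placeholder data
    print('{}\t{}\t{}'.format(start_date, wiki, value))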
Where to put these files?
You should put all queries and scripts inside the same dedicated directory. When executing reportupdater you'll pass that directory as an argument, and reportupdater will know where to find your code. If you want to host your datasets at https://analytics.wikimedia.org/datasets/periodic/reports/metrics/, put your queries and scripts in a repository and ask the Analytics team to clone it to the appropriate location.
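For example, a query directory could look like this (file names are hypothetical; the config.yaml file is described below):

my_queries/
    config.yaml        # reportupdater configuration
    random_report.sql  # SQL report, report id: random_report
    another_report     # executable script, report id: another_report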
Sample queries executed by other teams
Against the EventLogging database:
https://github.com/wikimedia/analytics-discovery-stats
Config file
You also need to write a YAML config file that will tell reportupdater how to run the reports. There are four main sections in the config file:
The databases section
If you're not using SQL queries, you can skip this section. The databases section specifies which databases to connect to. Usually, it should include the host, the port, the credentials file and the database name, like this:
databases:
    any_db_id:
        host: analytics-slave.eqiad.wmnet
        port: 3306
        creds_file: /etc/mysql/conf.d/research-client.cnf
        db: log
But if your queries are hitting the analytics replicas (formerly dbstore1002 or analytics-store), the configuration should be different. The analytics replicas have been restructured for performance and maintainability, and they are now divided into shards. Reportupdater provides a specific configuration option, so you don't need to deal with shards. You should set `auto_find_db_shard` to true, and drop `host` and `port`, like this:
databases:
    analytics_replica:
        auto_find_db_shard: true
        creds_file: /some/path/to/.my.cnf
        db: enwiki
Gotcha: if the table that you're querying belongs to an extension (e.g. the Echo tables), it will be in a separate shard called x1. You have to indicate this by setting `use_x1` to true:
databases:
    analytics_replica:
        auto_find_db_shard: true
        use_x1: true
        creds_file: /some/path/to/.my.cnf
        db: enwiki
Now, if you are using the `explode_by` feature, e.g. with `explode_by: wiki`, then a single query connects to several databases. To specify that, set `wiki_db_placeholder` to the name of the placeholder that holds the database names you're exploding by (in this case `wiki`). Also, you should remove the `db` parameter:
databases:
    analytics_replica:
        auto_find_db_shard: true
        wiki_db_placeholder: wiki
        creds_file: /some/path/to/.my.cnf
The graphite section
It specifies which graphite server to send stats to. It should include the host and the port. It can optionally include a dictionary of lookups, in which the key will be an identifier for a value in a metric execution. For example, if your metric explodes by wiki, you can provide a lookup dictionary for the identifier "wiki". This could be used to translate the wiki database name, like enwiki, to the wiki hostname, like en.wikipedia. The value in the lookups dictionary is a filename to load as that lookup. Or, if you'd like to translate values from a column in your report, you can provide a dictionary for "column1" for example.
graphite:
    host: some.graphite.host
    port: 1234
    lookups:
        wiki: /some/path/to/wiki-lookup.yaml
        column1: /some/path/to/column1-lookup.yaml
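As a sketch, a wiki lookup like the one above could be a simple mapping from database name to hostname (the file format and values here are assumptions, shown only for illustration):

# /some/path/to/wiki-lookup.yaml (assumed format: a plain key-value mapping)
enwiki: en.wikipedia
dewiki: de.wikipedia
eswiki: es.wikipedia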
If you're not sending stats to graphite, you can skip this section.
The reports section
It describes the reports to be executed by reportupdater. For each report, there are three mandatory fields you must provide: the report id, the granularity and the start date; and there are some optional fields you can also specify, like this:
reports:
    anyReportId:
        granularity: days
        starts: 2016-01-01
    anyOtherReportId:
        granularity: months
        starts: 2014-06-15
        type: script
        lag: 10800
    yetAnotherReportId:
        granularity: weeks
        starts: 2000-01-01
        max_data_points: 10
        explode_by:
            dimension1: valueA, valueB, valueC
            dimension2: valueK, valueL
    graphiteReportId:
        granularity: days
        explode_by:
            wiki: enwiki, dewiki, eswiki
        graphite:
            path: "{_metric}.{column0}.{wiki}"
            metrics:
                "graphite.metric.name.prefix1": column1ToSendAsValue
                "graphite.metric.name.prefix2": column2ToSendAsValue
- The report id: It must match the name of the custom executable file. For example, if you're using a SQL query named random_report.sql, then the report id must be random_report (without the .sql extension). If you're using a script, name it the same as the report id. The report id also dictates the name of the generated report: if you use the id random_report, the resulting report will be named random_report.tsv.
- granularity: days, weeks or months. Depending on this parameter, reportupdater will execute your queries/scripts every day, week or month, passing the corresponding date range to them. Mandatory.
- starts: Indicates the first date to be computed (inclusive). Its format should be YYYY-MM-DD. Mandatory.
- type: sql or script. If sql, reportupdater will interpret your custom executable as SQL code. If script, it will execute your custom code via the command line. Optional. Default: sql.
- lag: Positive integer. The time to wait (in seconds) to start the execution of the query/script after the desired time interval has finished. It can be useful to wait until the source data is loaded into the database. Optional. Default: 0.
- db: String. Should be one of the database identifiers specified in the databases section. Indicates which database reportupdater should connect to in order to execute the query. It only works with sql reports; if your report is of type script, this option is ignored. Optional. Default: see defaults section.
- max_data_points: Positive integer. The maximum number of data points the report will contain. Note that it depends on the granularity: if you have a weekly granularity and set max_data_points to 10, your report will hold the data for the last 10 weeks. Optional. If not set, the report will hold all data points since the specified starts parameter.
- execute: String. You may want to have two or more reports using the same query or script. You can do so by assigning a query or script name to the execute field of a report. Optional. If it is not set, reportupdater will execute the query/script that matches the report identifier.
- explode_by: Dictionary<string: list> (in YAML format). If you want to have N identical reports, one for each section of your product, you can configure it by using explode_by. For example, wiki: enwiki, dewiki, jawiki, frwiki would generate 4 reports, one for each wiki. Note that you would have to include the {wiki} placeholder in your SQL query to slice its results. If you're using a script, all explode_by values will be passed as arguments to the script (right after the date arguments) in alphabetical order. If you want, you can specify a file path instead of the value list; if you do so, reportupdater will read that file to get the value list (one value per line, separated by \n).
- graphite: Dictionary. If you want to send your report results to graphite, you must configure the metric names and values that will be sent. In this section, you define two things. First, a path template string: a Python string template that will be filled in with values from a single report result row. This string will be formatted with values from the row itself (keyed by column name), values from the explode_by section (keyed by the explode_by key), and the special {_metric} described next. Second, a dictionary of metrics to send: the key in this dictionary will be substituted in the path string where you use {_metric}, and the value is the column to use for the actual value to send to graphite. For a complete example using graphite, see https://gerrit.wikimedia.org/r/#/c/322007/6/reports/config.yaml.
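To illustrate, take the graphiteReportId example above and assume (hypothetically) that the enwiki report produces a row where column0 is "mobile", column1ToSendAsValue is 42 and column2ToSendAsValue is 7. Reportupdater would then send something like:

# hypothetical metrics derived from the path template "{_metric}.{column0}.{wiki}"
graphite.metric.name.prefix1.mobile.enwiki = 42
graphite.metric.name.prefix2.mobile.enwiki = 7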
The defaults section
Any value that you assign here will apply to all reports. If a report also has a value for that same key, it'll override the default value.
The examples below show a config with defaults compared to an equivalent config without defaults:
- With defaults
defaults:
    type: script
    starts: 2018-01-01
    explode_by:
        wiki: wikis.txt
daily_report:
    granularity: days
weekly_report:
    granularity: weeks
monthly_report:
    granularity: months
- Without defaults
daily_report:
    type: script
    starts: 2018-01-01
    granularity: days
    explode_by:
        wiki: wikis.txt
weekly_report:
    type: script
    starts: 2018-01-01
    granularity: weeks
    explode_by:
        wiki: wikis.txt
monthly_report:
    type: script
    starts: 2018-01-01
    granularity: months
    explode_by:
        wiki: wikis.txt
How to execute?
Use the update_reports.py executable in reportupdater's root directory.
See source.
usage: update_reports.py [-h] [--config-path CONFIG_PATH] [--wikis_path WIKIS_PATH]
                         [-l LOG_LEVEL]
                         query_folder output_folder

Periodically execute SQL queries or scripts and write/update the results into TSV files.

positional arguments:
  query_folder          Folder with *.sql files and scripts.
  output_folder         Folder to write the TSV files to.

optional arguments:
  -h, --help            show this help message and exit
  --config-path CONFIG_PATH
                        Yaml configuration file. Default: <query_folder>/config.yaml.
  -l LOG_LEVEL, --log-level LOG_LEVEL
                        (debug|info|warning|error|critical)
You'll find it useful as well to create a cron job that executes reportupdater periodically, for example every hour.
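For example, a crontab entry like the following would run reportupdater every hour (all paths are hypothetical):

# run reportupdater at minute 0 of every hour
0 * * * * /usr/bin/python3 /path/to/reportupdater/update_reports.py /path/to/queries /path/to/output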
Re-runs
Sometimes, at the time the queries run, the data sources are incomplete or corrupt. In those cases, reportupdater might generate wrong values for a given date. If this happens, you can use the re-run functionality of reportupdater: the rerun_reports.py executable in the root directory.
usage: rerun_reports.py [-h] [--config-path CONFIG_PATH] [-r REPORT]
                        query_folder start_date end_date

Mark reports to be re-run for a given date range.

positional arguments:
  query_folder          Folder with *.sql files and scripts.
  start_date            Start of the date range to be rerun (YYYY-MM-DD, inclusive).
  end_date              End of the date range to be rerun (YYYY-MM-DD, exclusive).

optional arguments:
  -h, --help            show this help message and exit
  --config-path CONFIG_PATH
                        Yaml configuration file. Default: <query_folder>/config.yaml.
  -r REPORT, --report REPORT
                        Report to be re-run. Several reports can be specified like this. If none is
                        specified, all reports listed in the config file are marked for re-run.
Note that the execution of rerun_reports.py does not re-run the reports per se. It only marks given reports and dates so that update_reports.py can re-run them in its next execution. Please, do not alter report files manually! Doing so can result in permission problems.
Example: rerunning all history for the active-boards report from the limn-flow-data repository, in the flow folder:
python /srv/reportupdater/reportupdater/rerun_reports.py -r active-boards /srv/reportupdater/jobs/limn-flow-data/flow 2009-11-08 2017-12-11
WMF-specific
How to test?
To test your queries against the real database, just run the queries there before submitting a code patch.
To test on any analytics client machine:
- you need both repositories cloned: reportupdater-queries and reportupdater
- get the reportupdater-queries changeset you are interested in testing
- if you're testing Hive queries, make sure you've authenticated through kinit
- if you're testing queries against MariaDB databases, make sure the correct credentials file is referenced
/usr/bin/python3 /path/to/reportupdater/update_reports.py -l info /path/to/reportupdater-queries/<your directory>/ ~/reportupdater-output
- It might be necessary to execute as the analytics user; if you do so, grant it permissions on the reportupdater-queries repo so that the pid file can be created.
How to productionize?
- Add your queries/scripts and config file to a new folder within the reportupdater-queries repo, and ask someone in the Analytics team to review.
- Ask the Analytics team to puppetize the execution of your reportupdater instance, or create a puppet patch yourself adding your reportupdater job in [1], and ask someone in the Analytics team to review. There are two puppet modules that run jobs, one for mysql and one for hive.
Troubleshooting
Where are the logs?
For reports running on cron, logs are written to /srv/reportupdater/log. For reports running on systemd timers, see how logs are accessed at Analytics/Systems/Managing_systemd_timers.
Where is puppet code?
- https://github.com/wikimedia/puppet/tree/production/modules/reportupdater
- https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/statistics/cruncher.pp
Support
reportupdater is maintained by WMF's Analytics team.
Next features
- Today the queries and scripts have to output a first column with the date. However, this is not necessary and the requirement could be removed.
- Make all report config fields defaultable in the defaults section, like db.