Referrer daily dashboard
The referrer_daily dataset has many dimensions along which someone might want to aggregate or slice the data. Turnilo is currently the best solution at Wikimedia for visualizing data with these properties, but the existing Turnilo instance is designed for private datasets, so a public instance had to be created in order to share this dataset more broadly. Some technical details are given below; the dashboard itself can be found at: https://wiki-search-referrals.wmcloud.org
Turnilo instance
The Turnilo dashboard is hosted on a Cloud VPS instance maintained by the Wikimedia Research team. The code for setting up the instance can be found here: https://github.com/wikimedia/research-api-endpoint-template/tree/turnilo-druid
Data backend
Turnilo depends on a Druid database backend to scale effectively. Initially a TSV backend was used, but this quickly became too slow as the dataset grew.
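For illustration only, a minimal Turnilo configuration pointing at a Druid broker might look like the sketch below. The broker URL, data cube name, and titles are placeholders rather than the production configuration, and key names (e.g. url vs. host) vary between Turnilo versions.

# Hypothetical Turnilo config sketch; all names and the URL are placeholders.
clusters:
  - name: druid
    url: http://localhost:8082    # Druid broker endpoint (placeholder)

dataCubes:
  - name: referrer_daily
    title: Search engine referrals
    clusterName: druid
    source: referrer_daily        # Druid datasource name (placeholder)
    timeAttribute: __time         # Druid's primary time column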
Updating
Until a more streamlined workflow is developed, updates are handled by a series of daily scripts run via crontab (an illustrative sketch follows this list):
- Export new TSV with yesterday's data from HDFS
- Reformat data to match Turnilo's expected format and append to single flat-file (see below)
- Update flat-file on Analytics server
- Download new file on Turnilo instance and restart dashboard with updated data backend
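For example, the daily chain might be wired up roughly as follows. The times, hosts, paths, and script names here are hypothetical placeholders for illustration; the actual crontab entries differ.

# Hypothetical crontab sketch of the daily update chain; all times,
# paths, and script names are placeholders, not the production setup.
0 5 * * * export_yesterday_from_hdfs.sh /srv/referrals/daily/
0 6 * * * python3 reformat_for_turnilo.py --input_tsv_dir /srv/referrals/daily/ --output_tsv /srv/referrals/referrals_turnilo.tsv --append --header
0 7 * * * publish_to_analytics_server.sh /srv/referrals/referrals_turnilo.tsv
# The final step runs on the Turnilo instance itself:
0 8 * * * fetch_flatfile_and_restart_turnilo.sh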
Reformat data for Turnilo:
import argparse
import csv
import os
from datetime import datetime

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_tsv_dir", help="Directory of TSVs with referral data")
    parser.add_argument("--output_tsv", help="TSV for Turnilo")
    parser.add_argument("--append", action="store_true", help="Append to the existing output TSV")
    parser.add_argument("--header", action="store_true", help="Input TSVs have a header row with fieldnames")
    args = parser.parse_args()

    # Process daily exports in chronological order (filenames sort by date)
    tsvs = sorted(fn for fn in os.listdir(args.input_tsv_dir) if fn.endswith('.tsv'))

    # Default column order of the input exports; overridden if --header is passed
    input_header = ['country', 'lang', 'browser_family', 'os_family', 'search_engine', 'num_referrals', 'day']
    # Turnilo expects the time dimension first
    output_header = ['time', 'country', 'lang', 'browser_family', 'os_family', 'search_engine', 'num_referrals']

    # Append to the existing flat-file when requested; otherwise start fresh
    mode = 'w'
    if os.path.exists(args.output_tsv) and args.append:
        mode = 'a'

    with open(args.output_tsv, mode) as fout:
        tsvwriter = csv.writer(fout, delimiter='\t')
        if mode == 'w':
            tsvwriter.writerow(output_header)
        for tsv in tsvs:
            print("Processing:", tsv)
            with open(os.path.join(args.input_tsv_dir, tsv), 'r') as fin:
                tsvreader = csv.reader(fin, delimiter='\t')
                if args.header:
                    input_header = [c.strip() for c in next(tsvreader)]
                for line in tsvreader:
                    line_json = {c: v for c, v in zip(input_header, line)}
                    line_json['num_referrals'] = int(line_json['num_referrals'])
                    # Convert the day field (YYYYMMDD) to the ISO date Turnilo expects
                    line_json['time'] = datetime.strptime(line_json['day'], '%Y%m%d').strftime('%Y-%m-%d')
                    tsvwriter.writerow([line_json[c] for c in output_header])

if __name__ == "__main__":
    main()
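Assuming the script above is saved as, say, reformat_for_turnilo.py (a hypothetical filename), a daily invocation might look like:

python3 reformat_for_turnilo.py --input_tsv_dir daily_exports/ --output_tsv referrals_turnilo.tsv --append --header

Note that --append only affects how the output file is opened; the script still rereads every TSV in the input directory, so for a pure daily append the directory should contain only the new day's export.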