Search/CirrusSearch
CirrusSearch is a MediaWiki extension that provides search support backed by Elasticsearch. If you want to extend the data that is available to CirrusSearch, have a look at Search/TechnicalInteractionsWithSearch.
Configuration
The canonical location of the configuration documentation is the docs/settings.txt file in the extension source. It also contains the defaults, but the source of truth for defaults is extension.json. A pool counter configuration example lives in the README in the extension source.
WMF configuration overrides live in the ext-CirrusSearch.php and CirrusSearch-{common|production|labs}.php files in the mediawiki-config git repo.
Local Build
A dockerized Cirrus environment with cirrus, elasticsearch, and related bits can be sourced from our integration test runner.
Logging
Via Logstash
Logs from CirrusSearch can be found in the general MediaWiki Logstash dashboard (requires NDA-level access). You can filter with channel:CirrusSearch AND "backend error". This isn't as specific as we'd like it to be, but it should be enough to get you started.
Via Logging Hosts
The logs generated by Cirrus are located on mwlog1001.eqiad.wmnet:/a/mw-log/:
- CirrusSearch.log: the main log. Around 300-500 lines generated per second.
- CirrusSearchRequests.log: contains all requests (queries and updates) sent by Cirrus to Elasticsearch. Generates between 1500 and 3000+ lines per second.
- CirrusSearchSlowRequests.log: contains all slow requests (the threshold is currently set to 10s but can be changed with $wgCirrusSearchSlowSearch). A few lines per day.
- CirrusSearchChangeFailed.log: contains all failed updates. A few lines per day except in case of cluster outage.
Useful commands:
See all errors in real time (useful when doing maintenance on the Elasticsearch cluster)
tail -f /a/mw-log/CirrusSearch.log | grep -v DEBUG
WARNING: you can rapidly get flooded if the pool counter is full.
Measure the throughput between Cirrus and Elasticsearch (requests/sec) in real time
tail -f /a/mw-log/CirrusSearchRequests.log | pv -l -i 5 > /dev/null
NOTE: this is an estimate because I'm not sure that all requests are logged here; for instance, I think the requests sent to the frozen_index are not logged. You can add something like 150 or 300 qps (guessed by counting the number of "Allowed write" entries in CirrusSearch.log).
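To get a rough rate for those writes, something like the following (same pattern as the commands above) should work:
tail -f /a/mw-log/CirrusSearch.log | grep "Allowed write" | pv -l -i 5 > /dev/null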
Measure the number of prefix queries per second for enwiki in real time
tail -f /a/mw-log/CirrusSearchRequests.log | grep enwiki_content | grep " prefix search " | pv -l -i 5 > /dev/null
CirrusSearch Indexing
CirrusSearch updates the Elasticsearch index by building and upserting almost the entire document on every edit. The revision id of the edit is used as the Elasticsearch version number to ensure that out-of-order writes by the job queue have no effect on index correctness (a sketch of this versioning scheme follows the list below). There are a few sources of writes to the production search clusters, although CirrusSearch accounts for the majority of them. Writes also come from:
- Cirrus Streaming Updater, a Flink application that is due to replace the writes performed by the job queue
- mjolnir-bulk-daemon, run on search-loader instances, which pushes updates generated by the team's Airflow instance into the search clusters. This is primarily the weighted_tags field.
- Logstash, run on apifeatureusage instances, which writes to its own indices in the search clusters
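To illustrate the versioning scheme mentioned above: Elasticsearch's external version type rejects writes that carry a version lower than the one already stored, so a stale job replaying an old revision can't overwrite newer data. The request below is only a minimal sketch of that mechanism; the index name, document id, and fields are illustrative, and CirrusSearch actually writes through its own bulk update pipeline rather than single PUTs like this.
# index page 12345 at revision 1001, using the revision id as an external version
curl -XPUT 'localhost:9200/enwiki_content/_doc/12345?version=1001&version_type=external' \
     -H 'Content-Type: application/json' -d '{"title": "Example page"}'
# replaying an older revision (e.g. version=1000) now fails with a version
# conflict instead of clobbering the newer document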
You can run some scripts from mwmaint1002.eqiad.wmnet, but you need to use a deployment server for backfills.
Adding new wikis
All wikis have Cirrus enabled as the search engine. To add a new Cirrus wiki:
- Estimate the number of shards required (one, the default, is fine for new wikis).
- Create the search index
- Populate the search index
Create the index
mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki $wiki --cluster=all
That'll create the search index on all necessary clusters with all the proper configuration.
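The $wiki variable here and in the commands below is assumed to hold the database name of the wiki being added, for example (hypothetical wiki name):
wiki=examplewiki
mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki $wiki --cluster=all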
Populate the search index
mkdir -p ~/log
clusters='eqiad codfw cloudelastic'
for cluster in $clusters; do
    mwscript extensions/CirrusSearch/maintenance/ForceSearchIndex.php --wiki $wiki --cluster $cluster --skipLinks --indexOnSkip --queue | tee ~/log/$wiki.$cluster.parse.log
    mwscript extensions/CirrusSearch/maintenance/ForceSearchIndex.php --wiki $wiki --cluster $cluster --skipParse --queue | tee ~/log/$wiki.$cluster.links.log
done
If the last line of output from the --skipLinks run doesn't end with "jobs left on the queue", wait a few minutes before launching the second job: the job queue doesn't update its counts quickly, so the job might queue everything before the counts catch up and still see the queue as empty. If this wiki is a private wiki, cloudelastic should be removed from the set of clusters. There's no harm if it's included, but it will (should?) throw exceptions and complain.
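For a private wiki, the same loop works after dropping cloudelastic from the cluster list, e.g.:
clusters='eqiad codfw'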