Search Platform/Weekly Updates/2023-03-10
Summary
The Spark 3 upgrade has been unblocked, thanks to the Data Engineering team providing us with a working Airflow 2 instance. We are on track to deliver this work, with possibly some minimal leftover work to deploy all DAGs.
The Search update pipeline is progressing in line with our revised expectations. We should be able to deliver update lag metrics and validate the deployment on k8s with Flink operators. The planned functional work is delayed until next quarter.
What we've accomplished
Search Analysis
- Sorani and Romanian analyzers are unpacked, with the usual notes on impact: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Unpacking_Notes#Romanian_and_Sorani_Notes_(T325091) - https://phabricator.wikimedia.org/T325091
- Improvements to the Romanian stop word list have been sent upstream to Lucene - https://github.com/apache/lucene/pull/12172
Search Update Pipeline
- Removed the old presto client for Swift in the Flink image that we use, enabling the upgrade to a more recent version of Flink - https://phabricator.wikimedia.org/T304914
Spark 3 Upgrade
- Airflow 2 instance available to start deploying the DAGs that have been migrated to Spark 3 - https://phabricator.wikimedia.org/T327970
- Common supporting code for most DAGs migrated to Spark 3
- Mjolnir (our most complex Spark pipeline) has been merged and tested on Airflow 2. Complete deployment still pending - https://phabricator.wikimedia.org/T329239
- Multiple simpler Spark DAGs migrated to Spark 3 - https://phabricator.wikimedia.org/T329870
Operations / SRE
- WDQS data reload completed on all WDQS servers - https://phabricator.wikimedia.org/T323096. This solves a few known data inconsistency issues, in particular: https://phabricator.wikimedia.org/T322869
- k8s upgrade, which affected our WDQS update pipeline. The upgrade itself went well, but it triggered a known issue in how we report update lag to bots, causing bots to self-throttle for no good reason - https://phabricator.wikimedia.org/T331405
- Re-racking elasticsearch servers in eqiad to allow enabling 10G networking, leading to faster recovery in case of maintenance operations, or unplanned issues - https://phabricator.wikimedia.org/T317816
- WDQS no longer pages when individual servers fail. This is in line with the WDQS SLO and reduces the pressure on our SRE teams - https://phabricator.wikimedia.org/T325324
- Commons Wiki reindexed to ensure all multilingual analysis chain changes are applied - https://phabricator.wikimedia.org/T327895
- JVM upgrade for all elasticsearch and W[CD]QS servers - https://phabricator.wikimedia.org/T329957
  - Note that we also took care of apifeatureusage servers, even if we think the ownership of those servers should move out of the Search Platform team.
- WDQS SLO dashboard is now fully functional: https://grafana.wikimedia.org/d/slo-WDQS/wdqs-slo-s?orgId=1 - https://phabricator.wikimedia.org/T323064
- New WDQS servers put in service, old ones decommissioned - https://phabricator.wikimedia.org/T301167
- Investigation into removing logstash from the Elasticsearch servers and relying directly on log4j sending messages in the appropriate format to syslog. This is proving more complex than expected, due to the Java security manager. The cost of keeping logstash as a log forwarder is low, so we'll keep it - https://phabricator.wikimedia.org/T324335
Misc
- Ongoing work to define a set of SLOs and KPIs around Search - https://docs.google.com/document/d/1gYROXo8Fl7JSxReHAVI22EhcPvG-INVkq79a1C3tfK0/edit
- Fixed a bug with Search indices not being properly updated when pages move between categories - https://phabricator.wikimedia.org/T331127
- Work on Search dashboards is restarting, but the highest priority for our data analyst is supporting SDAW. We will see how much time is left to make improvements to the Search dashboards.
- While we don't yet have full clarity on what our priorities will be for next fiscal year, we have some inputs:
  - The highest priority should be on scaling WDQS, by investigating splitting the Wikidata graph.
  - Search work should focus on supporting editors more than supporting casual readers, which is a change of direction compared to what we have been doing in the past.