Search Platform/Weekly Updates/2023-03-31
Summary
We are wrapping up the quarter, doing some cleanup and getting ready for next quarter. Overall, I'm very happy with what we got done this quarter. We were overly ambitious with our Search Update Pipeline goal, something we identified early in the quarter.
Some highlights:
- We are well above our SLO for WDQS uptime over the last 90 days (https://grafana.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo). We still have a few struggles with the official SLO dashboard and its performance (https://grafana.wikimedia.org/d/slo-WDQS/wdqs-slo-s).
- The plan around scaling WDQS has been communicated (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/March_2023_scaling_update), with 2 office hours meetings with our communities. This raised very important questions around the possibility of optimizing how we use item descriptions, which could significantly reduce the amount of data stored in WDQS.
- More analyzers have been unpacked, including Japanese, Armenian, Latvian, Hungarian, Bulgarian, Lithuanian, Persian, Romanian and Sorani. Turkish and Brazilian are the only 2 analyzers left to unpack. The main goal of this work is to have a coherent way of configuring analysis chains across languages, which will allow us to apply future improvements across all the languages we support. A side benefit is that when we unpack an analyzer, we also introduce some standard behaviours that improve support for that language. You can review the full unpacking notes for more details about individual language improvements: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Unpacking_Notes
- The migration to Spark 3 is complete (finished on the last day of the quarter - we've never been that precise on any estimate). This was a good collaboration exercise with the Data Engineering team; we fumbled a little with coordination at first, but the results are there. We are now running all of our DAGs on Spark 3 and on a brand new Airflow 2 instance. This should help Data Engineering keep their platform up to date, and it reduces our operating burden since the new Airflow instance is aligned with what Data Engineering provides and its operation is handed over to them.
What we've accomplished
Search - Analysis
- Romanian and Sorani wikis have been reindexed to enable the unpacked and upgraded analyzers. In particular, Romanian sees a big increase in the number of results returned now that we merge ş/ș and ţ/ț. See the full write-up at https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Unpacking_Notes#Romanian/Sorani_Reindexing_Impacts - https://phabricator.wikimedia.org/T330783. A rough sketch of what an unpacked analysis chain looks like follows below.
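For readers less familiar with the analysis work, "unpacking" means replacing a monolithic built-in Elasticsearch language analyzer with an equivalent custom chain (char filters, tokenizer, token filters) that we can then adjust per language. The snippet below is a minimal, hypothetical sketch of what an unpacked Romanian chain could look like, including a character filter that merges the cedilla and comma-below forms; the real CirrusSearch configuration is more involved (see the unpacking notes above), and all names here are illustrative.

```python
# Hypothetical sketch of an "unpacked" Romanian analysis chain for Elasticsearch.
# Not the actual CirrusSearch config; filter/analyzer names are illustrative.
romanian_settings = {
    "analysis": {
        "char_filter": {
            # Fold the legacy cedilla forms into the comma-below forms,
            # so ş/ș and ţ/ț match each other at index and search time.
            "romanian_charfilter": {
                "type": "mapping",
                "mappings": ["ş => ș", "Ş => Ș", "ţ => ț", "Ţ => Ț"],
            }
        },
        "filter": {
            "romanian_stop": {"type": "stop", "stopwords": "_romanian_"},
            "romanian_stemmer": {"type": "stemmer", "language": "romanian"},
        },
        "analyzer": {
            # Roughly equivalent to the built-in "romanian" analyzer, but now
            # each component can be adjusted independently.
            "text": {
                "type": "custom",
                "char_filter": ["romanian_charfilter"],
                "tokenizer": "standard",
                "filter": ["lowercase", "romanian_stop", "romanian_stemmer"],
            }
        },
    }
}
```

Once every language uses this unpacked shape, a cross-language improvement (say, an extra normalization filter) can be applied uniformly to all analysis chains instead of being blocked by the opaque built-in analyzers.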
Spark 3 Upgrade
- Migration of all DAGs is complete! See https://phabricator.wikimedia.org/T322905 and all its subtasks for details. A simplified sketch of what a DAG looks like on the new setup follows below.
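For context on what the migrated setup looks like, here is a deliberately generic sketch of an Airflow 2 DAG that submits a Spark 3 job using the stock SparkSubmitOperator. Our real DAGs live in the shared airflow-dags repository and follow Data Engineering's conventions; the DAG name, paths, class and Spark settings below are invented for illustration.

```python
# Illustrative Airflow 2 DAG submitting a Spark 3 job; names and paths are invented.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="example_search_metrics",  # hypothetical DAG name
    start_date=datetime(2023, 3, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["search-platform", "example"],
) as dag:
    compute_metrics = SparkSubmitOperator(
        task_id="compute_metrics",
        # Hypothetical artifact; real jobs are shipped as versioned artifacts.
        application="hdfs:///wmf/example/search-metrics-job.jar",
        java_class="org.example.search.MetricsJob",  # made-up entry point
        conf={"spark.dynamicAllocation.maxExecutors": "32"},
        application_args=["--date", "{{ ds }}"],
    )
```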
Search Update Pipeline
- Started work on a CI pipeline for Flink applications - https://phabricator.wikimedia.org/T326318
- More progress on deploying Flink with the k8s operator, but we are not quite there yet.
- CirrusSearch update lag is being collected, but there are some discrepancies in the data that will need further investigation - https://phabricator.wikimedia.org/T320408. A rough sketch of the lag concept follows below.
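To make the lag metric concrete: the update lag for a document can be read as the time between the MediaWiki edit event and the moment the corresponding update is visible in Elasticsearch. The sketch below only illustrates that definition and the kind of percentiles we would typically look at; it is not the actual collection code (that work, and the data discrepancies, are tracked in T320408), and the field names are hypothetical.

```python
# Minimal sketch of the "update lag" idea: how long after an edit does the
# corresponding document update land in the search index?
# Field names and the data source are hypothetical, not the real pipeline.
from datetime import datetime
from statistics import quantiles


def update_lag_seconds(event_ts: datetime, indexed_ts: datetime) -> float:
    """Lag between the MediaWiki edit event and the Elasticsearch update."""
    return (indexed_ts - event_ts).total_seconds()


def lag_percentiles(lags: list[float]) -> dict[str, float]:
    """Summarize observed lags with the percentiles we would typically watch."""
    p = quantiles(lags, n=100)
    return {"p50": p[49], "p95": p[94], "p99": p[98]}


if __name__ == "__main__":
    # Fake sample of observed lags in seconds, just to exercise the summary.
    lags = [2.5, 4.0, 7.5, 12.0, 45.0, 600.0] * 20
    print(lag_percentiles(lags))
```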
Operations / SRE
- Fixed permissions in hdfs://analytics-hadoop/wmf/data/discovery - https://phabricator.wikimedia.org/T331580
Misc
- OKRs for Q4 are almost ready, pending a final review by the team.
- The WDQS stabilization plan has been communicated, with 2 follow-up office hours to answer questions. One of them was scheduled at an Australia-friendly time, which was very well received by a few Australian community members (it is too rare that we schedule office hours at times that work for them).
- WDQS Office hours raised an interesting data duplication issue that might be reasonable to address. A number of item descriptions could be automatically inferred instead of being entered into Wikidata. See https://phabricator.wikimedia.org/T303677