Search Platform/Weekly Updates/2024-02-23
Summary
We are in the deployment phase of our multilingual zero-results rate improvements. The new Elasticsearch plugins are deployed and the new configuration is ready. Once the configuration is deployed, we will need to reindex all wikis (which takes 2-3 weeks) and then analyze the improvements.
We are in active conversation with people from Scholia around the WDQS graph split. The discussions are constructive, but we are identifying major impacts on Scholia queries. Some we might be able to resolve, some we might not. In particular, the authors and scholarly articles are on different graphs, which makes some queries complex and expensive to run. Scholia also covers other types of papers besides scholarly articles, and these now need to be treated differently. This conversation and investigation need to continue.
What we've accomplished
Improve multilingual zero-results rate
- The config for the ICU tokenizer + ICU token repair upgrades, along with a few new character mappings, has been committed. We just need to wait for it to be deployed, then we can reindex. - https://phabricator.wikimedia.org/T356643
- Indexing times seem to increase significantly with ICU token repair, but are still within acceptable limits. Some preliminary notes are in our standup notes (https://etherpad.wikimedia.org/p/search-standup) and a full write-up will be published soon on https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Analyzer_Harmonization_Notes#Timing_Tests
WDQS graph splitting
- Meeting with Daniel Mietchen and Lane Raspberry
- Agreed to have regular meetings
- Still some questions about why we're doing this and what problem we are trying to solve
- Question about what Wikidata is: is this only a WDQS problem, or should Wikidata stop accepting some data and ask communities to use another hosting solution?
More notes: https://etherpad.wikimedia.org/p/wdqs-graph-split-2024-02-15-1
- Started a page with example queries found in the samples and how to rewrite them with federation - https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples
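To illustrate what rewriting a query with federation can look like when authors and scholarly articles sit on different graphs, here is a minimal sketch in Python that builds a SPARQL 1.1 query using a SERVICE clause. It is not taken from the examples page. The endpoint URL and the function name are hypothetical placeholders; P31 (instance of), P50 (author) and Q13442814 (scholarly article) are the usual Wikidata identifiers.

```python
# Hypothetical endpoint for the scholarly-articles graph. The real URL is not
# given in these notes, so this is a placeholder only.
SCHOLARLY_ENDPOINT = "https://scholarly.example.org/sparql"


def authored_articles_query(author_qid: str, limit: int = 10) -> str:
    """Build a query meant to run on the main graph and federate to the
    scholarly graph for the articles (hypothetical helper)."""
    if not (author_qid.startswith("Q") and author_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item id: {author_qid!r}")
    return f"""\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?article ?authorLabel WHERE {{
  # The author item is assumed to live on the graph the query is sent to.
  wd:{author_qid} rdfs:label ?authorLabel .
  FILTER(LANG(?authorLabel) = "en")
  # The articles are assumed to live on the other graph, reached via SERVICE.
  SERVICE <{SCHOLARLY_ENDPOINT}> {{
    ?article wdt:P31 wd:Q13442814 ;
             wdt:P50 wd:{author_qid} .
  }}
}}
LIMIT {limit}
"""


if __name__ == "__main__":
    # Q12345 is an arbitrary id used only to show the generated query text.
    print(authored_articles_query("Q12345", limit=5))
```

Before the split, both triple patterns could sit in one basic graph pattern. After the split, every join between an author and an article crosses the SERVICE boundary, which is one reason such queries become more complex and more expensive.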
Operations
- Backfilled the w[dc]qs reconciliation DAG after a failure of the canary events system. Deployed a quick patch to stop creating this DAG dynamically and instead sense both DC partitions via a single sensor. This should decouple it from the wmf_conf.eventgate_datacenters var in Airflow, which had been set to codfw only to work around the canary event issue (and which I reverted back to ["eqiad", "codfw"] to do the backfill).
- We might expect some turbulence during the upcoming DC switch (March 19); DE might still rely on a manual switch of that config var in Airflow.
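The idea of sensing both DC partitions with a single sensor, rather than generating one DAG per datacenter from a config var, can be sketched in plain Python as follows. This is not the deployed patch and does not use the Airflow API. The table name, the function names and the partition layout (datacenter/year/month/day/hour) are assumptions made for illustration.

```python
from __future__ import annotations

from datetime import datetime
from typing import Iterable

# Both datacenters are listed statically, so the check no longer depends on a
# config var that may temporarily be set to a single DC.
DATACENTERS = ("eqiad", "codfw")


def required_partitions(table: str, hour: datetime,
                        datacenters: Iterable[str] = DATACENTERS) -> list[str]:
    """Return the partition specs a single sensor would wait on for one hour.

    The partition layout used here is an assumption, not the actual schema.
    """
    return [
        f"{table}/datacenter={dc}/year={hour.year}/month={hour.month}"
        f"/day={hour.day}/hour={hour.hour}"
        for dc in datacenters
    ]


def all_present(required: Iterable[str], available: Iterable[str]) -> bool:
    """True only when every required partition is available."""
    return set(required) <= set(available)


if __name__ == "__main__":
    # "event.example_stream" is a hypothetical table name.
    needed = required_partitions("event.example_stream", datetime(2024, 2, 23, 5))
    print(all_present(needed, needed[:1]))  # only one DC has landed so far
```

With this shape, a DC switch changes which partition receives real traffic but not which partitions the sensor waits on, as long as canary events keep both partitions populated.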