Search Platform/Weekly Updates/2024-03-08
Summary
ICU token repair work is deployed; reindexing is in progress and will take a few weeks. We will analyze the impact once reindexing is complete.
Search Update Pipeline: operational procedures are in place, with some significant changes to how we reindex. We'll wait a week or two before migrating production wikis to the new SUP. It is still likely that the full migration will spill over to next quarter.
WDQS graph splitting: we're working mostly on impact analysis, user documentation and support for Scholia in their migration. It is still open whether Scholia will migrate to our split graph or pursue an alternative (hosting their own QLever instance of the full graph).
The Search Platform team will be at an in-person offsite next week, so there will be no status update that week.
What we've accomplished
Improve multilingual zero-results rate
- Repair multi-script tokens split by the ICU tokenizer: development, deployment and configuration are complete; the reindex is tracked separately - https://phabricator.wikimedia.org/T332337
- ICU Tokenizer is enabled on (almost) all wikis - https://phabricator.wikimedia.org/T356643
- New textify plugin is deployed, including the work on ICU token repair - https://phabricator.wikimedia.org/T356651
- Reindexing continues! codfw reindexed through d (skipping Commons) without incident, so I've queued up everything else that's left and let it rip. I put Commons and Wikidata first just to get them out of the way. Everything else will be alphabeticalish. eqiad has about an hour left to go on the d's, then it will follow with Commons, Wikidata, and e-z. - https://phabricator.wikimedia.org/T342444
- Work on "Dotted I fixing" is starting - https://phabricator.wikimedia.org/T358495
Search Update Pipeline
- Implemented a custom Flink sink around Elasticsearch's bulk client, mainly to leverage retries and, more generally, to get more control over the client configuration - https://phabricator.wikimedia.org/T356933
- Developed recovery/reindex procedures for the new Search Update Pipeline - https://phabricator.wikimedia.org/T356803
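To illustrate the kind of retry behaviour mentioned for the bulk sink above, here is a minimal sketch in Python. It is not the team's Flink implementation: `send_bulk`, `TransientBulkError`, and the `max_attempts` and `base_delay` parameters are hypothetical names chosen for this example, and the exponential backoff policy is an assumption.

```python
import time
from typing import Callable, Sequence


class TransientBulkError(Exception):
    """Hypothetical error standing in for a retryable bulk failure."""


def send_with_retries(
    send_bulk: Callable[[Sequence[str]], int],
    actions: Sequence[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send_bulk(actions), retrying transient failures with exponential backoff.

    send_bulk is a hypothetical callable that submits one bulk request and
    returns the number of actions it accepted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send_bulk(actions)
        except TransientBulkError:
            if attempt == max_attempts:
                # Out of attempts: surface the failure to the caller.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a real sink would also need to decide which failures are safe to retry.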
WDQS graph splitting
- Finished rewriting 10 examples (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples) including the one being discussed at https://github.com/WDscholia/scholia/issues/2423#issuecomment-1936978903 - https://phabricator.wikimedia.org/T357980
- Started working on better understanding federation limits
- Meeting with Scholia (notes: https://etherpad.wikimedia.org/p/wdqs-graph-split-2024-02-15-1)
  - Need to put more documentation on wiki (i.e. move https://etherpad.wikimedia.org/p/wdqs-graph-split-refinement-of-the-split-strategy to a wiki page)
  - They want to continue experimenting with both possible outcomes in parallel:
    - rewriting queries with federation so they can still use WDQS
    - investigating and setting up QLever (using QLever also requires a non-negligible rewrite)
- Started to extract some numbers to help with refining the split (hierarchy of types to define what is a "publication" for WikiCite and the number of items attached to it)
- Compare the results of SPARQL queries between the full graph and the subgraphs - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/WDQS_Graph_Split_Impact_Analysis / https://phabricator.wikimedia.org/T355040
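As a rough illustration of what comparing query results between the full graph and the subgraphs could look like, here is a minimal Python sketch. It is not the method used in the impact analysis linked above: it assumes each result set has already been fetched and flattened into a list of dicts mapping variable names to string values, and `normalize` and `compare_results` are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]


def normalize(rows: List[Row]) -> Counter:
    """Turn result rows into a multiset so row order and variable order are ignored."""
    return Counter(tuple(sorted(row.items())) for row in rows)


def compare_results(full_rows: List[Row], split_rows: List[Row]) -> dict:
    """Summarize how the split-graph results differ from the full-graph results."""
    full, split = normalize(full_rows), normalize(split_rows)
    return {
        "identical": full == split,
        # Rows returned by the full graph but not by the split graph.
        "missing_from_split": sum((full - split).values()),
        # Rows returned by the split graph but not by the full graph.
        "extra_in_split": sum((split - full).values()),
    }
```

Comparing multisets rather than sets keeps duplicate rows significant, which matters for queries without `DISTINCT`; queries with non-deterministic output (for example `LIMIT` without `ORDER BY`) would need separate handling.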