Search Platform/Weekly Updates/2024-08-23
Summary
We've been tying up a few loose ends on the Search Update Pipeline. One improvement in the new pipeline is a standardized way to process weighted tag updates (used, for example, by the Growth team for Link Recommendations). This is timely: we've seen a few issues indexing those recommendations, and simplifying the process will make it more robust.
Data is being reloaded for the WDQS Graph Split, and most of the traffic routing is in place. We hope to announce those new endpoints next week.
What we've accomplished
WDQS graph splitting
- Load balancing configuration for production is underway. https://phabricator.wikimedia.org/T373145 https://phabricator.wikimedia.org/T364368 https://phabricator.wikimedia.org/T364364
- More testing to be done as domain names and load balancing are connected. https://phabricator.wikimedia.org/T370754
Search Update Pipeline / Private Wikis
- Fix redirect processing https://phabricator.wikimedia.org/T372446
- Adapt to CirrusSearch update job behavior: Only process intra-namespace redirects and those coming out of main (0)
- Avoid redundant prefixes, since redirect targets in page_change events already have prefixed titles
- Fixed orchestrator script: Run any pending backfills before shutting down for good - https://phabricator.wikimedia.org/T372128
- Disproved the hypothesis that we lose events in the producer (https://phabricator.wikimedia.org/T372362). The suspicion turned out to be caused by miscounting events going in and out.
- Created follow-up tasks for adopting the new page_weighted_tags_change stream (https://phabricator.wikimedia.org/T366253) and arranged a meeting with the Growth team. Follow-up work on simplifying this architecture and improving visibility is tracked in https://phabricator.wikimedia.org/T372904 and https://phabricator.wikimedia.org/T373140
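
The redirect rule described above (T372446) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Redirect` type, its field names, and both helper functions are hypothetical, and "coming out of main" is read here as "the redirecting page is in namespace 0".

```python
from dataclasses import dataclass

MAIN_NAMESPACE = 0  # MediaWiki's main (article) namespace


@dataclass(frozen=True)
class Redirect:
    """Hypothetical shape of a redirect; not the real page_change schema."""
    source_namespace: int       # namespace of the redirecting page
    target_namespace: int       # namespace of the redirect target
    target_prefixed_title: str  # already prefixed, e.g. "Help:Contents"


def should_process(redirect: Redirect) -> bool:
    """Keep intra-namespace redirects and redirects whose source is in main (0)."""
    return (
        redirect.source_namespace == redirect.target_namespace
        or redirect.source_namespace == MAIN_NAMESPACE
    )


def target_title(redirect: Redirect) -> str:
    """Use the target title as-is: it already carries its namespace prefix,
    so prepending the namespace name again would yield "Help:Help:Contents"."""
    return redirect.target_prefixed_title
```

Under these assumptions, a redirect from main (0) to Help (12) is processed, while a redirect from Help (12) to main (0) is skipped.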
Improve multilingual zero-results rate
- Finished refactoring the asciifolding/icu_folding code https://phabricator.wikimedia.org/T332342 / https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Analyzer_Harmonization_Notes#Part_4%E2%80%94Refactoring_&_Analysis_Notes
- Looking at expanding ICU folding to other languages
Vector based search
- A follow-up coordination meeting on a WE.3.1-focused technical spike for vector embeddings is scheduled.
Misc
- Search backend replacement: Email sent for advance deprecation notice. https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/SXK2FODVJP3R3GRSYE2V2ODPUO74YLFO/
- CirrusSearch: Add support for sorting results by the natural sort order of page titles (note that this will not be activated on WMF wikis, but is available for 3rd parties) - https://phabricator.wikimedia.org/T371458
- Decision on migrating from Elasticsearch to OpenSearch is documented at https://www.mediawiki.org/wiki/Wikimedia_Search_Platform/Decision_Records/Search_backend_replacement_technology - https://phabricator.wikimedia.org/T370661
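
For readers unfamiliar with natural sorting of titles (T371458), the sketch below shows the general idea: runs of digits are compared as numbers rather than character by character. It only illustrates the concept and makes no claim about how CirrusSearch implements it; `natural_key` is a hypothetical helper.

```python
import re


def natural_key(title: str):
    """Split a title into digit and non-digit runs so numbers compare numerically."""
    parts = [p for p in re.split(r"(\d+)", title) if p]
    # Tag each part so numeric and text runs are never compared with each other directly.
    return [(0, int(p)) if p.isdecimal() else (1, p.casefold()) for p in parts]


titles = ["Page 10", "Page 2", "Page 1"]
plain = sorted(titles)                      # ['Page 1', 'Page 10', 'Page 2']
natural = sorted(titles, key=natural_key)   # ['Page 1', 'Page 2', 'Page 10']
```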