Search Platform/Weekly Updates/2023-11-17
Summary
The year-end vacation season and deployment freeze are coming, so work is expected to slow down somewhat until January. We're still on track to deliver what we expected.
What we've accomplished
Improve multilingual zero-results rate
- Finished heuristics for merging Type and Script attributes (e.g., <ALPHANUM>/Latin + <NUM> = <ALPHANUM>/Latin; Latin + Cyrillic = Unknown; etc.). Abandoned the idea of making Script attribute merging more configurable (e.g., keep first, keep last, count characters), so every mixed-script token gets "Unknown" (we're limited to ICU script types; otherwise I'd go with "Mixed") - https://phabricator.wikimedia.org/T332337
- Lots of thinking about configurability of scripts & types to merge (e.g., don't merge <EMOJI> types; only merge <ALPHANUM> types; don't merge CJK scripts, etc.). Still thinking about a "numbers only" option (because the current behavior is an error with respect to UAX #29) - https://phabricator.wikimedia.org/T332337
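To make the merge heuristics above concrete, here is a minimal Python sketch of the two examples given (<ALPHANUM>/Latin + <NUM> = <ALPHANUM>/Latin; Latin + Cyrillic = Unknown). It is an illustration only, not the actual implementation: the names `TokenAttrs`, `merge_type`, `merge_script`, and `merge` are hypothetical, modeling a number token as having no script of its own (`None`) is an assumption, and type pairs not mentioned above are deliberately left unhandled.

```python
from typing import NamedTuple, Optional


class TokenAttrs(NamedTuple):
    """Hypothetical container for the two attributes being merged."""
    type: str               # e.g. "<ALPHANUM>", "<NUM>"
    script: Optional[str]   # e.g. "Latin", "Cyrillic"; None = no script (assumption)


def merge_type(a: str, b: str) -> str:
    """Merge two Type attributes; only the case named in the update is covered."""
    if a == b:
        return a
    if {a, b} == {"<ALPHANUM>", "<NUM>"}:
        return "<ALPHANUM>"
    # Other combinations are not described in the update, so this sketch refuses them.
    raise ValueError(f"merge of {a} and {b} is not covered by this sketch")


def merge_script(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """Merge two Script attributes; any real mix of scripts becomes "Unknown"."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    return "Unknown"


def merge(a: TokenAttrs, b: TokenAttrs) -> TokenAttrs:
    """Merge the attributes of two adjacent tokens being joined into one."""
    return TokenAttrs(merge_type(a.type, b.type), merge_script(a.script, b.script))


if __name__ == "__main__":
    print(merge(TokenAttrs("<ALPHANUM>", "Latin"), TokenAttrs("<NUM>", None)))
    print(merge(TokenAttrs("<ALPHANUM>", "Latin"), TokenAttrs("<ALPHANUM>", "Cyrillic")))
```

The configurability options under discussion (e.g., don't merge <EMOJI> types, only merge <ALPHANUM> types) would sit in front of a function like `merge` as a filter on which token pairs are eligible at all.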
WDQS graph splitting
- Investigated TFT (https://github.com/BorderCloud/TFT) and found that it might not suit our needs for the graph splitting analysis: it does not have a handy way to generate "test scenarios" and seems to be designed to work only against the tests created by the W3C working group (https://github.com/w3c/rdf-tests). It's written in PHP, and I don't think it'd be wise to add such functionality there - https://phabricator.wikimedia.org/T349519
- Started investigating Iguana (https://iguana-benchmark.eu/). It looks more promising, but needs more investigation before we make a decision.
- Scholarly Article Split job is now deployed via Airflow and generating a working graph split - https://phabricator.wikimedia.org/T347989
- Work started on converting the graph split to a format that can be exported and ingested into Blazegraph - https://phabricator.wikimedia.org/T350106
Search Update Pipeline
- Starting a backfill test to validate functional correctness and to check that the load on backend systems is appropriate - https://phabricator.wikimedia.org/T350826
- There are open questions about failure modes. Currently, some failures related to bad input data require manual intervention to recover. Automated recovery in a robust way isn't trivial. Note that at the moment, SUP has been running with production data for multiple days without issues, so failures due to data are at least somewhat rare.
- Helm charts created and validated by deployment - https://phabricator.wikimedia.org/T326328
- Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315
Misc
- Java restarts for security updates - https://phabricator.wikimedia.org/T350703
- Deployment of Mjolnir on Python 3.9 - https://phabricator.wikimedia.org/T346373
- Deployment of WDQS Streaming Updater with Flink / k8s Operators in staging (not in production yet) - https://phabricator.wikimedia.org/T326409
- Decommission search-loader VMs (part of the migration to Debian Bullseye) - https://phabricator.wikimedia.org/T351123
- VisualEditor's Add a link should suggest a redirect with exact case match - https://phabricator.wikimedia.org/T346920