Search Platform/Weekly Updates/2025-01-31
Ongoing work
MLR Improvements
- T383048 Investigate current MLR models for Search and identify improvements: the "easy queries" exploration is in review.
- There's a report, in notebook form, available at https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/merge_requests/6. In short, reweighting training data by query "easiness" did not have the impact we expected. Nonetheless, data analysis shows that the current mjolnir models perform well (NDCG) on both "hard" and "easy" queries. Following our discussion on Wednesday, the notebook has been updated with info about low NDCG scores (queries are omitted from the renderer to avoid exposing PII). Given the outcome of this work, there are three avenues for improving MLR that we are considering:
- incorporate abandoned queries in the training pipeline (requires further research)
- extend MLR to more wikis
- invest in infra improvement to support more frequent model rollouts
As discussed on Wednesday, we'll have a chance to touch base and plan next steps at the upcoming offsite.
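For readers less familiar with the metric, here is a minimal sketch of how NDCG can be computed per query and averaged per "easy"/"hard" bucket. This is not the code from the notebook or from mjolnir: the function names, the exponential gain formula, and the bucket labels are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def dcg(relevances, k=None):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    rels = relevances[:k] if k is not None else relevances
    # Exponential gain (2^rel - 1) with a log2 rank discount; rank is 0-based here.
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(rels))


def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    if ideal == 0:
        # No relevant results at all: define NDCG as 0 rather than divide by zero.
        return 0.0
    return dcg(relevances, k) / ideal


def mean_ndcg_by_bucket(queries, k=None):
    """Average NDCG per bucket.

    `queries` is an iterable of (bucket, relevances) pairs, where bucket is a
    hypothetical label such as "easy" or "hard".
    """
    scores = defaultdict(list)
    for bucket, relevances in queries:
        scores[bucket].append(ndcg(relevances, k))
    return {bucket: sum(vals) / len(vals) for bucket, vals in scores.items()}
```

A call such as mean_ndcg_by_bucket([("easy", [3, 2, 0]), ("hard", [0, 1])]) returns one mean score per bucket, which is the kind of per-bucket comparison summarized above.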
Operations / Misc
- Investigate possible impact of changing some of the RDF prefixes in wikibase
- Started working on migrating the search update pipeline to non-dev schema and stream names - https://phabricator.wikimedia.org/T375821
- this involves removing "development" from schema names, writing to/reading from ".v1" stream instead of ".rc0"
- might be a bit more tedious than anticipated because we have existing users that we don't want to break (i.e. the weighted tags ingestion stream)
- T384255 python_script_executor should run with KubernetesPodOperator: added support to airflow common libs to execute refinery scripts on k8s. [GM]
- removes a blocker for the search instance migration.
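The renaming described for T375821 above can be sketched roughly as follows. This is an illustration only, not the actual migration code: the helper names and the example schema and stream names are hypothetical, and it assumes "development" appears as a path segment in schema names and that the version is the final dot-separated part of the stream name.

```python
def to_production_schema(schema_name):
    """Drop any "development" path segment from a schema name.

    Assumption: schema names are "/"-separated paths.
    """
    return "/".join(part for part in schema_name.split("/") if part != "development")


def to_production_stream(stream_name, old_suffix=".rc0", new_suffix=".v1"):
    """Swap a trailing ".rc0" for ".v1"; leave any other name untouched."""
    if stream_name.endswith(old_suffix):
        return stream_name[: -len(old_suffix)] + new_suffix
    return stream_name


# Hypothetical names, for illustration only.
print(to_production_schema("development/example/update"))
print(to_production_stream("example.update.rc0"))
```

Since existing users such as the weighted tags ingestion stream must not break, a mapping like this would probably need to be applied alongside the old names for a transition period rather than as a one-shot rename.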
Language Stuff: Kuromoji/Sudachi
- Spent a good bit of Friday synthesizing my Sudachi findings into a semi-objective list of either errors or reasonable options/features they could provide, and opened a ticket on their repo. No response yet... [TJ]
https://github.com/WorksApplications/elasticsearch-sudachi/issues/156
- Currently updating our config for Sudachi without a custom dictionary to get a sense of what can and can't be done, so I know how much hassle a custom dictionary is worth, and so I can commit a config that encapsulates all the investigation so far.