Search Platform/Weekly Updates/2025-01-31

From Wikitech

Ongoing work

MLR Improvements

  • T383048 Investigate current MLR models for Search and identify improvements: the "easy queries" exploration is in review.
  • A report, in notebook form, is available at https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/merge_requests/6. In short, reweighting training data by query "easiness" did not have the impact we expected. Nonetheless, data analysis shows that the current mjolnir models perform well (NDCG) on both "hard" and "easy" queries. Following our discussion on Wednesday, the notebook has been updated with info about low NDCG scores (queries are omitted from the rendered notebook to avoid exposing PII). Given the outcome of this work, there are three avenues for improving MLR that we are considering:
    • incorporate abandoned queries in the training pipeline (requires further research)
    • extend MLR to more wikis
    • invest in infra improvements to support more frequent model rollouts

As discussed on Wednesday, we'll have a chance to touch base and plan next steps at the upcoming offsite.
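
For readers less familiar with the metric, here is a minimal sketch of how NDCG can be compared across "easy" and "hard" query buckets. It is illustrative only and is not taken from the notebook or from mjolnir: the function names, the bucket labels, the graded relevance values, and the exponential-gain form of DCG are all assumptions.

```python
import math
from collections import defaultdict


def dcg(relevances):
    """Discounted cumulative gain with exponential gain (one common variant)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """NDCG of a ranked list of graded relevance labels, optionally cut at k."""
    ranked = relevances[:k] if k else relevances
    ideal = sorted(relevances, reverse=True)
    ideal = ideal[:k] if k else ideal
    ideal_dcg = dcg(ideal)
    # A query with no relevant results has no meaningful ideal ranking.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg_by_bucket(queries, k=None):
    """Average NDCG per bucket; `queries` is a list of (bucket, relevances)."""
    scores = defaultdict(list)
    for bucket, relevances in queries:
        scores[bucket].append(ndcg(relevances, k))
    return {bucket: sum(vals) / len(vals) for bucket, vals in scores.items()}


# Hypothetical data: relevance labels in the order the model ranked results.
sample = [
    ("easy", [3, 2, 1]),
    ("easy", [3, 1, 0]),
    ("hard", [1, 3]),
    ("hard", [0, 2, 1]),
]
per_bucket = mean_ndcg_by_bucket(sample)
```

Comparing the per-bucket averages is one simple way to check whether a model degrades on "hard" queries; the bucket assignment itself would come from whatever easiness measure the analysis uses.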

Operations / Misc

Language Stuff: Kuromoji/Sudachi

  • Spent a good bit of Friday synthesizing my Sudachi findings into a semi-objective list of either errors or reasonable options/features they could provide, and opened a ticket on their repo. No response yet... [TJ]

https://github.com/WorksApplications/elasticsearch-sudachi/issues/156

  • Currently updating our config for Sudachi without a custom dictionary to get a sense of what can and can't be done, so I know how much hassle a custom dictionary is worth, and so I can commit a config that encapsulates all the investigation so far.