Search Platform/Weekly Updates/2025-01-10

From Wikitech

Ongoing work

Search - Machine Learning Ranking (MLR)

  • T377128 Import recent MLR models built by MjoLniR in production and test them: We ran an A/B test on 18 wikis to see what improvements we can expect from retraining our MLR models. The results of the tests are available (https://people.wikimedia.org/~ebernhardson/T377128/) and they look great! We see a 2% increase in search satisfaction on some wikis (a 2% improvement in Search is a pretty big deal). Next, we want to check how much those models are biased toward very simple queries.
  • T383048 Investigate current MLR models for Search and identify improvements:
    • I've been working with samples from enwiki, itwiki, dewiki, and nlwiki (TODO: include frwiki) to identify heuristics for "easy queries" detection. The focus lately has been on data analysis, where I’m comparing several heuristics for (query, title) similarity, including cosine, Jaccard, and Levenshtein metrics.
    • So far, the assumption that we have a lot of easy queries seems valid. However, the value of "a lot" depends on the similarity metric applied. I’m iterating on data analysis to identify cutoffs for labeling a query as easy or not. Binary classification seems like a reasonable baseline for this task. I need to polish the notebooks and brainstorm with the team to validate the approach, but I have a plan of attack in mind.
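To make the (query, title) comparison above concrete, here is a minimal sketch of the three similarity metrics plus a cutoff-based "easy query" label. It is an illustration only, not the code from the notebooks: the whitespace tokenizer, the normalization of Levenshtein distance into a 0..1 similarity, the function names, and the 0.8 cutoff are all assumptions made for the example.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase and split on whitespace (a deliberately naive, assumed tokenizer)."""
    return text.lower().split()


def jaccard(query, title):
    """Jaccard similarity between the token sets of the query and the title."""
    a, b = set(tokens(query)), set(tokens(title))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(query, title):
    """Cosine similarity between term-frequency vectors of the query and the title."""
    a, b = Counter(tokens(query)), Counter(tokens(title))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein(s, t):
    """Character-level edit distance, computed with a two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cs != ct),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(query, title):
    """Edit distance rescaled to 0..1, where 1.0 means identical (assumed normalization)."""
    q, t = query.lower(), title.lower()
    longest = max(len(q), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(q, t) / longest


def is_easy(query, title, metric=jaccard, cutoff=0.8):
    """Binary label: 'easy' when similarity reaches the cutoff (0.8 is a placeholder)."""
    return metric(query, title) >= cutoff
```

Because the three metrics produce different score distributions, the same cutoff labels a different share of queries as easy under each one, which is why the value of "a lot" depends on the metric. In practice the cutoff would be chosen per metric from the data rather than fixed up front.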

Language Stuff: Kuromoji/Sudachi

  • T318269 Test and analyze Kuromoji Japanese language analyzer
    • Reviewer finished looking at Sudachi tokenization and it is clearly better than Kuromoji. Kuromoji was probably going to be okay to deploy, but Sudachi is definitely good (in terms of accuracy). However, it is noticeably slower, and it has even more weird edge-case behavior than Kuromoji, so we need to discuss how we should configure and deploy it.
    • After some team discussion, we've decided that Sudachi's reindexing slowness is worth it for the accuracy improvements over Kuromoji (since it only affects Japanese wikis). However, I plan to investigate ways of mitigating Sudachi's slowness, its current lack of OpenSearch 1.x support (it supports 2.x but not 1.x), and its edge-case parsing quirks by looking at whether we can customize the Sudachi default dictionary, either for use with Sudachi or for Kuromoji. Draft Q3 goals have been updated to reflect this. Before that, I plan to finish writing up and publish my Kuromoji and Sudachi findings to date for my Notes.

Search Metrics

WDQS Expose RDF stream publicly

Misc / Operations

  • Our Airflow scheduler has moved to k8s, but a few DAGs are now failing because the Hive CLI is not available there. I'm trying to see if I can quickly migrate some of them to Spark SQL.
  • Investigated a MjoLniR issue in the feature selection task with Gabriele and Joseph. After a deeper look at the logs, the Spark driver is failing with an OOM and not terminating. According to Joseph, the Spark job is constructing a very large job graph and might benefit from being reworked a bit, possibly using checkpoints. In the meantime, we might just try bumping the driver memory settings.

What we've accomplished

Operations / Misc