Search Platform/Weekly Updates/2025-01-10
Ongoing work
Search - Machine Learning Ranking (MLR)
- T377128 Import recent MLR models built by MjoLniR in production and test them: We ran an A/B test on 18 wikis to see what improvements we can expect from retraining our MLR models. The results of the tests are available (https://people.wikimedia.org/~ebernhardson/T377128/) and they look great! We see a 2% increase in search satisfaction on some wikis (a 2% improvement in Search is a pretty big deal). Next, we want to check how strongly those models are biased toward very simple queries.
- T383048 Investigate current MLR models for Search and identify improvements:
- I've been working with samples from enwiki, itwiki, dewiki, and nlwiki (TODO: include frwiki) to identify heuristics for "easy queries" detection. The focus lately has been on data analysis, where I’m comparing several heuristics for (query, title) similarity, including cosine, Jaccard, and Levenshtein metrics.
- So far, the assumption that we have a lot of easy queries seems valid. However, the value of "a lot" depends on the similarity metric applied. I’m iterating on data analysis to identify cutoffs for labeling a query as easy or not. Binary classification seems like a reasonable baseline for this task. I need to polish the notebooks and brainstorm with the team to validate the approach, but I have a plan of attack in mind.
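As a rough illustration of the (query, title) similarity heuristics mentioned above, here is a minimal sketch using only the Python standard library. It is not the code from the notebooks: the whitespace tokenizer, the helper names (`tokens`, `jaccard`, `cosine`, `levenshtein_similarity`, `is_easy`), the choice of normalized Levenshtein for the baseline label, and the `0.8` cutoff are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def tokens(text):
    # Assumption: lowercase + whitespace split; a real pipeline would
    # probably use a language-aware analyzer instead.
    return text.lower().split()


def jaccard(query, title):
    # Overlap of the two token sets: |A ∩ B| / |A ∪ B|.
    a, b = set(tokens(query)), set(tokens(title))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(query, title):
    # Cosine similarity between token-count vectors.
    a, b = Counter(tokens(query)), Counter(tokens(title))
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def levenshtein(s, t):
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(query, title):
    # Edit distance rescaled to [0, 1], where 1.0 means identical strings.
    q, t = query.lower(), title.lower()
    longest = max(len(q), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(q, t) / longest


def is_easy(query, title, cutoff=0.8):
    # Hypothetical binary label; the cutoff is a placeholder, not a tuned value.
    return levenshtein_similarity(query, title) >= cutoff
```

For example, `is_easy("albert einstein", "Albert Einstein")` is true under this sketch, while `jaccard("einstein", "Albert Einstein")` is only 0.5, which illustrates how the share of "easy" queries shifts with the metric and cutoff chosen.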
Language Stuff: Kuromoji/Sudachi
- T318269 Test and analyze Kuromoji Japanese language analyzer
- Reviewer finished looking at Sudachi tokenization and it is clearly better than Kuromoji. Kuromoji was probably going to be okay to deploy, but Sudachi is definitely good (in terms of accuracy). However, it is noticeably slower, and it has even more weird edge-case behavior than Kuromoji, so we need to discuss how we should configure & deploy it.
- After some team discussion, we've decided that Sudachi's reindexing slowness is worth it for the accuracy improvements over Kuromoji (since it only affects Japanese wikis). However, I plan to investigate ways of possibly mitigating Sudachi's slowness, its current gap in OpenSearch support (2.x is supported but 1.x is not), and its edge-case parsing quirks by looking at whether we can customize the Sudachi default dictionary, either for use with Sudachi or for Kuromoji. Draft Q3 goals have been updated to reflect this. Before that, I plan to finish writing up and publishing my Kuromoji and Sudachi findings to date in my Notes.
Search Metrics
- T375387 Include fulltext search results Page Previews of sufficient dwell time in Search Metrics dashboard: Updated https://superset.wikimedia.org/superset/dashboard to include the data on virtual PV (users hovering over a link in the SERP and reading the popup). Only 1% of the abandoned sessions get an interaction like this. It is unclear whether that is enough to justify investigating further what is happening in those sessions, but it might be interesting to have a section on the different kinds of engagement we get on the SERP instead of adding this data to the Abandonment section.
WDQS Expose RDF stream publicly
- T374919 Adapt the rdf-streaming-updater flink job to use wikimedia-eventutilities-flink: Polishing some patches thanks to Gabriele's reviews.
Misc / Operations
- Our Airflow scheduler has moved to k8s, but a few DAGs are now failing because the hive CLI is not available there. I'm trying to see if I can quickly migrate some of them to Spark SQL.
- Investigated a mjolnir issue in the feature selection task with Gabriele and Joseph. After a deeper look at the logs, the Spark driver is failing with an OOM and not terminating. According to Joseph, the Spark job is constructing a very big job graph and might benefit from being reworked a bit, possibly using checkpoints. In the meantime, we might just try to bump the driver memory settings.
What we've accomplished
Operations / Misc
- T382620 The Search/articletopic page at Wikitech appears to be out of date (supporting investigation by the Growth team)
- T377546 MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic: we're not able to reproduce the issue, it might have been fixed already.