Search Platform/Weekly Updates/2025-01-24
Ongoing work
Language Stuff: Kuromoji/Sudachi
- Added speaker review stats and analysis. Sudachi, despite its quirky quirks, is clearly the better tokenizer. Next, I want to commit my custom configs for Kuromoji and Sudachi and close the ticket (T318269). I will then open a new ticket for reviewing dictionaries and for figuring out how to use a custom dictionary with either Kuromoji or Sudachi.
Search Update Pipeline / Weighted tags
- Started removing a few deprecated methods (https://phabricator.wikimedia.org/T374702) [DC]
- a few more still need to be removed
MLR Improvements
- I was able to train a "vanilla" xgboost model with instance reweighting on a small sample (~1% of itwiki for 2025-01). On this sample, I could test the "easy query" pipeline end-to-end, but I don't have meaningful results yet. I've been struggling with performance when trying to scale up to larger datasets. The main bottleneck is accessing the feature_vector table from Spark: the table stores features column-wise as a struct<double> in Parquet, and this adds non-trivial memory pressure even for small sizes. I'm now experimenting (timeboxed) with both Spark/Parquet tunables and alternative access patterns. - https://phabricator.wikimedia.org/T383048
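For readers unfamiliar with instance reweighting, here is a minimal sketch of the general idea. The update does not say which weighting scheme was used, so the per-query inverse-frequency weights below are purely an assumption, as are the helper names `per_query_weights` and `train_weighted` and the training parameters. The only xgboost calls used are `xgboost.DMatrix` (with its `weight` argument) and `xgboost.train`.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def per_query_weights(query_ids: Sequence[Hashable]) -> List[float]:
    """Hypothetical scheme: each row gets weight 1 / (number of rows sharing its query).

    Every query then contributes a total weight of 1, so queries with many
    rows do not dominate the loss.
    """
    counts = Counter(query_ids)
    return [1.0 / counts[q] for q in query_ids]


def train_weighted(features, labels, weights, num_boost_round: int = 50):
    """Fit a plain gradient-boosted regressor with per-row weights.

    The parameters here are illustrative defaults, not the team's settings.
    """
    import xgboost as xgb  # imported lazily; only needed for training

    dtrain = xgb.DMatrix(features, label=labels, weight=weights)
    params = {"objective": "reg:squarederror", "max_depth": 4, "eta": 0.1}
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)
```

Any other weighting (for example, by click confidence or query difficulty) would plug into the same `weight` argument. Only the helper that computes the weights would change.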
Search Metrics
- T375387 Include fulltext search results Page Previews of sufficient dwell time in Search Metrics dashboard: after investigation, we see that page previews are not impacting our abandonment metrics in a significant way. If we start tracking engagement by source, it might make sense to track previews more closely, but for the moment, we're closing this.
What we've accomplished
WDQS Expose RDF stream publicly
- T374919 Adapt the rdf-streaming-updater flink job to use wikimedia-eventutilities-flink
- T374921 Configure https://stream.wikimedia.org to expose rdf-streaming-updater.mutation
- T382065 Add support for active/active double compute streams in the EventStreams HTTP service
- Deployed a new version of the updater that takes advantage of the eventutilities API and also consumes from the page_change.v1 stream - https://phabricator.wikimedia.org/T382065
- Exposed the streams publicly via https://stream.wikimedia.org/ - https://phabricator.wikimedia.org/T374921
- rdf-streaming-updater.mutation.v2
- rdf-streaming-updater.mutation-main.v2
- rdf-streaming-updater.mutation-scholarly.v2
- mediainfo-streaming-updater.mutation.v2
These newly exposed streams let users outside the Wikimedia Foundation consume changes to Wikidata in a much more straightforward way than before (the previous option was polling the RecentChanges API, which is complicated and unreliable). This is particularly useful for people who want to maintain a live, up-to-date copy of Wikidata. We expect that multiple organizations working on RDF storage backends will be interested and will use these new streams to validate their implementations against the needs of Wikidata.
A more complete communication and documentation of those new streams will follow.
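Until that documentation is available, here is a rough sketch of how a client might read one of these streams as server-sent events. It assumes the stream is served under the usual EventStreams URL pattern (`/v2/stream/<stream name>`) and that each event's `data` field is a JSON document. The function names and the User-Agent string are hypothetical, and the payload schema is deliberately not assumed.

```python
import json
from typing import Dict, Iterable, Iterator, List

# Assumed URL: the EventStreams base plus one of the stream names listed above.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/rdf-streaming-updater.mutation.v2"


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield the decoded JSON payload of each complete server-sent event."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)
    # An event with no terminating blank line is incomplete and is dropped.


def print_first_events(limit: int = 5) -> None:
    """Connect to STREAM_URL and print the top-level keys of a few events."""
    from urllib.request import Request, urlopen

    headers = {
        "Accept": "text/event-stream",
        "User-Agent": "example-sse-client/0.1",  # hypothetical client name
    }
    with urlopen(Request(STREAM_URL, headers=headers)) as resp:
        text_lines = (chunk.decode("utf-8") for chunk in resp)
        for i, event in enumerate(iter_sse_events(text_lines)):
            print(sorted(event.keys()))
            if i + 1 >= limit:
                break
```

Calling `print_first_events()` opens a long-lived HTTP connection and stops after a handful of events. A real consumer would also track the `id` field of the last event, so that it can resume after a disconnect.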
Search Update Pipeline / Weighted tags
- Added an alert on the volume of weighted_tags that the search update pipeline is processing (https://phabricator.wikimedia.org/T373459) [DC]
- More fine-grained alerts will have to be done by owners of those tags
Misc / Operations
- Enabled wikitech on the SUP (no ticket)
- this was the last wiki not yet processed by the SUP (wikitech became a production wiki at the end of last year and now uses the event platform)
- 100% of the wikis are now handled by the SUP
- Airflow issues after the k8s migration
- A number of fixes have been contributed to skein, which stabilized mjolnir. The DAG is now running successfully (https://phabricator.wikimedia.org/T383589).
- There is an MR in flight to fix the execution of Python scripts (refinery) on k8s. This will unblock drop_data_daily execution - https://phabricator.wikimedia.org/T384255
- T376440 Deepcategory search does not show any results on Commons instead of results up to the configured limits
- T379046 WeightedTagsUpdater should indicate success of the update
- T383938 Investigate and tune mjolnir resource allocation