Search Platform/Weekly Updates/2025-01-24

From Wikitech

Ongoing work

Language Stuff: Kuromoji/Sudachi

Search Update Pipeline / Weighted tags

MLR Improvements

  • I was able to train a "vanilla" xgboost model with instance reweighting on a small sample (~1% of itwiki for 2025-01). On this sample I could test the "easy query" pipeline end-to-end, but I don't have meaningful results yet. I've been struggling with performance when scaling up to larger datasets. The main bottleneck is accessing the feature_vector table from Spark: the table stores features column-wise as a struct<double> in Parquet, which adds non-trivial memory pressure even for small sizes. I'm now experimenting (timeboxed) with Spark/Parquet tunables as well as alternative access patterns. - https://phabricator.wikimedia.org/T383048
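
For readers unfamiliar with instance reweighting, here is a minimal sketch of one common scheme. It is an illustration only: the update above does not say which weighting is used, so the inverse-frequency rule, the `inverse_frequency_weights` helper and the sample query ids are all assumptions. The resulting per-row weights are the kind of values xgboost accepts through the `weight` argument of `DMatrix`.

```python
from collections import Counter


def inverse_frequency_weights(group_ids):
    """Weight each row by 1 / (number of rows sharing its group id).

    Every group then contributes a total weight of 1, so groups with
    many rows do not dominate training. This rule is an assumed
    example, not necessarily the scheme used in the MLR work.
    """
    counts = Counter(group_ids)
    return [1.0 / counts[g] for g in group_ids]


# Hypothetical query ids for five training rows.
query_ids = ["q1", "q1", "q1", "q2", "q3"]
weights = inverse_frequency_weights(query_ids)

# The weights could then be attached to the training data, e.g.
#   xgboost.DMatrix(features, label=labels, weight=weights)
```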

Search Metrics

What we've accomplished

WDQS Expose RDF stream publicly

These newly exposed streams let users outside the Wikimedia Foundation consume changes to Wikidata in a much more straightforward way than before (the previous option was polling the RecentChanges API, which is complicated and unreliable). This is particularly useful for people who want to maintain a live, up-to-date copy of Wikidata. We expect that multiple organizations working on RDF storage backends will be interested and will use the new streams to validate their implementations against the needs of Wikidata.

More complete communication and documentation for these new streams will follow.

Search Update Pipeline / Weighted tags

  • Added an alert on the volume of weighted_tags that the search update pipeline is processing (https://phabricator.wikimedia.org/T373459) [DC]
    • More fine-grained alerts will need to be set up by the owners of those tags

Misc / Operations