Search Platform/Weekly Updates/2024-09-20
Summary
We closed a number of operational tasks and quality-of-life improvements for our users this week.
Good progress was made on deploying ICU folding to most of the languages supported by Search.
What we've accomplished
WDQS graph splitting
- Reproduce and fix a connectivity issue with the internal HTTP client of Blazegraph (https://phabricator.wikimedia.org/T373812)
- Write a set of subtasks for https://phabricator.wikimedia.org/T294133 (expose the Wikidata RDF update stream)
Improve multilingual zero-results rate
General task: https://phabricator.wikimedia.org/T332342
- I have finished Nepali (after deciding not to do anything special with the occasional Tibetan script), Assamese, and Punjabi. Only Oriya is left, followed by a quick re-check that everything works as expected, a quick code review, and a new patch Friday or Monday.
- ICU folding configs are done for Marathi, Burmese, Malayalam, Telugu, Sinhala, Kannada, & Gujarati.
- I was able to do some additional needed normalization for Marathi, Malayalam, Sinhala, and Gujarati, which is a very nice bonus. I did some minor refactoring so all those share a `case` statement, too.
- Nepali is in progress. I think the configs for Devanagari are done, but I'm looking into the lesser-used Tibetan script (which occurs regularly on-wiki), and that config may be re-usable for the Tibetan language, too.
- After that, Assamese, Punjabi, and Oriya are left. Each language config takes between 20 minutes and 2 days, though I'm averaging ~2½ configs per day, and I'm hoping to finish the configs for the Indic languages by the end of the week.
Search Metrics
Misc / Operations
- Preparation of Q2 goals
- Preparation of Data Platform offsite
- T371929 Index all statements (without value) for all datatypes for haswbstatement - the change is deployed, but it will take up to 16 weeks until all items have been re-indexed
- T372030 Index statements in commons media datatype for haswbstatements - the change is deployed, but it will take up to 16 weeks until all items have been re-indexed
- T373778 NetworkSession and AbuseFilter may be spammy
- T369808 The Commons search "deepcategory" operator often does not work (Deep category query returned too many categories)
- T328330 Create SLI / SLO on Search update lag
- T371648 Run rebuildall.php on cswikivoyage
- T331127 phantom redirects lingering in incategory searches after page moves
- T374637 Decide how to make datasets owned by analytics-search-users also readable by analytics-privatedata-users
- T371401 Adapt search ranking for mul language code