Search Platform/Weekly Updates/2024-09-06

From Wikitech

Summary

We have announced the availability of new Graph Split endpoints: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/September_2024_scaling_update. We will be monitoring communication channels, answering questions, and starting to support the transition.

Just after our announcement that we will be moving to OpenSearch due to Elasticsearch licensing issues, Elastic announced that they are moving to AGPL for Elasticsearch. We will need a bit more time to investigate the implications, but it is likely that we'll upgrade to the latest AGPL Elasticsearch in the near future.

We gave a deep dive presentation on the new Search Update Pipeline. The slides are publicly available, but the recording is not (see below). The presentation covers the Search Update Pipeline, but also addresses a number of generic issues that may apply to other pipelines.

What we've accomplished

WDQS Graph Split

Search Update Pipeline / Private Wikis

  • NetworkSession is showing some odd session-related behaviour. It is supposed to skip session storage, but something is causing anonymous sessions to be created and stored, then overridden by NetworkSession as logged in. Because of the volume, this may be causing issues for logged-in users in general. For now the NetworkSession deployment has been reduced to private wikis only, but we want it turned on everywhere for consistency and so that we can notice future problems. - https://phabricator.wikimedia.org/T373826
  • A possibly related NetworkSession issue: user autocreation could be blocked by AbuseFilter. Patches are up to resolve it. - https://phabricator.wikimedia.org/T373778
  • Progress on moving wikitech into k8s has caused wikitech to start emitting events to the streams we read. Unfortunately we can't contact wikitech through mw-api-int yet, which caused the SUP producer to stop shipping events. A fix that explicitly excludes wikitech from processing has been developed and deployed.
  • The Saneitizer fix rate has been climbing over the past couple of weeks, but we don't know why. We shipped a patch to collect more granular metrics and are updating dashboards to use them.
  • Adapt CirrusSearch to support the new page_change_weighted_tags stream as an alternative outlet for weighted tags. - https://phabricator.wikimedia.org/T372904
  • Prepared and presented a deep dive "Search Update Pipeline - Feeding search with Flink stream processing"; Slides: https://docs.google.com/presentation/d/1m_cfy3NoagPULOsr70E7vaHY63QcqRrkOQLFTuVFK9c/edit?usp=sharing (Recording: https://drive.google.com/file/d/1-Zwe7W7fAzfmhB_4DWU89hVfq8uXmZgO/view)
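
The wikitech exclusion mentioned in the list above amounts to filtering events by wiki before they are processed. The following is only a minimal Python sketch of that idea, not the deployed fix: the real SUP producer is a Flink job, and the `wiki_id` field name, the `EXCLUDED_WIKIS` set, and the `labswiki` identifier are all assumptions made for illustration.

```python
from typing import Iterable, Iterator

# Hypothetical exclusion list; "labswiki" is assumed here to be wikitech's wiki id.
EXCLUDED_WIKIS = frozenset({"labswiki"})


def drop_excluded(
    events: Iterable[dict], excluded: frozenset = EXCLUDED_WIKIS
) -> Iterator[dict]:
    """Yield only events whose (assumed) wiki_id field is not in the exclusion set."""
    for event in events:
        if event.get("wiki_id") not in excluded:
            yield event


# Example input: one event from a regular wiki, one from the excluded wiki.
events = [
    {"wiki_id": "enwiki", "page_id": 1},
    {"wiki_id": "labswiki", "page_id": 2},
]
kept = list(drop_excluded(events))
```

Keeping the exclusion set as a parameter would make it easy to drop the exclusion again once wikitech becomes reachable through mw-api-int.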

Improve multilingual zero-results rate

  • Spinning wheels a bit on Marathi, which uses the Devanagari script, because Hindi also uses it, but my previous analysis of Hindi didn't indicate anything. Turns out the hindi_normalization filter is more aggressive than expected. (Historically enabled language analyzers have never gotten the review that new analyzers get before being enabled... though it's not like they are necessarily less likely to have errors in them.) I don't know if I'll be able to get a patch up for the new languages before I go on vacation.
  • ASCII folding/ICU folding harmonization patches were submitted, SonarQube was wrestled into submission (though, technically, I think SonarQube won), and David approved the patches (thanks!). ICU folding config is committed for 20 new languages!

Search backend replacement

  • Inquired about the Elasticsearch OSI-compatible license and its relicensing.

Misc

  • Updated WikibaseCirrusSearch to support `haswbstatement:P<n>` for all properties. The code ships now, but it will be 16 weeks before all of Wikidata is rerendered with the additional properties. This does not change the `haswbstatement:P<n>=<value>` support; those are still limited. - https://phabricator.wikimedia.org/T371929
  • Identified and fixed a problem where the Cirrus Saneitizer was not correctly fixing redirects indexed into the wrong index. They should be fixed over the next 2 weeks as the Saneitizer completes a loop. - https://phabricator.wikimedia.org/T331127
  • It turns out that some functionality we developed for Wikibase some years ago, which allows customized per-language profiles for wbsearchentities, was lost in the split from Wikibase into WikibaseCirrusSearch. We re-introduced the functionality; it ships next week. - https://phabricator.wikimedia.org/T371401
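
To illustrate the two `haswbstatement` forms mentioned in the list above, here is a small Python sketch that builds both keyword strings and a search URL. It assumes the standard MediaWiki search API parameters (`action=query`, `list=search`, `srsearch`) on www.wikidata.org; the helper names and the example IDs `P31` and `Q5` are illustrative and do not come from this update.

```python
from typing import Optional
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"


def haswbstatement(prop: str, value: Optional[str] = None) -> str:
    """Build a haswbstatement keyword: property only, or property=value."""
    if value is None:
        return f"haswbstatement:{prop}"
    return f"haswbstatement:{prop}={value}"


def search_url(query: str) -> str:
    """Build a full-text search URL for the given query string."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


any_value = haswbstatement("P31")    # items with any statement for the property
exact = haswbstatement("P31", "Q5")  # items with a specific property=value pair
url = search_url(any_value)
```

Until the rerender completes, a property-only query for a newly covered property may return incomplete results.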