Search Platform/Weekly Updates/2023-09-29
Summary
Last week of this quarter!
A Kafka outage led to WDQS not processing updates. This had some impact on our overall work; in particular, we could not deploy some changes for testing. It also helped us discover a lack of isolation between test and production pipelines.
Search Update Pipeline work is focused on deployment and operational improvements. We're waiting for a first production deployment to validate the full integration.
Work on improving the multilingual zero-results rate is focused on improving performance without degrading the accuracy of the new filters. We will need a new plugin to distribute this work.
Our goals for next quarter (Oct-Dec) are being finalised and will be published shortly.
What we've accomplished
Search Update Pipeline
- Upgrade to Flink 1.17 and refactoring to extract code shared between aggregator and indexer - https://phabricator.wikimedia.org/T346719
- Testing various Flink operations on the DSE experimental cluster - https://phabricator.wikimedia.org/T342149
- Improve concurrency limits configuration of the wdqs updater - https://phabricator.wikimedia.org/T346456
- Create a schema for fetch failures - https://phabricator.wikimedia.org/T317609
Improve multilingual zero-results rate
- Decision on how to distribute new filters in a new plugin - https://phabricator.wikimedia.org/T346051#9204248
- Various accuracy and performance tests of the performance-optimized filters vs the original regex. Both are looking good.
Operations
- Small WDQS lag outage (<2 hours), initially caused by MirrorMaker failing between kafka-main and kafka-jumbo. WDQS was still available but not updated with the latest edits on Wikidata - https://wikitech.wikimedia.org/wiki/Incidents/2023-09-27_Kafka-jumbo_mirror-makers
- As part of the above outage, we discovered a lack of isolation between test and production pipelines - https://phabricator.wikimedia.org/T347515
- Fixed issue causing false-positive alerts on the non-active CirrusSearch datacenter due to low traffic volume - https://phabricator.wikimedia.org/T347341
- Tune process_sparql_query_hourly so that it does not get killed by YARN - https://phabricator.wikimedia.org/T347333