Search Platform/Weekly Updates/2023-07-21
Summary
The team is working on the Search Update Pipeline and on improvements to the multilingual zero-results rate.
We are formalizing the collaboration between Data Engineering and Search Platform around Flink deployments, which should help us make progress on the Search Update Pipeline work.
What we've accomplished
Improve multilingual zero-results rate
- Ran into some issues with testing and continuous integration (another patch moved unit tests to integration tests, which reduced test coverage)
- The Language, Harmony, and Unpacking Q&A session is scheduled for Monday, July 24, 2023 at 16:00 UTC (9am PDT, noon EDT, 6pm CEST). Have a look at the initial presentation (https://upload.wikimedia.org/wikipedia/commons/d/d5/Language%2C_Harmony%2C_and_Unpacking%E2%80%94Trey_Jones%2C_July_2023.webm) and add your questions to this etherpad: https://etherpad.wikimedia.org/p/Language,_Harmony,_and_Unpacking%E2%80%94Part_IS
- Handle variation in apostrophe-like characters better; see the full write-up on wiki - https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Analyzer_Harmonization_Notes#Apostrophes_(T315118) / https://phabricator.wikimedia.org/T315118
- Investigate applying aggressive_splitting everywhere, not just on English-language wikis; see the full write-up on wiki - https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Analyzer_Harmonization_Notes#aggressive_splitting_(T219108) / https://phabricator.wikimedia.org/T219108
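To make the last two items above more concrete, here is a minimal Python sketch of both ideas: folding apostrophe-like characters into a plain apostrophe, and splitting tokens at case changes and letter/digit boundaries. It is an illustration only and not the analyzer configuration used in CirrusSearch. The character set in `APOSTROPHE_LIKE` and the function names `normalize_apostrophes` and `split_aggressively` are assumptions made for this example, and the splitter only handles ASCII letters and digits. The linked write-ups describe the real behavior.

```python
import re
from typing import List

# Assumed, illustrative set of apostrophe-like characters; the set handled
# by the real analyzers is described in the linked write-up, not here.
APOSTROPHE_LIKE = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u02BC": "'",  # MODIFIER LETTER APOSTROPHE
    "\u0060": "'",  # GRAVE ACCENT
    "\u00B4": "'",  # ACUTE ACCENT
    "\uFF07": "'",  # FULLWIDTH APOSTROPHE
}
_APOSTROPHE_TABLE = str.maketrans(APOSTROPHE_LIKE)

# Matches, in order of preference: a capitalized or lowercase word,
# a run of capitals not followed by a lowercase letter, or a run of digits.
_PIECE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+")


def normalize_apostrophes(text: str) -> str:
    """Map every character in APOSTROPHE_LIKE to a plain ASCII apostrophe."""
    return text.translate(_APOSTROPHE_TABLE)


def split_aggressively(token: str) -> List[str]:
    """Split an ASCII token at case changes, digit boundaries, and punctuation."""
    return _PIECE.findall(token)
```

With this sketch, a query typed with a curly apostrophe and one typed with a straight apostrophe normalize to the same string, and a token such as `camelCase` yields the pieces `camel` and `Case`, so either piece could match on its own.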
Search Update Pipeline
- Ongoing work on adding support for page re-render to the pipeline: https://gitlab.wikimedia.org/dcausse/cirrus-streaming-updater/-/commits/page-rerender-wip
- Re-order and optimize change events - https://phabricator.wikimedia.org/T325672
- Our use of Kafka topics is now summarized at https://docs.google.com/spreadsheets/d/1Fp44MdLxUVlxi03MBD_64m0zQErny-9jUD5C6RGf_bU/edit#gid=1175302241 so that we can better coordinate with other teams on our requirements
Misc
- Fixed a bug in Wikidata that was preventing undeleted entities from reappearing in WDQS - https://phabricator.wikimedia.org/T341905
- Adding tests to CirrusSearch to prevent an incident similar to https://wikitech.wikimedia.org/wiki/Incidents/2023-06-18_search_broken_on_wikidata_and_commons / https://phabricator.wikimedia.org/T339935
- Add outlink topic model predictions to CirrusSearch indices; this is done to support the ML team in deprecating ORES - https://phabricator.wikimedia.org/T328276