Search Platform/Weekly Updates/2024-02-02
Summary
The team is in reduced capacity for most of this quarter.
The Search Update Pipeline is deployed end to end on Cloudelastic for ~25% of wikis. If no issues are found, we'll start migrating production Search next.
ICU Token Repair (part of Improve multilingual zero-results rate) is close to completion. A good write-up of the project is available: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Analyzer_Harmonization_Notes#ICU_Token_Repair_(T332337).
Intermediate analysis of the queries for WDQS graph splitting shows results that are not as good as expected. For some user agents, more queries than expected are either failing on the main graph or returning different results. This requires additional investigation with a more qualitative approach.
Communication about the test endpoints for WDQS graph splitting was planned for mid-January. It is almost ready and will go out next week.
What we've accomplished
Improve multilingual zero-results rate
- Plugin code complete, minor code review in progress, deployment should be possible starting next week - https://phabricator.wikimedia.org/T332337
- Detailed write-up of the project at https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Analyzer_Harmonization_Notes#ICU_Token_Repair_(T332337)
WDQS graph splitting
- Created a tool that records and compares a set of SPARQL query results - https://phabricator.wikimedia.org/T351819
- Quantitative analysis of SPARQL queries from logs. A Jupyter notebook is available, but it needs more work to be easily readable - https://people.wikimedia.org/~dcausse/T355040_EARLY_DRAFT_wdqs_query_results_analysis.html - https://phabricator.wikimedia.org/T355040
- Main points:
- A more qualitative analysis will be required for some queries that are not easily categorized as working or failing
- MixNMatch is producing a high number of queries failing on a single graph. This requires further analysis and might be related to a specific upload campaign and not reflective of the standard query load
- More details at https://phabricator.wikimedia.org/T355040#9509621
- Preliminary results of the performance analysis show some minor performance improvements for queries running on the main graph compared with the full graph - https://phabricator.wikimedia.org/T355037#9506807
Search Update Pipeline
- Deployed the SUP producer at full rate in both datacenters. Deployed the Cloudelastic consumer in eqiad at ~25% of the final rate. We are monitoring the results. https://phabricator.wikimedia.org/T352335 / https://phabricator.wikimedia.org/T350186 / https://phabricator.wikimedia.org/T354793
Misc / Operations
- Manually disabled the image_suggestions_weekly Airflow DAG while an issue with Cassandra and the SD team's image suggestion job is sorted out - https://phabricator.wikimedia.org/T356400
- This is tangentially related to previous similar issues (https://phabricator.wikimedia.org/T356030) and shows the overall fragility of our data pipelines
- Do not display <languages /> content as the search excerpt. Note that, once deployed, this will not instantly fix existing pages. Each page will be fixed on its next edit, or when the background reindexer reaches it (once every ~16 weeks). - https://phabricator.wikimedia.org/T352915