Search Platform/Weekly Updates/2024-02-09
Summary
Further investigation of failed queries on the WDQS main graph shows that most are coming from a few sources, which gives us some confidence that we can improve the situation significantly by focusing on a small number of use cases.
Other projects are moving along nicely.
What we've accomplished
Improve multilingual zero-results rate
- ICU token repair corpus is built and daily diffs are running. Reviewing diffs from enabling the ICU tokenizer. Mostly looks good, but there are a few things to track down. (Malayalam has the most unusual results and I'm having a little trouble figuring out what's going on: diffs from my regression test set aren't reproducing easily in focused testing. I'll get to the bottom of it eventually.)
WDQS graph splitting
- A draft of the analysis is available on wiki. We confirm the numbers from last week that showed some categories of queries having a high failure rate on a split graph. This is mitigated by the fact that the large majority of failed queries come from a small number of user agents (the top 5 user agents account for >90% of failures), which suggests that a targeted effort can reduce the number of failures significantly. This will need a qualitative analysis, which is planned as a next step. https://wikitech.wikimedia.org/wiki/User:DCausse/WDQS_Graph_Split_Impact_Analysis
- Intermediary report on query performance is published. This analysis covers only queries that run on the main graph (without requiring federation) and shows a modest performance improvement compared to running on the full graph. This confirms our initial assumptions. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/Graph_split_IGUANA_performance
- February update is published, inviting feedback from our communities: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/February_2024_scaling_update
- Project page is published: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split
Misc
- Investigated, restarted, and backfilled a failed data pipeline. https://phabricator.wikimedia.org/T356030
- We participated in a Unicode Consortium meeting about the Foundation's membership. Nothing concrete yet, but a lot of good will and promises to do introductions and work together in the future. This is especially timely with our current work on ICU token repair.