Search Platform/Weekly Updates/2024-03-01
Summary
Search Update Pipeline: 100% of writes to Cloudelastic now go through the new SUP. Operational concerns have been resolved, and we will start deploying to production indices soon. The target of migrating 90% of update traffic to SUP by the end of the quarter is likely to slip into Q4 by a few weeks.
WDQS Graph Splitting: We are making progress in better understanding the Scholia use cases and helping them move forward, both by reviewing some Scholia queries as a proof of concept that SPARQL federation is a viable alternative and by providing documentation on how to rewrite queries to use the graph split. The complexity is not yet under control, from either a technical or a change management standpoint. Scholia is also exploring alternative solutions, including running their own query service on a different RDF backend.
We are experimenting with different hardware configurations to better understand how much graph-loading performance we could gain by throwing hardware at the problem.
What we've accomplished
WDQS graph splitting
- Started a list of people to contact who might help with, or be affected by, the split. Adam contacted one person who might join the Scholia/WikiCite group; Luca will take care of the others.
- Discussion with Scholia/WikiCite:
- Focus on refining the split; started https://etherpad.wikimedia.org/p/wdqs-graph-split-refinement-of-the-split-strategy
- Asked for a spreadsheet of all their queries to get a sense of the amount of work to be done, with some estimation of complexity and "importance". I might start the spreadsheet and ask them to fill it in.
- Working on federated query examples [DC]
- Got one Scholia query working (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples#Property_paths). The query has bugs (https://github.com/WDscholia/scholia/issues/2388, which I fixed because they got worse with federation), and the co-occurrences it returns look suspicious, but the point is to prove that federation is feasible, not to fix the query. I had to give up on the label service here, as I could not get it to work without timing out; it is unclear why, but since the label service is not available in QLever anyway, this is less of an issue. A minimal federated-query sketch follows below.
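As a rough illustration of the rewrite pattern we are documenting, here is a minimal federated-query sketch in Python (using the SPARQLWrapper library). The endpoint URLs and the exact split boundary below are placeholders, not the real or final setup; the part being illustrated is the SERVICE clause that pulls scholarly-article triples from the split-off subgraph while the rest of the query runs against the main graph.

```python
# Minimal sketch of a query rewritten for a split graph: scholarly-article
# triples are fetched via SPARQL federation (SERVICE), everything else runs
# against the main graph. Endpoint URLs are placeholders, not real endpoints.
from SPARQLWrapper import SPARQLWrapper, JSON

MAIN_ENDPOINT = "https://query-main.example.org/sparql"            # hypothetical main-graph endpoint
SCHOLARLY_ENDPOINT = "https://query-scholarly.example.org/sparql"  # hypothetical scholarly-subgraph endpoint

QUERY = f"""
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?article ?topic WHERE {{
  SERVICE <{SCHOLARLY_ENDPOINT}> {{
    ?article wdt:P31 wd:Q13442814 ;   # instance of: scholarly article
             wdt:P921 ?topic .        # main subject
  }}
  ?topic wdt:P31 wd:Q12136 .          # topic is a disease (arbitrary example filter)
}}
LIMIT 10
"""

sparql = SPARQLWrapper(MAIN_ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["article"]["value"], row["topic"]["value"])
```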
- Potential hardware performance improvements:
- AWS Neptune serverless completed latest-all.nt.bz2 in a total of 63 hours. AWS Neptune with a provisioned high-powered server (1.5 TB RAM, 96 vCPU) showed a speed increase of perhaps 60% over the serverless option, but the import was stopped after about 1.3B records of latest-all.nt.bz2 to avoid further costs.
- https://phabricator.wikimedia.org/T358727 has been opened to test the import on a server that already has an NVMe drive.
- AWS-based import speed tests: AWS Neptune serverless with a maximum of 128 NCUs (an NCU is said to be 2 GB of RAM plus some amount of CPU) processed 7_750_230_000 records between 26-February-2024 2:21:11 PM CT and 27-February-2024 1:46 PM CT; this import of the 16-February-2024 latest-all.nt.bz2 file is still ongoing and, as of this writing, is using about 70% of the allocated CPU. A Neptune import of latest-lexemes.nt.bz2 processed 163_715_491 records in 2142 seconds, with CPU utilization hovering around 50% rather than peaking. EC2 imports come close to the speed of an i7-8700 desktop gaming computer with 64 GB of RAM and an attached NVMe, but so far have not shown a clear speed advantage; they mostly confirm that NVMe disks and faster CPUs both matter for import speed, which is unsurprising but worth validating. A rough throughput estimate from these figures is sketched below.
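To put the raw numbers in perspective, a quick back-of-the-envelope throughput calculation (the elapsed time for the still-running latest-all import is approximate):

```python
# Rough import throughput from the numbers reported above.
# The latest-all elapsed time is approximate (the import was still ongoing).
latest_all_records = 7_750_230_000
latest_all_elapsed_s = 23 * 3600 + 25 * 60      # ~23h25m (2:21 PM CT Feb 26 to 1:46 PM CT Feb 27)

lexemes_records = 163_715_491
lexemes_elapsed_s = 2142

print(f"latest-all (ongoing): {latest_all_records / latest_all_elapsed_s:,.0f} records/s")  # ~92,000/s
print(f"latest-lexemes:       {lexemes_records / lexemes_elapsed_s:,.0f} records/s")        # ~76,000/s
```

By these numbers the serverless latest-all run is moving roughly 20% more records per second than the lexemes run, though the two dumps differ in shape, so this is only a rough comparison.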
Search Update Pipeline
- Enabled 100% of writes via the new SUP for Cloudelastic - https://phabricator.wikimedia.org/T358518
Improve multilingual zero-results rate
- Built my regression test set for the dotted I (İ) fix task (https://phabricator.wikimedia.org/T358495) and did a quick test. The fix only does good things as long as we keep it away from languages that actually use dotted İ and dotless ı (İ/i and I/ı; this font is terrible at showing the difference!). The next step is configuring it efficiently everywhere, while looking at removing it from configs that use icu_folding (which makes it redundant) and figuring out how best to handle İ/i and I/ı lowercasing (Turkish lowercasing or a quick mapping?) for the languages that need it; see the sketch below for the casing behavior in question.
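For reference, a tiny Python illustration of the casing behavior behind the "Turkish lowercasing or a quick mapping?" question. This is not the CirrusSearch/Elasticsearch analysis config, just the Unicode behavior we have to account for, with a hypothetical turkish_lower() helper standing in for the "quick mapping" option:

```python
# Why generic lowercasing misbehaves for dotted İ / dotless ı.
print('İ'.lower())   # 'i̇' -> 'i' + U+0307 combining dot above (two code points)
print('I'.lower())   # 'i'  -> wrong for Turkish, which expects 'ı'

# "Quick mapping" option: map the two problem characters before generic lowercasing.
TURKISH_CASE_MAP = str.maketrans({'İ': 'i', 'I': 'ı'})

def turkish_lower(text: str) -> str:
    return text.translate(TURKISH_CASE_MAP).lower()

print(turkish_lower('İstanbul'))   # 'istanbul'
print(turkish_lower('ISPARTA'))    # 'ısparta'
```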