Search Platform/Weekly Updates/2024-05-10
Summary
Lots of public holidays this week across the team.
Work is ramping down on Search Update Pipeline and ramping up on the implementation of the WDQS Graph Split.
What we've accomplished
WDQS graph splitting
- A Signpost update on the graph split and the size of the WDQS graph is drafted and forthcoming - https://en.wikipedia.org/wiki/User:Bluerasberry/signpost_wikicite
- Benchmarking of the CPU governor and BlazeGraph configuration variable is complete. Loading the scholarly graph took 2.15 days with this configuration, compared to 5.875 days with the original configuration; the "main" (non-scholarly) graph would be slightly slower, but in the same ballpark of 2-3 days. Behavior on another computer suggests that an NVMe would likely yield a further performance gain, but it isn't possible right now to install one to replicate this. The gain already achieved provides a bigger buffer for keeping imports timely. https://phabricator.wikimedia.org/T362920
- Adapted the import_ttl DAG to always do the graph split and the parquet -> n3 transformation - https://phabricator.wikimedia.org/T362060
- Working on adapting the data-reload cookbook to source its data from HDFS - https://phabricator.wikimedia.org/T349069
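For reference, the scholarly graph load times reported in the benchmarking item above work out to roughly a 2.7x speedup. A minimal sketch of that arithmetic follows; the helper name `load_time_summary` is hypothetical and exists only for this illustration, and the two inputs are the figures quoted above.

```python
def load_time_summary(original_days: float, new_days: float) -> dict:
    """Compare two load durations (in days) and return basic ratios.

    Hypothetical helper for illustration; not part of any WDQS tooling.
    """
    if original_days <= 0 or new_days <= 0:
        raise ValueError("durations must be positive")
    saved = original_days - new_days
    return {
        "speedup": original_days / new_days,           # how many times faster
        "days_saved": saved,                           # absolute time saved
        "reduction_pct": 100 * saved / original_days,  # relative time saved
    }


# Figures from the benchmark item: 5.875 days originally, 2.15 days now.
summary = load_time_summary(5.875, 2.15)
print(
    f"{summary['speedup']:.2f}x faster, "
    f"{summary['days_saved']:.3f} days saved, "
    f"{summary['reduction_pct']:.1f}% less load time"
)
```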
Search Metrics
- Notebooks posted at https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/tree/main/cirrus covering a range of behavior on search-powered endpoints. https://phabricator.wikimedia.org/T358345
Search Update Pipeline
- Quick hack on the SUP producer to work around what seems to be a Flink bug - https://issues.apache.org/jira/browse/FLINK-22425
Operations
- WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508
- WDQS lag propagation to Wikidata not working as intended - https://phabricator.wikimedia.org/T360993