Search Platform/Weekly Updates/2023-07-14
Summary
Happy Bastille Day!
What we've accomplished
Search Update Pipeline
- Met with the Service Ops, Data Engineering, and Event Platform teams and decided to use ZooKeeper for Flink HA.
- Stopped sending obviously broken queries to Elasticsearch. We did this because an automated client was shipping the same bad query over and over and "massively" (2.5 q/s) polluting our rejected metrics.
- Gathered some numbers on the local_sites_with_dupe flag: this array appears to contain mostly false positives (70% for enwiki). Discussed another approach that does not require any work at index time. Ideally, this should help us avoid using this feature in the new pipeline design.
- Squashed a bug related to how we cache Cirrus docs in memcached.
Improve Multilingual Zero-Results Rate
Misc
- Trey (our favorite computational linguist) presented analyzer unpacking at the Tech & Product Meeting. It was both informative and entertaining! Selina has requested a brown bag on this topic.
- Chugging through word_break_helper candidates and looking at word_break_helper's interference with Korean/Nori and Chinese/SmartCN tokenizers.