Search Platform/Weekly Updates/2024-11-08
Appearance
Ongoing work
Search Update Pipeline / Weighted tags
- Finalized schema-validating wrapper for spark kafka writer (released as part of eventutilities 1.4.0). However, first runs with real data revealed limitations in handling null values: Spark data frames may hold null that have to be excluded from the JSON validated against an event-stream-schema, since they do not support excplicit nulls (via union types). - https://phabricator.wikimedia.org/T374341
- Enabled stream processing for weighted tags created by Growth scripts. Failed due to wrong output format (not compliant with schema) and had to roll back, but a fix is shipped and should roll with the next train. https://phabricator.wikimedia.org/T378983 / https://phabricator.wikimedia.org/T378664 / https://phabricator.wikimedia.org/T377150
Language Stuff: Kuromoji
- I've massaged all the Japanese tokenization data as much as I can and written up instructions for speaker evaluation. I've got one volunteer to look over the samples and evaluate the tokenization (both the Kuromoji and ICU tokenizers), and I've asked a couple more people.
- I'll keep asking around until I get at least 2 reviewers or run out of candidates. - https://phabricator.wikimedia.org/T318269
What we've accomplished
Misc / Operations
- PHP Warning: array_merge(): Expected parameter 1 to be an array, null given - https://phabricator.wikimedia.org/T377479