Ongoing work

WDQS Graph Split

Working on adapting the SPARQL query logging to the split graph endpoints - https://phabricator.wikimedia.org/T376134

I think I've optimized the Kuromoji settings to fix the known problems that don't require a native speaker to understand. I'm working on a tool to parse example sentences and queries from Japanese Wikipedia and Wiktionary. And because I want to refresh some skills, I'm doing it in Python. (It's going, but it's a little slow.) I'm also comparing the ICU tokenizer vs the Kuromoji tokenizer in case that can solve other problems once speakers are involved. - https://phabricator.wikimedia.org/T318269
- I hope to be done with the tokenizing and prepping the text for review tomorrow (Friday), and I'll make the initial request for help from native speakers within the Foundation, and hopefully have instructions to send out to them on Monday.
- Fun fact: Python uses lowercase \u for escapes with 4 hex digits: \u2620 (☠) and uppercase \U for escapes with 8 hex digits: \U0001F480 (💀). Both name a Unicode code point, not a UTF-8 or UTF-16 encoding. This solves the problem of writing high surrogates and low surrogates by hand, but creates a problem for the programmer who doesn't know about it. (Of course I know him, he's me!)
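A quick sketch of the escape behavior from the fun fact above, assuming Python 3. The helper name `utf16_units` is made up for illustration and is not part of the tool; it only shows which 16-bit code units each character needs in UTF-16.

```python
# "\u" takes exactly 4 hex digits, "\U" takes exactly 8 hex digits.
# Both name a Unicode code point; neither is tied to UTF-8 or UTF-16.
skull_and_crossbones = "\u2620"   # U+2620, inside the Basic Multilingual Plane
skull = "\U0001F480"              # U+1F480, outside the BMP


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as 4-digit hex strings (illustrative helper)."""
    data = text.encode("utf-16-be")
    return [f"{int.from_bytes(data[i:i + 2], 'big'):04X}" for i in range(0, len(data), 2)]


# In Python 3 each of these is a single character (one code point).
print(len(skull_and_crossbones), len(skull))

# Only the second one needs a surrogate pair when encoded as UTF-16.
print(utf16_units(skull_and_crossbones))
print(utf16_units(skull))

# The trap: "\u" stops after 4 hex digits, so this is U+1F48 followed by a literal "0".
pitfall = "\u1F480"
print(len(pitfall), hex(ord(pitfall[0])), pitfall[1])
```

The last three lines are the part that bit me: writing a 5-digit code point after a lowercase \u silently yields two characters instead of raising an error.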

Finished the work to adapt the Flink jobs to Event Platform best practices. Deployment might be tedious: I switched to the page_change stream instead of the old rev-create/page-[delete|suppress] streams, so it might need a careful, manual deploy. - https://phabricator.wikimedia.org/T374918
- Still working on a few config patches to declare the stream in mw-config

Transitioned processing of weighted tags, created by Growth, to the Search Update Pipeline; no more direct writes from CirrusSearch to Elasticsearch - https://phabricator.wikimedia.org/T377150 / https://phabricator.wikimedia.org/T372904
Finished work on spark-kafka-writer (with schema validation)
- Created upstream ticket (https://issues.apache.org/jira/browse/SPARK-50160) + PR (https://github.com/apache/spark/pull/48695), but neither has seen any feedback yet