Search Platform/Weekly Updates/2024-11-01

From Wikitech

Ongoing work

WDQS Graph Split

Language Stuff: Kuromoji

  • I think I've optimized the Kuromoji settings to fix the known problems that don't require a native speaker to understand. I'm working on a tool to parse example sentences and queries from Japanese Wikipedia and Wiktionary. And because I want to refresh some skills, I'm doing it in Python. (It's going, but it's a little slow.) I'm also comparing the ICU tokenizer vs the Kuromoji tokenizer in case that can solve other problems once speakers are involved. - https://phabricator.wikimedia.org/T318269
    • I hope to be done with the tokenizing and prepping the text for review tomorrow (Friday), and I'll make the initial request for help from native speakers within the Foundation, and hopefully have instructions to send out to them on Monday.
    • Fun fact: Python uses lowercase \u for 4-hex-digit Unicode code points: \u2620 (☠) and uppercase \U for 8-hex-digit code points: \U0001F480 (💀). (The digits are the code point itself, not a UTF-8 or UTF-16 encoding.) This avoids having to write high and low surrogates by hand, but creates a problem for the programmer who doesn't know about it. (Of course I know him. He's me!)
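
A minimal sketch of the \u vs \U behavior described above, using only the Python standard library. The helper name utf16_units is hypothetical and exists only for illustration; it is not part of the tool mentioned in this update.

```python
# Lowercase \u takes exactly 4 hex digits; uppercase \U takes exactly 8.
skull_and_crossbones = "\u2620"   # U+2620
skull = "\U0001F480"              # U+1F480, outside the Basic Multilingual Plane

# The trap: "\u1F480" is read as U+1F48 followed by a literal "0",
# not as U+1F480.
mistaken = "\u1F480"


def utf16_units(text):
    """Return the UTF-16 code units of text as 4-digit hex strings.

    Hypothetical helper, for illustration only.
    """
    data = text.encode("utf-16-be")
    return [
        f"{int.from_bytes(data[i:i + 2], 'big'):04X}"
        for i in range(0, len(data), 2)
    ]


if __name__ == "__main__":
    print(len(skull), hex(ord(skull)))    # one code point in a Python str
    print(utf16_units(skull))             # surrogate pair appears only when encoding
    print(len(mistaken), [hex(ord(c)) for c in mistaken])
```

A Python str holds code points, so the surrogate pair only shows up when the text is encoded as UTF-16.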

What we've accomplished

WDQS Graph Split

WDQS Expose RDF stream publicly

  • Finished adapting the Flink jobs to Event Platform best practices. Deployment might be tedious: I switched to the page_change stream instead of the old rev-create/page-[delete|suppress] streams, so it may need a careful, manual deploy. - https://phabricator.wikimedia.org/T374918
    • Still working on a few config patches to declare the stream in mw-config.

Search Update Pipeline / Weighted tags

Search Metrics

Misc / Operations