Search Platform/Weekly Updates/2024-11-01
Ongoing work
WDQS Graph Split
- Working on adapting the SPARQL query logging to the split graph endpoints - https://phabricator.wikimedia.org/T376134
Language Stuff: Kuromoji
- I think I've optimized the Kuromoji settings to fix the known problems that don't require a native speaker to understand. I'm working on a tool to parse example sentences and queries from Japanese Wikipedia and Wiktionary. And because I want to refresh some skills, I'm doing it in Python. (It's going, but it's a little slow.) I'm also comparing the ICU tokenizer vs the Kuromoji tokenizer in case that can solve other problems once speakers are involved. - https://phabricator.wikimedia.org/T318269
- I hope to be done with the tokenizing and prepping the text for review tomorrow (Friday), and I'll make the initial request for help from native speakers within the Foundation, and hopefully have instructions to send out to them on Monday.
- Fun fact: Python uses lowercase \u followed by 4 hex digits for Unicode code points up to U+FFFF: \u2620 (☠) and uppercase \U followed by 8 hex digits for any code point, including those above U+FFFF: \U0001F480 (💀). This avoids having to write high surrogates and low surrogates by hand, but creates a problem for the programmer who doesn't know about it. (Of course I know him: he's me!)
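A minimal Python sketch of the escape behavior described in the fun fact above, using only the two example characters from that bullet. The variable names are made up for illustration, and the last part shows one plausible way to trip over this (writing five hex digits after a lowercase \u), which is a guess at the kind of mistake involved rather than a record of what actually happened.

```python
# Both escapes name a Unicode code point; they differ only in how many
# hex digits follow the backslash.
skull_and_crossbones = "\u2620"  # 4 hex digits: U+2620
skull = "\U0001F480"             # 8 hex digits: U+1F480, above U+FFFF

# In Python 3 each is a single character, whatever the escape looked like.
assert skull_and_crossbones == chr(0x2620)
assert skull == chr(0x1F480)
assert len(skull) == 1

# Surrogates only show up when the string is encoded as UTF-16.
utf16 = skull.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
assert (high, low) == (0xD83D, 0xDC80)

# Possible pitfall (illustrative): \u reads exactly 4 hex digits, so
# "\u1F480" is U+1F48 followed by a literal "0", not U+1F480.
pitfall = "\u1F480"
assert len(pitfall) == 2
assert pitfall[1] == "0"
```

The pitfall string raises no error, so the mistake only becomes visible when the output contains an unexpected Greek letter and a stray digit.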
What we've accomplished
WDQS Graph Split
- New hackathon scheduled by scholia folks: https://www.wikidata.org/wiki/Wikidata:Scholia/Events/Hackathon_November_2024
- T373812 Internal federation sometimes fail with HttpConnectionOverHTTP
WDQS Expose RDF stream publicly
- Finished the work to adapt the Flink jobs to Event Platform best practices. Deployment might be tedious: I switched to the page_change stream instead of the old rev-create/page-[delete|suppress] streams, so it might need a careful, manual deploy. - https://phabricator.wikimedia.org/T374918
- Still working on a few config patches to declare the stream in mw-config
Search Update Pipeline / Weighted tags
- Transitioned processing of weighted tags, created by Growth, to Search Update Pipeline; no more direct writes from CirrusSearch to Elasticsearch - https://phabricator.wikimedia.org/T377150 / https://phabricator.wikimedia.org/T372904
- Finished work on spark-kafka-writer (with schema validation)
- Created upstream ticket (https://issues.apache.org/jira/browse/SPARK-50160) + PR (https://github.com/apache/spark/pull/48695), but neither has seen any feedback yet
Search Metrics
- T376161 Classify fulltext search abandonment: sampling - see https://superset.wikimedia.org/superset/dashboard/search/
Misc / Operations
- T378227 Investigate failed Cirrus index build services on mwmaint2002
- T374118 Datahub - ingest Hive discovery database
- T375557 Reindex all wikis to enable folding harmonization and new functionality
- T376715 TypeError: Argument 3 passed to CirrusSearch\DataSender::sendWeightedTagsUpdate() must be of the type array, null given, called in /srv/mediawiki/php-1.43.0-wmf.25/extensions/CirrusSearch/includes/Job/ElasticaWrite.php on line