Search Platform/Accountability
High level
- Search expertise: We understand how Search works. If you want to change something around Search, from simple UI changes to complex new features, please talk to us first. Search is more complex than it might appear, with ramifications for performance, data collection, and metrics.
- Wikidata Query Service (WDQS): We operate the Wikidata Query Service. We know how it works, its strengths and its weaknesses. We are the technical owners of the service; the overall Wikidata product vision is owned by WMDE.
- Wikimedia Commons Query Service (WCQS): As with WDQS, we own the technical operations of WCQS. The product vision is not owned by anyone at the moment.
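Both query services are SPARQL endpoints. As a minimal sketch of what "operating the service" serves, here is how a client might build a request against the public WDQS endpoint; the query itself is illustrative, and the `format=json` parameter asks for JSON results:

```python
from urllib.parse import urlencode

# Public WDQS SPARQL endpoint; WCQS exposes an analogous endpoint for Commons data.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: five items that are instances of (P31) human (Q5).
SPARQL = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . } LIMIT 5"

def build_request_url(endpoint: str, query: str) -> str:
    """Build a GET URL asking the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})

url = build_request_url(WDQS_ENDPOINT, SPARQL)
```

Requests like this, multiplied across all public users, are the traffic the wdqs*/wcqs* clusters below absorb.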
Technical components
Search
Servers
- elastic*: production Elasticsearch cluster, the backend for Search. Incidentally, it also hosts the indices for APIFeatureUsage and Toolhub.
- relforge*: non-production hosts, used to validate relevance work.
- cloudelastic*: exposes a copy of the Search indices for use from Toolforge / WMCS. The main use case is https://global-search.toolforge.org/.
- searchloader*: hosts the Mjolnir daemon that transfers data between the production Search clusters and the analytics network.
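For context on what these clusters serve: Elasticsearch answers full-text queries posted as JSON request bodies. The body below is an illustrative sketch only; the real CirrusSearch queries are far more involved (per-field weights, rescoring, language-specific analysis chains), and the field names here are assumptions:

```python
import json

# Illustrative Elasticsearch request body; field names ("text", "title",
# "namespace") are assumptions, not the actual CirrusSearch mapping.
search_body = {
    "query": {"match": {"text": "search platform"}},
    "size": 10,
    "_source": ["title", "namespace"],
}

# Elasticsearch's HTTP search API takes this body as JSON,
# e.g. POST /<index>/_search.
payload = json.dumps(search_body)
```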
Other components
- Analytics jobs: we run a number of jobs to populate the Search indices. We are responsible for the jobs themselves and for deploying them, but not for the underlying infrastructure (Airflow, Hadoop, etc.).
- Search Update Pipeline: all data ingestion into the Search indices, including page mutations (creations, edits, deletions) and document enrichment (ORES topics, add a link, image recommendations). The Search Platform team is also responsible for defining the format in which additional data sources are ingested into the update pipeline.
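To make "defining the format" concrete, here is a hypothetical sketch of the kind of page-mutation record such a pipeline consumes. The class, field names, and values are illustrative assumptions, not the pipeline's actual schema:

```python
from dataclasses import dataclass, asdict

# Hypothetical record shape, for illustration only; the real update pipeline
# defines its own schemas for page mutations and enrichment data.
@dataclass
class PageChangeEvent:
    wiki: str
    page_id: int
    change_type: str  # e.g. "create", "edit", "delete"
    rev_id: int

event = PageChangeEvent(wiki="enwiki", page_id=12345, change_type="edit", rev_id=999)
record = asdict(event)  # serializable form, ready to hand to an ingestion step
```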
Query Services
Servers
- wdqs*: Wikidata Query Service (both internal and public facing clusters + 2 test servers)
- wcqs*: Wikimedia Commons Query Service
Other components
- Update Pipeline: We operate an update pipeline based on Flink, running on the Wikikube Kubernetes cluster. We are accountable for the update pipeline itself, but the underlying Flink and Kubernetes infrastructure are owned by other teams.
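As a conceptual sketch of the core step such a streaming RDF updater performs: for each new revision of an entity, work out which triples to delete and which to insert, then apply that diff to the triple store. The triples below are illustrative, not the pipeline's actual data model:

```python
# Triples before and after an (illustrative) edit to an entity.
old_triples = {
    ("wd:Q42", "rdfs:label", '"Douglas Adams"@en'),
    ("wd:Q42", "wdt:P31", "wd:Q5"),
}
new_triples = {
    ("wd:Q42", "rdfs:label", '"Douglas Adams"@en'),
    ("wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wdt:P106", "wd:Q36180"),  # occupation (P106): writer
}

# The diff is what gets applied to the triple store.
to_delete = old_triples - new_triples
to_insert = new_triples - old_triples
```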
Misc
For historical reasons, the Search Platform team owns a few components that are not strictly related to its mission:
Servers
- apifeatureusage*: Logstash servers used to route APIFeatureUsage traffic to Elasticsearch indices.
Other components
- Geodata: MediaWiki extension.
Code repositories
- https://gerrit.wikimedia.org/g/operations/software/elasticsearch/madvise: Small utility to configure OS-level read cache sizes for Elasticsearch processes.
- https://gerrit.wikimedia.org/g/operations/software/elasticsearch/plugins: Debian package bundling the Elasticsearch plugins needed for Search.
- https://gitlab.wikimedia.org/repos/search-platform/mjolnir: MjoLniR is a library for handling the backend data processing for Machine Learned Ranking at Wikimedia. It is specialized to how click logs are stored at Wikimedia and provides functionality to transform the source click logs into ML models for ranking in Elasticsearch.
- https://gerrit.wikimedia.org/g/search/MjoLniR/deploy: scap3 deployment configuration for MjoLniR data pipeline.
- https://gerrit.wikimedia.org/g/search/cirrus-streaming-updater: Flink based update pipeline for Search.
- https://gerrit.wikimedia.org/g/search/es-load-test: Utilities to load-test Elasticsearch (obsolete).
- https://gerrit.wikimedia.org/g/search/extra: Extra queries, filters, native scripts, score functions, and anything else we end up creating to make search nice for Wikimedia. Apache licensed.
- https://gerrit.wikimedia.org/g/search/extra-analysis: This is a collection of GNU General Public License (GPL) Elasticsearch analysis plugins (currently at n = 1) built around other GPL-licensed open-source morphological analysis software (e.g., stemmers and such).
- https://gerrit.wikimedia.org/g/search/glent: Query suggestions.
- https://gerrit.wikimedia.org/g/search/highlighter: Text highlighter for Java designed to be pluggable enough for easy experimentation. The idea is that it should be possible to play with how hits are weighted or how they are grouped into snippets without knowing the guts of Lucene or Elasticsearch.
- https://gerrit.wikimedia.org/g/wikidata/query/flink-rdf-streaming-updater: This repo contains the configuration for a Flink Session Cluster. The blubber file produces a docker image that can be deployed in the WMF Kubernetes cluster. Changes in this repo trigger the WMF pipeline to build a new docker image which is stored in the WMF Docker registry.
- https://gerrit.wikimedia.org/g/wikidata/query/rdf: W[CD]QS and supporting code.
- https://gerrit.wikimedia.org/g/wikidata/query/LDFServer: LDF endpoint deployed on WDQS. Fork of https://github.com/LinkedDataFragments/Server.Java
- https://gerrit.wikimedia.org/g/wikidata/query/blazegraph: Fork of https://github.com/blazegraph/database with various fixes for the WMF use case.
- https://gerrit.wikimedia.org/g/wikidata/query/deploy: deployment repository for W[CD]QS.
- https://gerrit.wikimedia.org/g/wikidata/query/flink-swift-plugin: This project is archived. See https://phabricator.wikimedia.org/T314273.
- TBD: CirrusSearch and other MediaWiki extensions.