Search/WeightedTags

From Wikitech

Definition

CirrusSearch provides a way to store structured data in the indices powering full-text search on the wikis. This feature is useful when the data meets the following criteria:

  • It is not owned or controlled by MediaWiki but can be attached or attributed to a page
  • It is too expensive to compute synchronously during the MediaWiki update process
  • It is structured: when searching, the user or process knows exactly what to search for (codes, IDs, not natural language)
  • It is relatively stable: only a small portion of the wiki pages should need to be reindexed hourly (please ask when in doubt)
  • It can be lost: CirrusSearch is not a primary datastore, so the data must be retrievable from somewhere else

Adding new data

The CirrusSearch data-pipeline (Discovery/Analytics) running in the analytics cluster can be used to process data and push it to the search indices. At a high level, the process is:

  • A process produces data to an EventPlatform stream
  • The CirrusSearch data-pipeline running in the analytics cluster will:
    • consume these streams hourly
    • join the updates from different streams related to the same document together
    • push this data back to the production elasticsearch indices serving search on the wikis
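
The join step above can be illustrated with a minimal sketch. This is not the pipeline's actual code: the event shape (the keys wiki, page_id and tags) and the function name join_updates are hypothetical, chosen only to show how updates from different streams that concern the same document could be grouped into one update per page before being pushed.

```python
from collections import defaultdict


def join_updates(events):
    """Group updates by (wiki, page_id) and merge their tag payloads.

    Hypothetical event shape: {"wiki": str, "page_id": int, "tags": {prefix: {tag: score}}}.
    """
    joined = defaultdict(dict)
    for event in events:
        key = (event["wiki"], event["page_id"])
        # Assumption: if two events carry the same prefix, the later one wins.
        joined[key].update(event["tags"])
    return dict(joined)


# One hour of updates coming from two different (hypothetical) streams.
hourly_batch = [
    {"wiki": "cswiki", "page_id": 42,
     "tags": {"classification.ores.articletopic": {"STEM.STEM*": 926}}},
    {"wiki": "cswiki", "page_id": 42,
     "tags": {"recommendation.link": {"exists": 1}}},
    {"wiki": "enwiki", "page_id": 7,
     "tags": {"recommendation.link": {"exists": 1}}},
]

bulk_updates = join_updates(hourly_batch)
for (wiki, page_id), tags in bulk_updates.items():
    print(wiki, page_id, sorted(tags))
```

Joining first means each page receives a single index update per run, even when several producers touched it during the hour.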

Producing the data using the Event Platform

CirrusSearch requires at least the following information to update the search index:

  • the wiki database name
  • the page id (and the revision id if possible)
  • the namespace of the page
  • the payload (the data to store)

The Event Platform provides all the necessary tools to design and produce such events.
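
As an illustration of the minimum information listed above, a producer could check an event before emitting it. The field names below (wiki, page_id, rev_id, namespace, payload) are hypothetical and do not reflect a real Event Platform schema; the sketch only mirrors the list of required items.

```python
# Hypothetical field names mirroring the required information listed above.
REQUIRED_FIELDS = ("wiki", "page_id", "namespace", "payload")


def missing_fields(event):
    """Return the required fields that are absent from the event."""
    # Compare against None rather than truthiness: namespace 0 is a valid value.
    return [field for field in REQUIRED_FIELDS if event.get(field) is None]


complete_event = {
    "wiki": "cswiki",
    "page_id": 42,
    "rev_id": 1001,  # optional, included when available
    "namespace": 0,
    "payload": {"mytagprefix": {"one_tag": 1}},
}
incomplete_event = {"wiki": "cswiki", "page_id": 42}

print(missing_fields(complete_event))
print(missing_fields(incomplete_event))
```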

Example 1: using MediaWiki and the WeightedTagsUpdater service

Setting tags on a page is straightforward:

use CirrusSearch\WeightedTagsUpdater;
use MediaWiki\MediaWikiServices;
use MediaWiki\Page\ProperPageIdentity;

/**
 * Populate the weighted tags for $pageIdentity with three tags "one_tag", "second_tag" and
 * "third_tag" under the prefix "mytagprefix".
 *
 * @param ProperPageIdentity $pageIdentity
 * @return void
 */
public function update( ProperPageIdentity $pageIdentity ) {
	/** @var WeightedTagsUpdater $updater */
	$updater = MediaWikiServices::getInstance()->getService( WeightedTagsUpdater::SERVICE );
	// hint to tell CirrusSearch if the update relates to a new revision on $pageIdentity, in
	// such case CirrusSearch will attempt to join multiple updates related to this same page.
	$trigger = "revision";
	$updater->updateWeightedTags(
		$pageIdentity,
		"mytagprefix",
		[
			"one_tag" => 1, // the value is an optional weight from 1 to 1000
			"second_tag" => 3,
			"third_tag" => null,
		],
		$trigger // optional hint to indicate the reason this tag update happened
	);
}

Clearing the tags follows a very similar process:

use CirrusSearch\WeightedTagsUpdater;
use MediaWiki\MediaWikiServices;
use MediaWiki\Page\ProperPageIdentity;

/**
 * Clear all the tags prefixed with "mytagprefix" for $pageIdentity
 * @param ProperPageIdentity $pageIdentity
 * @return void
 */
public function clear( ProperPageIdentity $pageIdentity ) {
	/** @var WeightedTagsUpdater $updater */
	$updater = MediaWikiServices::getInstance()->getService( WeightedTagsUpdater::SERVICE );
	// hint to tell CirrusSearch if the update relates to a new revision on $pageIdentity, in
	// such case CirrusSearch will attempt to join multiple updates related to this same page.
	$trigger = "revision";
	$updater->resetWeightedTags( $pageIdentity, "mytagprefix", $trigger );
}

Calls to updateWeightedTags and resetWeightedTags might make a synchronous request to EventGate, so it is strongly advised to make them in a post-send deferred update (DeferredUpdates::POSTSEND).

Example 2: using Changeprop with ORES article/draft topic

This data is produced to the mediawiki.revision-score stream and conforms to the mediawiki/revision/score schema. A custom Changeprop processor ships the data.

CirrusSearch Update Pipeline

Since phab:T366253, there is a new, consolidated stream for adding and removing weighted tags: mediawiki.cirrussearch.page_weighted_tags_change.rc0.

A single event can set and/or clear weighted tags. Any tag listed under set will be merged with the existing ones. Any prefix under clear will clear all tags under that prefix.

Resetting the data from MediaWiki

In some scenarios, tags might have to be deleted or reset after a user action. In the recommendation use case, for example, when a user rejects a recommendation or makes an edit after one is presented to them, the state of the tag for this page must be reset to avoid suggesting the same page again.

CirrusSearch provides a function that can be called from your process to do this:

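use CirrusSearch\CirrusSearch;
use MediaWiki\MediaWikiServices;
use MediaWiki\Title\Title;
use Wikimedia\Assert\Assert;

$engine = MediaWikiServices::getInstance()->getSearchEngineFactory()->create();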
Assert::precondition( $engine instanceof CirrusSearch, "CirrusSearch must be the default search engine" );
/** @var CirrusSearch $engine */
$pageToUpdate = Title::newFromText( 'Target Page' )->toPageIdentity();
// Schedules an asynchronous update to reset all tags under "my-custom-tag-family"
$engine->resetWeightedTags( $pageToUpdate, 'my-custom-tag-family' );

This resets all tags under the my-custom-tag-family prefix for the page Target Page by sending an asynchronous update request (near real time) to the search index.

Querying the data

Shape of the data in elasticsearch

The data is stored in the index document as an array of strings, where each entry represents a tag. In the elasticsearch source document, each tag has the shape tag_prefix/tag|score:

  • tag_prefix is the family or category of the tag
  • tag is the identifying value of the tag; beware that no text analysis is performed on this data, so matching is case sensitive
  • score is an optional integer score (1 to 1000) that is encoded as the term frequency of the indexed token

Here is an example taken from the Czech Wikipedia:

{
  "weighted_tags": [
    "classification.ores.articletopic/STEM.Libraries & Information|699",
    "classification.ores.articletopic/STEM.STEM*|926",
    "classification.ores.articletopic/Culture.Media.Software|566",
    "recommendation.link/exists|1"
  ]
}

Which can be broken up as:

  • Family classification.ores.articletopic
    • tag STEM.Libraries & Information with a score of 699
    • tag STEM.STEM*, score 926
    • tag Culture.Media.Software, score 566
  • Family recommendation.link
    • tag exists, score of 1
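
The breakdown above can be reproduced with a small parser. This is a sketch for illustration only, not code from CirrusSearch: the names WeightedTag and parse_weighted_tag are hypothetical, and it assumes the prefix never contains "/" and that a trailing "|" always introduces an integer score.

```python
from typing import NamedTuple, Optional


class WeightedTag(NamedTuple):
    prefix: str
    tag: str
    score: Optional[int]


def parse_weighted_tag(raw: str) -> WeightedTag:
    """Split a 'tag_prefix/tag|score' string into its parts."""
    # Assumption: the prefix ends at the first "/".
    prefix, _, rest = raw.partition("/")
    # The score is optional; when present it follows the last "|".
    tag, sep, score = rest.rpartition("|")
    if not sep:
        return WeightedTag(prefix, rest, None)
    return WeightedTag(prefix, tag, int(score))


doc = {
    "weighted_tags": [
        "classification.ores.articletopic/STEM.Libraries & Information|699",
        "classification.ores.articletopic/STEM.STEM*|926",
        "classification.ores.articletopic/Culture.Media.Software|566",
        "recommendation.link/exists|1",
    ]
}

# Group the parsed tags by family, as in the breakdown above.
by_family = {}
for raw in doc["weighted_tags"]:
    parsed = parse_weighted_tag(raw)
    by_family.setdefault(parsed.prefix, {})[parsed.tag] = parsed.score

for family, tags in by_family.items():
    print(family, tags)
```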

Querying the tags

Tags must be searched with an elasticsearch match query on the weighted_tags field, using the full tag structure tag-family/tag-value without the |score part, which is only read at index time:

{
  "match": {
    "weighted_tags": {
      "query": "recommendation.link/exists"
    }
  }
}

This finds all pages matching the tag. The score of the match query for this tag is 0.0001 (the score given at index time, here 1, is multiplied by 0.0001 to obtain a number between 0 and 1). Since the provided score is encoded as the term frequency, the term_freq query can also be used to filter on it:

{
    "term_freq": {
        "field": "weighted_tags",
        "term": "classification.ores.articletopic/STEM.STEM*",
        "gte": 900
    }
}

This finds pages for which the STEM.STEM* topic has a score greater than or equal to 900.
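
For callers that assemble these queries programmatically, the two query bodies shown above can be built with small helpers. This is a minimal sketch: the helper names are hypothetical, the generated structures simply mirror the JSON shown above, and wrapping them in a bool filter is one possible way to combine them, not something prescribed by CirrusSearch.

```python
import json

# Factor applied to the indexed score when scoring a match query (from the text above).
SCORE_FACTOR = 0.0001


def match_tag_query(prefix, tag):
    """Match query on the full tag, without the |score part."""
    return {"match": {"weighted_tags": {"query": f"{prefix}/{tag}"}}}


def term_freq_query(prefix, tag, gte):
    """term_freq query keeping pages whose indexed score is >= gte."""
    return {
        "term_freq": {
            "field": "weighted_tags",
            "term": f"{prefix}/{tag}",
            "gte": gte,
        }
    }


def match_score(indexed_score):
    """Score contributed by a match query for a tag indexed with this score."""
    return indexed_score * SCORE_FACTOR


# Pages that have a link recommendation and a STEM.STEM* score of at least 900.
combined = {
    "bool": {
        "filter": [
            match_tag_query("recommendation.link", "exists"),
            term_freq_query("classification.ores.articletopic", "STEM.STEM*", 900),
        ]
    }
}
print(json.dumps(combined, indent=2))
```

Placing both clauses under filter keeps them from affecting relevance scoring, which suits tags that are used purely as selection criteria.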

Within CirrusSearch, a filtering keyword can be added to allow users and bots to filter pages that match a particular tag; for instance, see HasRecommendationFeature.php, the code behind the hasrecommendation: search keyword. This is useful for combining tag filtering with other criteria indexed by CirrusSearch (e.g. categories, templates, text).

If you own a custom fulltext query builder (e.g. MediaSearch, WikibaseCirrusSearch) the weighted_tags field can be used too.

Known tag families

family | owner[1] | known users[2] | Event stream | hive table | usage in search
--- | --- | --- | --- | --- | ---
classification.ores.articletopic | ML | Growth | mediawiki.revision.score | N/A | keyword articletopic:
classification.ores.drafttopic | ML | N/A | mediawiki.revision.score | N/A | keyword drafttopic:
recommendation.link | Growth | Growth | mediawiki.revision.recommendation-create | N/A | keyword hasrecommendation:
recommendation.image | SDAW | Growth | N/A | analytics_platform_eng.image_suggestions_search_index_delta | keyword hasrecommendation:
image.linked.from.wikidata.p18 | SDAW | SDAW | N/A | analytics_platform_eng.image_suggestions_search_index_delta | keyword custommatch:depicts_or_linked_from= and when searching the File namespace on any wiki with the WikibaseMediaInfo extension enabled (atm that's commons)
image.linked.from.wikidata.p373 | SDAW | SDAW | N/A | analytics_platform_eng.image_suggestions_search_index_delta |
image.linked.from.wikipedia.lead_image | SDAW | SDAW | N/A | analytics_platform_eng.image_suggestions_search_index_delta |

  1. Team responsible for producing the data
  2. Known teams relying on this data in the search index for their product